
My CPU has 8 cores (it is an Intel Core i7-2600 at 3.40 GHz). When I solve a linear matrix equation with LinearSolve for large matrices, Mathematica uses only 4 cores (CPU usage stays at 50%), which suggests a problem in the parallel computation options. When I try to Minimize a huge function of several variables, it is even worse: Mathematica uses only one core (CPU usage is about 12%)!

I am not very familiar with parallel computation, so I would appreciate any help. How can I use the full capacity of the CPU (100% usage) when I run LinearSolve on very large matrices and Minimize on huge functions, so that the computation time on my machine is as short as possible?

Thank you very much.

**Edit 1:**

On my computer, this analysis:

t = AbsoluteTime[];
primelist = Table[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t   

yields a CPU load of 12% and time2 = 43.37, while breaking the same analysis into 8 parallel jobs:

t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 2500000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 2500001, 5000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 5000001, 7500000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 7500001, 10000000}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 10000001, 12500000}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 12500001, 15000000}]];
job7 = ParallelSubmit[Table[Prime[k], {k, 15000001, 17500000}]];
job8 = ParallelSubmit[Table[Prime[k], {k, 17500001, 20000000}]];
{a1, a2, a3, a4, a5, a6, a7, a8} = 
 WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t

yields a 100% CPU load and time2 = 17.16.

**Edit 2:**

To make it completely clear what is happening on my computer and what my problem is, please have a look at the following examples:

First, checking the number of processors and kernels on my machine:

$ProcessorCount
$KernelCount

The results are 4 and 8, respectively, on my machine.

Now, to see how the MKL behaves on my machine, and how well the "CPU usage" reported by the system monitor corresponds to actual performance, I can run this in Mathematica:

Clear["Global`*"];
a = RandomReal[{1, 2}, {20000, 20000}];
b = RandomReal[{1}, {20000}];
Table[
 SetSystemOptions["MKLThreads" -> i];
 Print["Case=", i];
 Print[SystemOptions["MKLThreads"]];
 t = AbsoluteTime[];
 LinearSolve[a, b];
 time2 = AbsoluteTime[] - t;
 Print["t(", i, ")=", time2];
 Print["******"],
 {i, 4}];

The results, including the number of MKL threads and the computation time for each case, are shown below:

Case=1
{MKLThreads->1}
t(1)=202.9560000
******
Case=2
{MKLThreads->2}
t(2)=120.3696000
******
Case=3
{MKLThreads->3}
t(3)=93.5532000
******
Case=4
{MKLThreads->4}
t(4)=88.5300000
******

Meanwhile, the CPU usage reported by the system monitor is 12% for Case 1, 25% for Case 2, 37% for Case 3, and 50% for Case 4. So in this case the reported "CPU usage" does correspond to actual performance: the higher the CPU usage, the shorter the computation time.
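In terms of speedup over a single MKL thread, these timings work out to roughly 202.96/120.37 ≈ 1.69× with 2 threads, 202.96/93.55 ≈ 2.17× with 3 threads, and 202.96/88.53 ≈ 2.29× with 4 threads, so the benefit of each additional thread is already tapering off.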

Now, if I increase MKLThreads in SetSystemOptions["MKLThreads" -> ?] to a value above 4 (i.e. 5 to 8), it has no effect on the computation time or the CPU usage. The same happens if I change ParallelThreadNumber in SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> ?}]: the computation time and CPU usage do not depend on ParallelThreadNumber either. You can see the cases below:

SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> 1}];
Print[SystemOptions["ParallelOptions" -> "ParallelThreadNumber"]];
SetSystemOptions["MKLThreads" -> 8]; Print[SystemOptions["MKLThreads"]];
t=AbsoluteTime[];
LinearSolve[a, b];
time2=AbsoluteTime[] - t;
Print["t=", time2];

The result is (CPU usage=50% during analysis):

{ParallelOptions->{ParallelThreadNumber->1}}
{MKLThreads->4}
t=85.3008000

And for the other case:

SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> 8}];
Print[SystemOptions["ParallelOptions" -> "ParallelThreadNumber"]];
SetSystemOptions["MKLThreads" -> 8]; Print[SystemOptions["MKLThreads"]];
t=AbsoluteTime[];
LinearSolve[a, b];
time2=AbsoluteTime[] - t;
Print["t=", time2];

The result is (Again CPU usage=50% during analysis):

{ParallelOptions->{ParallelThreadNumber->8}}
{MKLThreads->4}
t=85.3476000

As you can see, the CPU usage and the computation time do not change when I increase MKLThreads beyond 4 (e.g. 5 to 8), and they are also independent of ParallelThreadNumber.

Another interesting example concerns the case I mentioned in Edit 1. Please have a look at the following runs, with the resulting time and CPU usage for each:

1)

Clear["Global`*"];
t = AbsoluteTime[];
primelist = Table[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t

Result: time2=43.37 and CPU usage=12%

2)

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 10000000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 10000001, 20000000}]];
{a1, a2} = WaitAll[{job1, job2}];
time2 = AbsoluteTime[] - t

Result: time2=30.01 and CPU usage=25%

3)

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 6666666}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 6666667, 13333332}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 13333333, 20000000}]];
{a1, a2, a3} = WaitAll[{job1, job2, job3}];
time2 = AbsoluteTime[] - t

Result: time2=23.46 and CPU usage=37%

4)

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 5000000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 5000001, 10000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 10000001, 15000000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 15000001, 20000000}]];
{a1, a2, a3, a4} = WaitAll[{job1, job2, job3, job4}];
time2 = AbsoluteTime[] - t 

Result: time2=21.52 and CPU usage=50%

5)

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 3333333}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 3333334, 6666666}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 6666667, 9999999}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 10000000, 13333333}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 13333334, 16666666}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 16666667, 20000000}]];
{a1, a2, a3, a4, a5, a6} = WaitAll[{job1, job2, job3, job4, job5, job6}];
time2 = AbsoluteTime[] - t

Result: time2=18.28 and CPU usage=75%

6)

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 2500000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 2500001, 5000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 5000001, 7500000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 7500001, 10000000}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 10000001, 12500000}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 12500001, 15000000}]];
job7 = ParallelSubmit[Table[Prime[k], {k, 15000001, 17500000}]];
job8 = ParallelSubmit[Table[Prime[k], {k, 17500001, 20000000}]];
{a1, a2, a3, a4, a5, a6, a7, a8} = 
 WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t

Result: time2=17.16 and CPU usage=100%

But if I run this analysis with ParallelTable, interestingly, the CPU usage is 100% yet the computation time is 45.81! In other words, the computation time is about the same as in case 1, where the analysis runs with Table on one core (CPU usage = 12%).

t = AbsoluteTime[];
primelist = ParallelTable[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t

Result: time2=45.81 and CPU usage=100%
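For reference, ParallelTable also takes a Method option; forcing coarse graining, as suggested later in the comments, hands each kernel one large contiguous block, much like the manual ParallelSubmit split above (timing not recorded here):

t = AbsoluteTime[];
primelist = ParallelTable[Prime[k], {k, 1, 20000000}, Method -> "CoarsestGrained"];
time2 = AbsoluteTime[] - t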

I also checked NMinimize on my big function with 75 (or more) variables using all the methods available in Mathematica, including Automatic, DifferentialEvolution, NelderMead, RandomSearch, and SimulatedAnnealing. The computation time is roughly the same for all of them, and the CPU usage is only 12% in every case. So it appears that the minimization method chosen in NMinimize does not change the parallelization behaviour.
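A minimal sketch of that comparison, where f and vars are placeholders for the actual objective function and its variable list (not reproduced in this edit):

(* time NMinimize once per built-in method; f and vars are placeholders *)
Table[
 {m, First@AbsoluteTiming[NMinimize[f, vars, Method -> m]]},
 {m, {Automatic, "DifferentialEvolution", "NelderMead", "RandomSearch", "SimulatedAnnealing"}}]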

Now I think my setup and my problem are completely clear, so I would be very grateful if someone can help me use the full capacity of my CPU in LinearSolve and NMinimize (or Minimize). I still wonder how I can bring the CPU usage in these cases to 100%. That way we could also check whether CPU usage corresponds to actual performance for LinearSolve and NMinimize, as it did in the examples above.

Thank you very much.

**Edit 3:**

The function I am trying to minimize is a large function of many variables. Its general form is something like this:

(15 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])] + uu1[2]))^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])] -2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2] + phi2[3])) + E^(1.*10^6 (-phi2[2] + phi2[3]))] -2 uu1[2] + uu1[3])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2] + phi2[3])) + E^(1.*10^6 (-phi2[2] + phi2[3]))] -2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3] + phi2[4])) + E^(1.*10^6 (-phi2[3] + phi2[4]))] + uu1[2] - 2 uu1[3] + uu1[4])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3] + phi2[4])) + E^(1.*10^6 (-phi2[3] + phi2[4]))] - 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4] + phi2[5])) + E^(1.*10^6 (-phi2[4] + phi2[5]))] + uu1[3] - 2 uu1[4] + uu1[5])^2 + 225 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[6]) + E^(1.*10^6 phi2[6])]+ 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5] + phi2[6])) + E^(1.*10^6 (-phi2[5] + phi2[6]))] + uu1[5] - 2 uu1[6])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4] + phi2[5])) + E^(1.*10^6 (-phi2[4] + phi2[5]))] - 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5] + phi2[6])) + E^(1.*10^6 (-phi2[5] + phi2[6]))] + uu1[4] - 2 uu1[5] + uu1[6])^2 + ((15 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[6]) + E^(1.*10^6 phi2[6])] + uu1[6]))^2) + (0.00918367 phi2[2] + (-0.00175179 - 11/112 (0.007848)) (1 - (1.*10^12 (phi2[2])^2)/(Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])])^2) + ...

where uu1[i], uu3[i] and phi2[i] are the variables. The issue is that the number of variables can grow very large (for example 5000 or more), which makes the function enormous. If I cannot use the full capacity of the CPU, minimizing such a function may take days. Even a single computer running at full CPU capacity may not be enough for a problem of this size, but the first step is to learn how to configure parallelization for NMinimize (or FindRoot) on a single machine, so that it can later be extended to parallelization across several remote machines.

**Edit 4:**

An example of the complete form of the function with 75 variables is:

Pastebin link

Where variables (unknown parameters) are:

{uu1[2], uu3[2], phi2[2], uu1[3], uu3[3], phi2[3], uu1[4], uu3[4], phi2[4], uu1[5], uu3[5], phi2[5], uu1[6], uu3[6], phi2[6], uu1[7], uu3[7], phi2[7], uu1[8], uu3[8], phi2[8], uu1[9], uu3[9], phi2[9], uu1[10], uu3[10], phi2[10], uu1[11], uu3[11], phi2[11], uu1[12], uu3[12], phi2[12], uu1[13], uu3[13], phi2[13], uu1[14], uu3[14], phi2[14], uu1[15], uu3[15], phi2[15], uu1[16], uu3[16], phi2[16], uu1[17], uu3[17], phi2[17], uu1[18], uu3[18], phi2[18], uu1[19], uu3[19], phi2[19], uu1[20], uu3[20], phi2[20], uu1[21], uu3[21], phi2[21], uu1[22], uu3[22], phi2[22], uu1[23], uu3[23], phi2[23], uu1[24], uu3[24], phi2[24], P1, F1, M1, PN, FN, MN}

I know this function is extremely unstable, but its optimum value is known to be zero, so I am trying to find the parameter values that minimize the function (i.e. drive it to zero). The parameter values that make the whole function as small as possible are the best answers.

**Edit 5:**

Thank you all for your helpful comments. Based on what KAI and Oleksandr R. mentioned, as far as I understand, LinearSolve already uses the full capacity of the CPU cores when solving equations involving large matrices. Consequently, if I want to solve several linear matrix equations, the best approach should be to solve them one by one in a loop: in each step Mathematica can use the full capacity of the CPU cores, solve that equation as efficiently as possible, and move on to the next step. But if you look at the two examples below, apparently that is not the case, and parallelization pushes Mathematica to solve the LinearSolve problems in a way that appears to be more efficient and faster. You can check these examples on your own computer. Given the comments here, I wonder how to explain them.

Example 1:

Clear["Global`*"];
t = AbsoluteTime[];
NN = 8;
CC = Array[cc, NN];
For[i = 1, i < (NN + 1), i++,
  Clear[a, b];
  a = RandomReal[{i, i + 1}, {6000, 6000}];
  b = RandomReal[{i}, {6000}];
  CC[[i]] = LinearSolve[a, b];
  ];
time2 = AbsoluteTime[] - t

For Example 1, the CPU usage is 50% and time2 = 23.4.

Example 2:

Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[a1 = RandomReal[{1, 2}, {6000, 6000}]; b1 = RandomReal[{1}, {6000}]; c1 = LinearSolve[a1, b1]];
job2 = ParallelSubmit[a2 = RandomReal[{2, 3}, {6000, 6000}]; b2 = RandomReal[{2}, {6000}]; c2 = LinearSolve[a2, b2]];
job3 = ParallelSubmit[a3 = RandomReal[{3, 4}, {6000, 6000}]; b3 = RandomReal[{3}, {6000}]; c3 = LinearSolve[a3, b3]];
job4 = ParallelSubmit[a4 = RandomReal[{4, 5}, {6000, 6000}]; b4 = RandomReal[{4}, {6000}]; c4 = LinearSolve[a4, b4]];
job5 = ParallelSubmit[a5 = RandomReal[{5, 6}, {6000, 6000}]; b5 = RandomReal[{5}, {6000}]; c5 = LinearSolve[a5, b5]];
job6 = ParallelSubmit[a6 = RandomReal[{6, 7}, {6000, 6000}]; b6 = RandomReal[{6}, {6000}]; c6 = LinearSolve[a6, b6]];
job7 = ParallelSubmit[a7 = RandomReal[{7, 8}, {6000, 6000}]; b7 = RandomReal[{7}, {6000}]; c7 = LinearSolve[a7, b7]];
job8 = ParallelSubmit[a8 = RandomReal[{8, 9}, {6000, 6000}]; b8 = RandomReal[{8}, {6000}]; c8 = LinearSolve[a8, b8]];
{R1, R2, R3, R4, R5, R6, R7, R8} = WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t

For Example 2, the CPU usage is 100% and time2 = 19.8.
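One way to narrow down where Example 2's advantage comes from (the comments below attribute it to the RandomReal calls rather than to LinearSolve itself) is to generate all the matrices up front and time only the solves. A minimal sketch; the seed and the two-element vector ranges here are just illustrative choices:

(* generate all matrices and right-hand sides serially, outside the timed region;
   the seed and the {i, i + 1} vector ranges are illustrative choices *)
SeedRandom[1];
mats = Table[RandomReal[{i, i + 1}, {6000, 6000}], {i, 8}];
vecs = Table[RandomReal[{i, i + 1}, 6000], {i, 8}];
(* time only the eight solves, run one after another with MKL multithreading *)
AbsoluteTiming[MapThread[LinearSolve, {mats, vecs}];]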

mak maak
  • 4 cores rather than 8 might be a license limitation. – b.gates.you.know.what Sep 23 '12 at 07:29
  • When I check Evaluation > Parallel Kernel Configuration > Parallel, the limit by license availability is 16. – mak maak Sep 23 '12 at 11:42
  • NMinimize might not be parallelized depending on the problem you are trying to solve. Have you tried something like ParallelTable explicitly to see what load you get then? For instance (rather surprisingly) ParallelTable[Pause[20], {i, 8}] yields a load of 87% on my 2.5 GHz i7 – chris Sep 23 '12 at 13:11
  • @makmaak, do you have 8 cores or 4 plus hyperthreading? If that is the case, the MKL will only use the true number of cores, for performance reasons. –  Sep 23 '12 at 14:26
  • @ruebenko, I am not sure, but I think it is 4 cores plus hyperthreading. If that is the case, how can I use the capacity of my CPU to solve the problems I mentioned? My problem for NMinimize is a large function with 75 variables whose minimum I want to find. Mathematica can minimize this function and give me the result, but it takes too much time because it uses just one core (CPU usage = 12%). So I am trying to employ 8 cores to make the computation time as short as possible. It sounds very weird to me if that is not possible in a program like Mathematica! – mak maak Sep 23 '12 at 14:58
  • Can you divide the area over which you want to minimize into 8 parts? If so, you could try ParallelTable to start parallel evaluations in all those areas and pick the lowest result from the returned table. – Sjoerd C. de Vries Sep 23 '12 at 15:05
  • @chris, please have a look at Edit 1 in my question to see the result of the explicit analysis. If NMinimize cannot be parallelized, then how can I use the full capacity of my CPU to minimize this large function? – mak maak Sep 23 '12 at 15:30
  • @Sjoerd C. de Vries, unfortunately my problem is an Unconstrained Optimization case – mak maak Sep 23 '12 at 15:46
  • 4 cores plus SMT is still 4 cores, so you'll only suffer by running 8 threads of LinearSolve on a 4-core CPU. For other types of workload that aren't as well optimized it isn't always so clear, but you can try SetSystemOptions["ParallelOptions" -> "ParallelThreadNumber" -> 8], which I found gave a boost of about 20% in this question, which also relates to NMinimize. Don't expect miracles from SMT; "CPU usage" as reported by the system monitor usually does not correspond in any direct way to actual performance. – Oleksandr R. Sep 23 '12 at 17:51
  • Also, what minimization method are you using? Not all methods are particularly amenable to parallelization, even in principle. In those cases you are basically stuck with domain decomposition as suggested by Sjoerd. – Oleksandr R. Sep 23 '12 at 18:05
  • @mak maak just for your information you can do t = AbsoluteTime[]; primelist = ParallelTable[Prime[k], {k, 1, 20000000}]; time2 = AbsoluteTime[] - t in one command so to speak. It does it in 60 sec on my laptop. – chris Sep 23 '12 at 19:45
  • @makmaak, the fact that you only see 4 CPUs active in LinearSolve may be because the MKL detects that HT is on and does not use it. In my experience, HT may be beneficial if the tasks use different parts of the CPU, which is not the case for numerics. Since I do a lot of numerics, I usually switch HT off altogether. Concerning NMinimize, it would be good to see the problem at hand; perhaps something can be done. –  Sep 24 '12 at 02:43
  • @Oleksandr R., Please have a look at edit 2 I made for my question. It looks that "CPU usage" reported by the system monitor is not very inconsistent with actual performance. What I am looking for is just to make sure that I am using all capacity of my CPU to make computation time as short as possible. – mak maak Sep 24 '12 at 06:50
  • @chris, Please have a look at edit 2 I made for my question. It looks that the performance of the method you suggested is completely different from what I used and computation time increases too much if we use what you mentioned. You can check it on your laptop. – mak maak Sep 24 '12 at 06:52
  • Thanks for the update. Okay, things are getting complicated now, so let me try to clarify. First thing to note is that these workloads (LinearSolve, NMinimize, and tabulating primes) have very different performance characteristics, so your conclusions in each case are not transferable to the others. For LinearSolve, you actually are using all of your CPU's resources, even though the CPU monitor claims 50% usage. Tabulating primes is "embarrassingly parallel" and can benefit from SMT (Hyper-threading), so ... – Oleksandr R. Sep 24 '12 at 07:13
  • ... in that case, one can load up the CPU to 100% of its front end/scheduler capacity and see a performance gain by doing so. To understand this you need to know how and why SMT works, which is quite a technical topic. Jon Stokes has given a very good introduction that I would suggest as a starting point. Focusing on your actual problem, i.e. NMinimize, basically the current implementation is serial only, but the algorithms it uses are parallelizable in principle to some extent, which is why I asked which of the algorithms ... – Oleksandr R. Sep 24 '12 at 07:17
  • ... you're using in this case. As you saw from the link above, I've implemented an improved version of the Nelder-Mead algorithm, which (the algorithm itself, I mean) is not inherently parallel as it stands but can be adapted to benefit from parallelism. The other algorithms are inherently parallel (which means you can write a parallel implementation, not that NMinimize itself is parallelizable) but may or may not be appropriate for any given problem. Essentially the one thing that will really help here is more detail about the function you are actually minimizing. – Oleksandr R. Sep 24 '12 at 07:21
  • @OleksandrR. do you understand why ParallelTable performs more poorly than ParallelSubmit for the primes? – chris Sep 24 '12 at 07:48
  • @ruebenko MKL has its own thread scheduler, which is pretty good at making the optimal choices. I agree with what you wrote, but for non-MKL code you might like to reconsider switching HT off. In practice the situation is a lot more complicated than whether all the execution resources are used (for almost all code--even most HPC-like floating point workloads--they aren't). The HT implementation in recent CPUs is much improved versus a decade ago, and you'll now find only rare cases for which HT actually reduces performance. Very little code comes anywhere close to MKL in terms of optimization. – Oleksandr R. Sep 24 '12 at 08:11
  • @chris it's complicated... basically Prime is stateful; it matters which arguments have been passed already. You can get improved performance by blocking, either using ParallelSubmit or the Method -> "CoarsestGrained" option of ParallelTable. More on Prime here. – Oleksandr R. Sep 24 '12 at 08:16
  • @Oleksandr R., Thank you for your comments. The function I am trying to minimize using NMinimize is a very large function of 75 (or more) variables, including quadratic terms and exponential terms. One more thing: I have the same problem with the FindRoot command in Mathematica. I mean, I can either do my optimization using NMinimize (or Minimize) to find the minimum of my big function, OR solve a large system of 75 (or more) non-linear equations in 75 (or more) variables, but with FindRoot I have the same problem and cannot make the CPU usage 100%. – mak maak Sep 24 '12 at 11:14
  • @Oleksandr R., But I am still confused about what you mentioned for LinearSolve. You said I am actually using all of the CPU's resources, even though the CPU monitor claims 50% usage. But as you can see in the first example of Edit 2 in my question, increasing the CPU usage results in shorter computation time: increasing it from 12% in Case 1 to 50% in Case 4 decreases the computation time from 203 to 89. So I wonder why we cannot decrease the computation time further by making the CPU usage 100% (if that is possible)? – mak maak Sep 24 '12 at 11:20
  • @Oleksandr R., I am not sure, but maybe you mean for LinearSolve 50% usage of CPU means using all CPU resources and CPU usage less than 50% means not using all CPU resources, which still sounds confusing to me! – mak maak Sep 24 '12 at 11:23
  • Yes, that's exactly what I mean. Please read Jon Stokes's article, which explains how SMT works and why this is the case. I'm sorry if this is confusing but this is a fairly technical subject and it will be difficult to understand until you have some quite detailed knowledge about the design of modern CPUs, which it isn't practical to review here. Anyway, please let's focus on the actual problem and actual performance: a 75-dimensional function is really not that large a problem, and should not take very long to minimize. If you can remove irrelevant material about LinearSolve and ... – Oleksandr R. Sep 24 '12 at 12:15
  • ... tabulating primes, and instead focus on the performance of the minimization without getting distracted by the specific value reported for CPU usage, it will be much easier to contemplate giving you a practical answer. Please describe the actual function in as much detail as possible (or post it in its entirety) because without that information anything that can be said is pure speculation. – Oleksandr R. Sep 24 '12 at 12:17
  • @OleksandrR., I was referring to MKL related stuff only. In V8.0.3 the MKL version is <= 10.3.5. Intel itself suggests switching HT off for MKL (BLAS) related operations (see here and here). Also, to the best of my knowledge automatic selection is only available on Windows here - so IMHO optimal LinearSolve performance comes from not using HT and (maybe) setting thread affinity. –  Sep 24 '12 at 13:05
  • @Oleksandr R., Thank you for your help, please have a look at edit 3 in the question. – mak maak Sep 24 '12 at 14:25
  • Thanks for the extra information. Well, a 75-dimensional function is not a serious problem, but a 5000-dimensional one might be. Is this a convex function? If so, you can most likely use FindMinimum with the "ConjugateGradient" or "QuasiNewton" (L-BFGS) methods. If it's only slightly non-convex, try my Nelder-Mead code (not the implementation offered by NMinimize, which is much slower). If it's severely non-convex and/or possesses many local minima, you might have serious problems; differential evolution is one possible (parallelizable) option, but the algorithm is not very efficient. – Oleksandr R. Sep 24 '12 at 14:43
  • @Oleksandr R., I think it is the third case, unfortunately! Could you please explain more about what you mean by a parallelizable differential evolution method? Thanks. – mak maak Sep 24 '12 at 14:57
  • I should add that none of those options is a solution to your parallelization question, but I feel it's more important to discuss how to attack the problem before considering parallelization. An inefficient algorithm may take days to run in parallel while an efficient one (suited to the problem) might not even need to be parallelized at all to return a solution in reasonable time. – Oleksandr R. Sep 24 '12 at 15:00
  • Differential evolution is inherently parallelizable because each stage of the algorithm is quite simple and can be applied to the complete dataset in one step with no synchronization or dependency propagation necessary. This is not to say that NMinimize offers a parallel implementation of differential evolution--it doesn't, but it can be done. Could you please edit a full 75- (or more) dimensional example into your question? The 12-dimensional excerpt you supplied so far does not give any difficulties to FindMinimum, so it is difficult to determine where exactly your problems arise. – Oleksandr R. Sep 24 '12 at 15:09
  • @Oleksandr R., As I already mentioned, the problem arises when the number of variables increases! 75 is the minimum number (I mentioned 75 just to make my question not very complicated), it is usually about 5000 or more, but the general shape of the function is like what I edited. I didn't know how to copy the whole function from Mathematica notebook to this page, so just typed some terms from the beginning. – mak maak Sep 24 '12 at 15:29
  • And as I already mentioned, if you don't provide a concrete example of your problem but just allude to it, it's very difficult to see where the real issues lie! I assume you are not typing in a 5000-dimensional function by hand. If you can't supply the function directly (75-dimensional, or whatever is the smallest that gives you difficulty), can you provide the code to generate it? Otherwise, use CopyToClipboard@InputForm[expr] to copy it. – Oleksandr R. Sep 24 '12 at 16:16
  • At least part of the problem appears to be related to the fact that the numeric values in your function as well as the optimum values of your parameters are all very small. This leads to severe numerical instability, especially since FindMinimum and NMinimize start with an initial guess of 1 for the value of each parameter: you are working with huge numbers whose logs are about 1 million and looking for small differences between these, which is unlikely to work out for obvious reasons. Perhaps you will have better luck if you first improve the stability characteristics of your function. – Oleksandr R. Sep 24 '12 at 18:55
  • @Oleksandr R., Thank you for your comments. Could you please have a look at edit 4. I agree with what you said regarding instability, but I am wondering how I can make this function stable. Moreover, even if we make it stable, I think it is better to implement NMinimize for the large stable function in Mathematica using more capacity of the CPU. – mak maak Sep 26 '12 at 12:45
  • @Oleksandr R., I don't know why I cannot use your Nelder-Mead code to minimize my function. – mak maak Sep 26 '12 at 15:16
  • @Oleksandr R., Could you please also have a look at Edit 5 in my question? Based on what you mentioned regarding LinearSolve, how can we explain these examples? Thanks. – mak maak Sep 27 '12 at 04:19
  • I don't have a lot of time to answer questions at the moment as I'm trying to finish a PhD thesis, so I apologize if I can't keep up with all of your queries. On second thought, don't use my Nelder-Mead code for this; there are some minor issues for your function and anyway I don't think it will be as beneficial for this problem as it is in other cases. With regard to LinearSolve--honestly, you should split this question into several on the different (unrelated) topics you bring up--the timing difference is due to parallelization of the RandomReal calls which otherwise are single-threaded. – Oleksandr R. Sep 27 '12 at 13:38
  • As regards making your function more stable, you might start by replacing your many instances of Log[Exp[-x] + Exp[+x]] with a function that returns the same value without generating intermediate values that vary by hundreds of thousands of orders of magnitude. It will be highly preferable if you can keep all intermediates within the range limitations of machine precision numbers. – Oleksandr R. Sep 27 '12 at 13:43
  • Just FYI, such a function is (-(Abs[x]*(-1 + Tanh[8 - 4*Abs[x]])) + ((3*(25 + 4*Abs[x]^2 - 375/(15 + 4*Abs[x]^2)))/64 + Log[2])*(1 + Tanh[8 - 4*Abs[x]]))/2. This deviates from Log[Exp[-x] + Exp[+x]] by less than 1% and should suffice as a replacement. If you also scale your variables then we may be in business. By the way, do you need to find the minimum or will a value close to the minimum suffice? The problem is, I'm not sure if it's possible to do the former (or, if you do, to prove that you succeeded). The latter is difficult in a 5000-dimensional problem but not impossible. – Oleksandr R. Sep 27 '12 at 18:22
  • Oleksandr R., Thanks again for your help. When we replace Log[Exp[-x] + Exp[+x]] with a function including absolute value, like what you suggested, our target function becomes non-smooth, because of the absolute value, and, as far as I know, minimization of a non-smooth function is much more difficult. I need to find values which make the function zero, and this function absolutely has got such an answer. – mak maak Sep 28 '12 at 13:29

2 Answers


It appears that your chip only has 4 cores according to Intel. It can create 4 additional virtual cores with hyperthreading, but virtual cores and physical cores are not always equivalent. The statement that you have 8 cores is only true in some circumstances.

As your question indicates, LinearSolve uses the Intel MKL library. Intel indicates that hyperthreading will generally not be useful in this case. See the hyperthreading section at this website. This section is especially relevant:

Note: If the requested number of threads exceeds the number of physical cores (perhaps because of hyper-threading), and MKL_DYNAMIC is not changed from its default value of TRUE, Intel MKL will scale down the number of threads to the number of physical cores.

Therefore, Mathematica appears to be solving the problem in the most efficient way already, and it resists being forced to solve it in a way that would likely be less efficient.
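A quick way to see this on the machine in question is the setting already tried in Edit 2 of the question: asking for more MKL threads than there are physical cores simply falls back to the physical core count.

SetSystemOptions["MKLThreads" -> 8];  (* request 8 threads on a 4-core (8-thread) CPU *)
SystemOptions["MKLThreads"]
(* {MKLThreads -> 4}, as reported in the question's Edit 2 *)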

KAI
  • Thank you for your answer. Could you please have a look at Edit 5 in my question? Isn't it a contradiction with what you mentioned? – mak maak Sep 27 '12 at 04:17
  • I don't get a performance improvement in Example 2 of Edit 5 when I go from 4 cores to 4 physical plus 4 virtual cores. My computer uses more resources (100% vs 50% of the CPU), but the Timing of the problem is basically the same. You might want to test to make sure that you are actually getting the improvement in timing because of the additional cores and not just from using a slightly different method. It looks like the latter might be the reason for the increased speed, but I'm not sure. – KAI Sep 27 '12 at 18:20

This doesn't address the parallelization, but hopefully gives some indication of how even simple manipulation of the objective function might speed things up.

Here is (most of) the function from your "Edit 3", trimmed to remove unbalanced brackets:

f=(15 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[2])+E^(1.*10^6 phi2[2])]+uu1[2]))^2+225 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[2])+E^(1.*10^6 phi2[2])]-2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2]+phi2[3]))+E^(1.*10^6 (-phi2[2]+phi2[3]))]-2 uu1[2]+uu1[3])^2+225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2]+phi2[3]))+E^(1.*10^6 (-phi2[2]+phi2[3]))]-2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3]+phi2[4]))+E^(1.*10^6 (-phi2[3]+phi2[4]))]+uu1[2]-2 uu1[3]+uu1[4])^2+225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3]+phi2[4]))+E^(1.*10^6 (-phi2[3]+phi2[4]))]-2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4]+phi2[5]))+E^(1.*10^6 (-phi2[4]+phi2[5]))]+uu1[3]-2 uu1[4]+uu1[5])^2+225 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[6])+E^(1.*10^6 phi2[6])]+2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5]+phi2[6]))+E^(1.*10^6 (-phi2[5]+phi2[6]))]+uu1[5]-2 uu1[6])^2+225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4]+phi2[5]))+E^(1.*10^6 (-phi2[4]+phi2[5]))]-2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5]+phi2[6]))+E^(1.*10^6 (-phi2[5]+phi2[6]))]+uu1[4]-2 uu1[5]+uu1[6])^2+((15 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[6])+E^(1.*10^6 phi2[6])]+uu1[6]))^2);

Extract the variables and test the timing of NMinimize:

vars = Cases[f, _[_Integer], -1] // Union;
NMinimize[f, vars] // Timing

(*  {5.039, {6.38213*10^-14, {phi2[2] -> -1.26493*10^-10, phi2[3] -> -1.30904*10^-10, 
    phi2[4] -> 5.5286*10^-11, phi2[5] -> 1.87167*10^-11, phi2[6] -> 6.97739*10^-12, 
    uu1[2] -> 5.3047*10^-9, uu1[3] -> 4.24388*10^-9, uu1[4] -> -1.37845*10^-13, 
    uu1[5] -> -4.24391*10^-9, uu1[6] -> -5.30474*10^-9}}}  *)

Okay, that took 5 seconds. Now try a simple rescaling of the variables:

g = ExpandAll[f /. {p_phi2 :> 10.^-6 p, u_uu1 :> 10.^-5 u}];
NMinimize[g, vars] // Timing

(*  {0.109, {6.39186*10^-14, {phi2[2] -> 0.0438337, phi2[3] -> 0.0778526, 
    phi2[4] -> 0.0688122, phi2[5] -> 0.0663468, phi2[6] -> 0.0552145, 
    uu1[2] -> 0.000531695, uu1[3] -> 0.000425536, uu1[4] -> -2.53421*10^-7, 
    uu1[5] -> -0.000426044, uu1[6] -> -0.000533116}}}  *)

That only took a tenth of a second: a factor-of-50 speedup just from rescaling the variables. I don't know how that might scale with a 75- or 5000-dimensional problem, but it does suggest that you would be better off improving the objective function, as Oleksandr said, than throwing more CPU cores at it.
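If you do work with the rescaled form, note that the result refers to the scaled variables. A minimal sketch of mapping a solution back to the original variables, using the same scale factors as in the substitution above:

(* undo the rescaling: phi2 variables were scaled by 10^-6 and uu1 by 10^-5 *)
{gmin, sol} = NMinimize[g, vars];
origSol = Join[
   Cases[sol, (p_phi2 -> x_) :> (p -> 10.^-6 x)],
   Cases[sol, (u_uu1 -> x_) :> (u -> 10.^-5 x)]];
f /. origSol   (* original f evaluated at the mapped-back point; should agree with gmin *)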

Simon Woods
  • Thank you for your reply. I agree with what you mentioned: this function is extremely unstable and needs serious attention, but I am wondering how to do this. You actually changed the function completely in this example, while I need the original function to be minimized. Moreover, even after rescaling, I think that if we can use more CPU cores (when the number of variables is very large), it is better than using one core. If you would like to check the complete form of the function for 75 dimensions, please have a look at Edit 4. – mak maak Sep 26 '12 at 05:17
  • @makmaak, I'm afraid I don't know much about this sort of thing. I had a look at your full function in Edit 4, but couldn't find a good approach to improve the stability. Good luck with it... – Simon Woods Sep 30 '12 at 14:38