My CPU is an Intel Core i7-2600 (3.40 GHz), which has 4 physical cores and 8 logical cores (Hyper-Threading). When I try to solve a linear matrix equation using LinearSolve for large matrices, Mathematica uses only 4 cores (CPU usage is 50%), so I suspect something is wrong with the parallel computation options. When I try to Minimize a huge function of several variables, it is even worse: Mathematica uses just one core (CPU usage is about 12%)!
I am not very familiar with parallel computation, so I would be very grateful for any help. How can I use the full capacity of the CPU when I run LinearSolve on very large matrices and Minimize on huge functions, pushing CPU usage to 100% so that the computation time is as short as possible on my machine?
Thank you very much.
**Edit 1:**
On my computer, this computation:
t = AbsoluteTime[];
primelist = Table[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t
yields a CPU load of 12% with time2 = 43.37, while breaking the analysis into 8 jobs:
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 2500000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 2500001, 5000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 5000001, 7500000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 7500001, 10000000}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 10000001, 12500000}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 12500001, 15000000}]];
job7 = ParallelSubmit[Table[Prime[k], {k, 15000001, 17500000}]];
job8 = ParallelSubmit[Table[Prime[k], {k, 17500001, 20000000}]];
{a1, a2, a3, a4, a5, a6, a7, a8} =
WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t
yields a 100% CPU load with time2 = 17.16.
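For reference, the same eight-way split can be written more compactly; this is a sketch equivalent to the eight jobs above (assuming the range divides evenly into blocks of 2,500,000):
(* With injects the block bounds lo and hi, since ParallelSubmit holds its argument *)
t = AbsoluteTime[];
jobs = Table[
   With[{lo = (j - 1) 2500000 + 1, hi = j 2500000},
    ParallelSubmit[Table[Prime[k], {k, lo, hi}]]], {j, 8}];
primelist = Join @@ WaitAll[jobs];
time2 = AbsoluteTime[] - t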
**Edit 2:**
To make it completely clear what is happening on my computer and what my problem is, please have a look at the following examples.
First, to check the number of processors and kernels on my machine:
$ProcessorCount
$KernelCount
The results are 4 and 8, respectively, on my machine.
Now, to inspect the MKL settings on my machine and to see how closely the "CPU usage" reported by the system monitor corresponds to actual performance, I can run this in Mathematica:
Clear["Global`*"]; a = RandomReal[{1, 2}, {20000, 20000}]; b = RandomReal[{1}, {20000}];
Table[SetSystemOptions["MKLThreads" -> i]; Print["Case=", i]; Print[SystemOptions["MKLThreads"]]; t = AbsoluteTime[]; LinearSolve[a, b]; time2 = AbsoluteTime[] - t; Print["t(", i, ")=", time2]; Print["******"], {i, 4}];
The results, including the number of MKL threads and the computation time for each case, are shown below:
Case=1
{MKLThreads->1}
t(1)=202.9560000
******
Case=2
{MKLThreads->2}
t(2)=120.3696000
******
Case=3
{MKLThreads->3}
t(3)=93.5532000
******
Case=4
{MKLThreads->4}
t(4)=88.5300000
******
The CPU usage reported by the system monitor was 12% for Case 1, 25% for Case 2, 37% for Case 3, and 50% for Case 4. So in this test the reported "CPU usage" does correspond to actual performance: the higher the CPU usage, the shorter the computation time.
Now, if I increase the number of MKL threads in SetSystemOptions["MKLThreads" -> ?] beyond 4 (i.e. to 5-8), it has no effect on computation time or CPU usage. The same happens if I change "ParallelThreadNumber" in SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> ?}]: the computation time and CPU usage do not depend on it. You can see the cases below:
SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> 1}];
Print[SystemOptions["ParallelOptions" -> "ParallelThreadNumber"]];
SetSystemOptions["MKLThreads" -> 8]; Print[SystemOptions["MKLThreads"]];
t=AbsoluteTime[];
LinearSolve[a, b];
time2=AbsoluteTime[] - t;
Print["t=", time2];
The result (CPU usage = 50% during the analysis):
{ParallelOptions->{ParallelThreadNumber->1}}
{MKLThreads->4}
t=85.3008000
And for the other case:
SetSystemOptions["ParallelOptions" -> {"ParallelThreadNumber" -> 8}];
Print[SystemOptions["ParallelOptions" -> "ParallelThreadNumber"]];
SetSystemOptions["MKLThreads" -> 8]; Print[SystemOptions["MKLThreads"]];
t=AbsoluteTime[];
LinearSolve[a, b];
time2=AbsoluteTime[] - t;
Print["t=", time2];
The result (again, CPU usage = 50% during the analysis):
{ParallelOptions->{ParallelThreadNumber->8}}
{MKLThreads->4}
t=85.3476000
As you can see, the CPU usage and computation time do not change when I increase MKLThreads beyond 4 (e.g. to 5-8), and they are also independent of ParallelThreadNumber.
Another interesting set of examples relates to the case mentioned in Edit 1. Please have a look at the following examples, with the results and CPU usage for each case:
1)
Clear["Global`*"];
t = AbsoluteTime[];
primelist = Table[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t
Result: time2=43.37 and CPU usage=12%
2)
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 10000000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 10000001, 20000000}]];
{a1, a2} = WaitAll[{job1, job2}];
time2 = AbsoluteTime[] - t
Result: time2=30.01 and CPU usage=25%
3)
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 6666666}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 6666667, 13333332}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 13333333, 20000000}]];
{a1, a2, a3} = WaitAll[{job1, job2, job3}];
time2 = AbsoluteTime[] - t
Result: time2=23.46 and CPU usage=37%
4)
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 5000000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 5000001, 10000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 10000001, 15000000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 15000001, 20000000}]];
{a1, a2, a3, a4} = WaitAll[{job1, job2, job3, job4}];
time2 = AbsoluteTime[] - t
Result: time2=21.52 and CPU usage=50%
5)
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 3333333}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 3333334, 6666666}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 6666667, 9999999}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 10000000, 13333333}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 13333334, 16666666}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 16666667, 20000000}]];
{a1, a2, a3, a4, a5, a6} = WaitAll[{job1, job2, job3, job4, job5, job6}];
time2 = AbsoluteTime[] - t
Result: time2=18.28 and CPU usage=75%
6)
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[Table[Prime[k], {k, 1, 2500000}]];
job2 = ParallelSubmit[Table[Prime[k], {k, 2500001, 5000000}]];
job3 = ParallelSubmit[Table[Prime[k], {k, 5000001, 7500000}]];
job4 = ParallelSubmit[Table[Prime[k], {k, 7500001, 10000000}]];
job5 = ParallelSubmit[Table[Prime[k], {k, 10000001, 12500000}]];
job6 = ParallelSubmit[Table[Prime[k], {k, 12500001, 15000000}]];
job7 = ParallelSubmit[Table[Prime[k], {k, 15000001, 17500000}]];
job8 = ParallelSubmit[Table[Prime[k], {k, 17500001, 20000000}]];
{a1, a2, a3, a4, a5, a6, a7, a8} =
WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t
Result: time2=17.16 and CPU usage=100%
But if I run this analysis using ParallelTable, interestingly the CPU usage is 100%, yet the computation time is 45.81! That is, the computation time is about the same as in case 1, where Table ran on one core (CPU usage = 12%)!
t = AbsoluteTime[];
primelist = ParallelTable[Prime[k], {k, 1, 20000000}];
time2 = AbsoluteTime[] - t
Result: time2=45.81 and CPU usage=100%
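Following the suggestion in the comments that Prime is stateful and that fine-grained scheduling is the bottleneck, here is a sketch that lets ParallelTable dispatch a few large blocks instead of 20,000,000 tiny evaluations (I have not verified the exact speedup):
(* "CoarsestGrained" sends one large block per kernel instead of many small batches *)
t = AbsoluteTime[];
primelist = ParallelTable[Prime[k], {k, 1, 20000000}, Method -> "CoarsestGrained"];
time2 = AbsoluteTime[] - t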
I also tested NMinimize on my big function with 75 (or more) variables using all of the methods available in Mathematica: Automatic, "DifferentialEvolution", "NelderMead", "RandomSearch", and "SimulatedAnnealing". The computation time is about the same for all of them, and the CPU usage is only 12% in every case. So it appears that the choice of minimization method in NMinimize does not change the parallelization behavior.
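For reference, this is a sketch of the timing comparison I ran, shown here with the two-variable Rosenbrock function as a stand-in for my real objective (which is too large to include):
(* Time NMinimize under each built-in method; First@AbsoluteTiming gives seconds *)
f = (1 - x)^2 + 100 (y - x^2)^2;
Table[{m, First@AbsoluteTiming[NMinimize[f, {x, y}, Method -> m]]},
 {m, {Automatic, "DifferentialEvolution", "NelderMead", "RandomSearch", "SimulatedAnnealing"}}]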
Now I think my setup and my problems are completely clear, so I would be very grateful if someone could help me use the full capacity of my CPU in LinearSolve and NMinimize (or Minimize). I still wonder how I can push CPU usage to 100% in these cases. That would also let us check whether CPU usage corresponds to actual performance for LinearSolve and NMinimize, as it did in the examples above.
Thank you very much.
**Edit 3:**
The function I am trying to minimize is a large function of many variables. Its general form is something like this:
(15 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])] + uu1[2]))^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])] -2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2] + phi2[3])) + E^(1.*10^6 (-phi2[2] + phi2[3]))] -2 uu1[2] + uu1[3])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[2] + phi2[3])) + E^(1.*10^6 (-phi2[2] + phi2[3]))] -2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3] + phi2[4])) + E^(1.*10^6 (-phi2[3] + phi2[4]))] + uu1[2] - 2 uu1[3] + uu1[4])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[3] + phi2[4])) + E^(1.*10^6 (-phi2[3] + phi2[4]))] - 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4] + phi2[5])) + E^(1.*10^6 (-phi2[4] + phi2[5]))] + uu1[3] - 2 uu1[4] + uu1[5])^2 + 225 (-2.14286*10^-8 Log[E^(-1.*10^6 phi2[6]) + E^(1.*10^6 phi2[6])]+ 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5] + phi2[6])) + E^(1.*10^6 (-phi2[5] + phi2[6]))] + uu1[5] - 2 uu1[6])^2 + 225 (2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[4] + phi2[5])) + E^(1.*10^6 (-phi2[4] + phi2[5]))] - 2.14286*10^-8 Log[E^(-1.*10^6 (-phi2[5] + phi2[6])) + E^(1.*10^6 (-phi2[5] + phi2[6]))] + uu1[4] - 2 uu1[5] + uu1[6])^2 + ((15 (2.14286*10^-8 Log[E^(-1.*10^6 phi2[6]) + E^(1.*10^6 phi2[6])] + uu1[6]))^2) + (0.00918367 phi2[2] + (-0.00175179 - 11/112 (0.007848)) (1 - (1.*10^12 (phi2[2])^2)/(Log[E^(-1.*10^6 phi2[2]) + E^(1.*10^6 phi2[2])])^2) + ...
where uu1[i], uu3[i], and phi2[i] are the variables. The issue is that the number of variables can grow large (for example, to 5000 or more), which makes the function enormous. If I cannot use the full capacity of the CPU, minimizing such a function may take days. Even one computer at full CPU capacity may not be enough for such a problem, but the first step is to learn how to configure parallelization for NMinimize (or FindRoot) on a single machine, so that I can later extend it to several remote machines.
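As an aside, the term Log[E^(-10^6 x) + E^(10^6 x)] that appears throughout can be rewritten exactly, which avoids the enormous intermediate exponentials; a minimal sketch (assuming this substitution is acceptable for my problem):
(* Exact identity: Log[E^(-a x) + E^(a x)] == a Abs[x] + Log[1 + E^(-2 a Abs[x])].
   The remaining exponential argument is always <= 0, so nothing overflows. *)
stableLogSumExp[x_, a_] := a Abs[x] + Log[1 + Exp[-2 a Abs[x]]];
(* Quick check against the naive form at a moderate argument: *)
With[{a = 10., x = 0.3}, {stableLogSumExp[x, a], Log[Exp[-a x] + Exp[a x]]}]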
**Edit 4:**
An example of the complete form of the function with 75 variables is:
The variables (unknown parameters) are:
{uu1[2], uu3[2], phi2[2], uu1[3], uu3[3], phi2[3], uu1[4], uu3[4], phi2[4], uu1[5], uu3[5], phi2[5], uu1[6], uu3[6], phi2[6], uu1[7], uu3[7], phi2[7], uu1[8], uu3[8], phi2[8], uu1[9], uu3[9], phi2[9], uu1[10], uu3[10], phi2[10], uu1[11], uu3[11], phi2[11], uu1[12], uu3[12], phi2[12], uu1[13], uu3[13], phi2[13], uu1[14], uu3[14], phi2[14], uu1[15], uu3[15], phi2[15], uu1[16], uu3[16], phi2[16], uu1[17], uu3[17], phi2[17], uu1[18], uu3[18], phi2[18], uu1[19], uu3[19], phi2[19], uu1[20], uu3[20], phi2[20], uu1[21], uu3[21], phi2[21], uu1[22], uu3[22], phi2[22], uu1[23], uu3[23], phi2[23], uu1[24], uu3[24], phi2[24], P1, F1, M1, PN, FN, MN}
I know this function is extremely unstable, but the optimum of the function is known and equal to zero, so I am trying to find the values of the parameters that make the function minimal (zero). The parameter values that make the whole function as small as possible are the best answers.
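Based on the comments, here is a sketch of how I might call FindMinimum with explicit starting values rather than the default guess of 1 for every parameter (assuming the 75-variable objective above is stored in f; the 0.01 starting values are hypothetical):
(* Build the 75 variables: 23 blocks of {uu1, uu3, phi2} plus 6 boundary parameters *)
vars = Join[Flatten@Table[{uu1[i], uu3[i], phi2[i]}, {i, 2, 24}],
   {P1, F1, M1, PN, FN, MN}];
(* Pair each variable with an explicit starting value *)
FindMinimum[f, Transpose[{vars, ConstantArray[0.01, Length[vars]]}],
 Method -> "QuasiNewton"]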
**Edit 5:**
Thank you all for your helpful comments. Based on what KAI and Oleksandr R. mentioned, and as far as I understand, LinearSolve already uses the full capacity of the CPU cores to solve an equation involving large matrices. Consequently, it would seem that if I want to solve several linear matrix equations, the best approach is to solve them one by one in a loop: in each step, Mathematica uses the full capacity of the CPU cores to solve that equation as efficiently as possible, then moves on to the next. But if you look at these two examples, it apparently does not work like this, and parallelization lets Mathematica solve the LinearSolve problems in a way that seems more efficient and faster. You can check these examples on your computer. Given the comments above, I wonder how these examples can be explained.
Example 1:
Clear["Global`*"];
t = AbsoluteTime[];
NN = 8;
CC = Array[cc, NN];
For[i = 1, i < (NN + 1), i++,
Clear[a, b];
a = RandomReal[{i, i + 1}, {6000, 6000}];
b = RandomReal[i, {6000}];
CC[[i]] = LinearSolve[a, b];
];
time2 = AbsoluteTime[] - t
For Example 1, the CPU usage is 50% and time2 = 23.4.
Example 2:
Clear["Global`*"];
t = AbsoluteTime[];
job1 = ParallelSubmit[a1 = RandomReal[{1, 2}, {6000, 6000}]; b1 = RandomReal[1, {6000}]; c1 = LinearSolve[a1, b1]];
job2 = ParallelSubmit[a2 = RandomReal[{2, 3}, {6000, 6000}]; b2 = RandomReal[2, {6000}]; c2 = LinearSolve[a2, b2]];
job3 = ParallelSubmit[a3 = RandomReal[{3, 4}, {6000, 6000}]; b3 = RandomReal[3, {6000}]; c3 = LinearSolve[a3, b3]];
job4 = ParallelSubmit[a4 = RandomReal[{4, 5}, {6000, 6000}]; b4 = RandomReal[4, {6000}]; c4 = LinearSolve[a4, b4]];
job5 = ParallelSubmit[a5 = RandomReal[{5, 6}, {6000, 6000}]; b5 = RandomReal[5, {6000}]; c5 = LinearSolve[a5, b5]];
job6 = ParallelSubmit[a6 = RandomReal[{6, 7}, {6000, 6000}]; b6 = RandomReal[6, {6000}]; c6 = LinearSolve[a6, b6]];
job7 = ParallelSubmit[a7 = RandomReal[{7, 8}, {6000, 6000}]; b7 = RandomReal[7, {6000}]; c7 = LinearSolve[a7, b7]];
job8 = ParallelSubmit[a8 = RandomReal[{8, 9}, {6000, 6000}]; b8 = RandomReal[8, {6000}]; c8 = LinearSolve[a8, b8]];
{R1, R2, R3, R4, R5, R6, R7, R8} = WaitAll[{job1, job2, job3, job4, job5, job6, job7, job8}];
time2 = AbsoluteTime[] - t
For Example 2, the CPU usage is 100% and time2 = 19.8.
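One comment below suggests that the timing difference comes from the RandomReal calls (which run concurrently across the subkernels in Example 2 but serially in Example 1), not from LinearSolve itself. A minimal sketch to test that hypothesis, generating all matrices up front so that only the LinearSolve calls are timed:
(* If the comment is right, this serial loop should be about as fast as Example 2 *)
mats = Table[RandomReal[{i, i + 1}, {6000, 6000}], {i, 8}];
rhss = Table[RandomReal[i, {6000}], {i, 8}];
t = AbsoluteTime[];
sols = Table[LinearSolve[mats[[i]], rhss[[i]]], {i, 8}];
time2 = AbsoluteTime[] - t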
**Comments:**

ParallelTable[Pause[20], {i, 8}] yields a load of 87% on my 2.5 GHz i7. – chris Sep 23 '12 at 13:11

… ParallelTable to start parallel evaluations in all those areas. Pick the lowest result from the table returned. – Sjoerd C. de Vries Sep 23 '12 at 15:05

… LinearSolve on a 4-core CPU. For other types of workload that aren't as well optimized it isn't always so clear, but you can try SetSystemOptions["ParallelOptions" -> "ParallelThreadNumber" -> 8], which I found gave a boost of about 20% in this question, which also relates to NMinimize. Don't expect miracles from SMT; "CPU usage" as reported by the system monitor usually does not correspond in any direct way to actual performance. – Oleksandr R. Sep 23 '12 at 17:51

… t = AbsoluteTime[]; primelist = ParallelTable[Prime[k], {k, 1, 20000000}]; time2 = AbsoluteTime[] - t in one command, so to speak. It does it in 60 sec on my laptop. – chris Sep 23 '12 at 19:45

… LinearSolve may be because the MKL detects that HT is on and does not use it. In my experience, HT may be beneficial if the tasks use different parts of the CPU, which is not the case for numerics. Since I do a lot of numerics, I usually switch HT off altogether. Concerning NMinimize, it would be good to see the problem at hand; perhaps something can be done. – Sep 24 '12 at 02:43

… (LinearSolve, NMinimize, and tabulating primes) have very different performance characteristics, so your conclusions in each case are not transferable to the others. For LinearSolve, you actually are using all of your CPU's resources, even though the CPU monitor claims 50% usage. Tabulating primes is "embarrassingly parallel" and can benefit from SMT (Hyper-Threading), so … – Oleksandr R. Sep 24 '12 at 07:13

… NMinimize: basically the current implementation is serial only, but the algorithms it uses are parallelizable in principle to some extent, which is why I asked which of the algorithms … – Oleksandr R. Sep 24 '12 at 07:17

… NMinimize itself is parallelizable) but may or may not be appropriate for any given problem. Essentially the one thing that will really help here is more detail about the function you are actually minimizing. – Oleksandr R. Sep 24 '12 at 07:21

Prime is stateful; it matters which arguments have been passed already. You can get improved performance by blocking, either using ParallelSubmit or the Method -> "CoarsestGrained" option of ParallelTable. More on Prime here. – Oleksandr R. Sep 24 '12 at 08:16

… LinearSolve and … – Oleksandr R. Sep 24 '12 at 12:15

… LinearSolve optimal performance is by not using HT and (maybe) setting thread affinity. – Sep 24 '12 at 13:05

… FindMinimum with the "ConjugateGradient" or "QuasiNewton" (L-BFGS) methods. If it's only slightly non-convex, try my Nelder-Mead code (not the implementation offered by NMinimize, which is much slower). If it's severely non-convex and/or possesses many local minima, you might have serious problems; differential evolution is one possible (parallelizable) option, but the algorithm is not very efficient. – Oleksandr R. Sep 24 '12 at 14:43

… NMinimize offers a parallel implementation of differential evolution (it doesn't, but it can be done). Could you please edit a full 75- (or more) dimensional example into your question? The 12-dimensional excerpt you supplied so far does not give any difficulties to FindMinimum, so it is difficult to determine where exactly your problems arise. – Oleksandr R. Sep 24 '12 at 15:09

… CopyToClipboard@InputForm[expr] to copy it. – Oleksandr R. Sep 24 '12 at 16:16

FindMinimum and NMinimize start with an initial guess of 1 for the value of each parameter: you are working with huge numbers whose logs are about 1 million and looking for small differences between these, which is unlikely to work out, for obvious reasons. Perhaps you will have better luck if you first improve the stability characteristics of your function. – Oleksandr R. Sep 24 '12 at 18:55

… LinearSolve (honestly, you should split this question into several on the different, unrelated topics you bring up): the timing difference is due to parallelization of the RandomReal calls, which otherwise are single-threaded. – Oleksandr R. Sep 27 '12 at 13:38

… Log[Exp[-x] + Exp[+x]] with a function that returns the same value without generating intermediate values that vary by hundreds of thousands of orders of magnitude. It will be highly preferable if you can keep all intermediates within the range limitations of machine-precision numbers. – Oleksandr R. Sep 27 '12 at 13:43

(-(Abs[x]*(-1 + Tanh[8 - 4*Abs[x]])) + ((3*(25 + 4*Abs[x]^2 - 375/(15 + 4*Abs[x]^2)))/64 + Log[2])*(1 + Tanh[8 - 4*Abs[x]]))/2. This deviates from Log[Exp[-x] + Exp[+x]] by less than 1% and should suffice as a replacement. If you also scale your variables then we may be in business. By the way, do you need to find the minimum, or will a value close to the minimum suffice? The problem is, I'm not sure if it's possible to do the former (or, if you do, to prove that you succeeded). The latter is difficult in a 5000-dimensional problem but not impossible. – Oleksandr R. Sep 27 '12 at 18:22

… Log[Exp[-x] + Exp[+x]] with a function including an absolute value, like what you suggested: our target function becomes non-smooth because of the absolute value, and, as far as I know, minimization of a non-smooth function is much more difficult. I need to find values which make the function zero, and this function definitely has such an answer. – mak maak Sep 28 '12 at 13:29