1

I have a simple piece of code:

results={};
Do[If[!FailedQ[g = grab[i]], AppendTo[results, g]], {i, 2000000, 3000000}]

I had thought that the simplest way to parallelize this would be

LaunchKernels[]
SetSharedVariable[results]

And then rerun the loop with ParallelDo in place of Do.

But this doesn't work. What is the correct and simplest way to parallelize a trivial accumulation loop like this?

Here is the function for testing:

FailedQ[expr_] := FailedQ[expr, 0]
FailedQ[expr_, d : _Integer | \[Infinity]] := !FreeQ[expr, $FailedSymbols, {0, d}]
grab[n_] := Quiet @ Module[{u, r, i},
  u = "http://photo.net/photodb/photo?photo_id=" <> ToString[n];
  Check[
    r = First @ StringCases[Import[u, "HTML"],
          "ratings, " ~~ Shortest[s__] ~~ " average" :> ToExpression[StringDrop[s, -2]]];
    i = Import["http://gallery.photo.net/photo/" <> ToString[n] <> "-lg.jpg", "Image"],
    Return @ $Failed];
  Return[{n, i, r}]]
M.R.
  • Note: I didn't post the code for grab because I hope that the solution will be agnostic to the internals of the custom function in the loop. – M.R. Jul 23 '15 at 22:29
  • Doesn't work why? Of course, if grab is stateful or has other side-effects, it might be far from trivial. – Oleksandr R. Jul 23 '15 at 22:38
  • What is FailedQ? Note that Reap and Sow are far more efficient at list accumulation problems than AppendTo, since AppendTo's performance degrades as the list gets larger (I think). – DumpsterDoofus Jul 24 '15 at 01:26
  • @OleksandrR. I added the function so you can try it – M.R. Jul 24 '15 at 01:53
  • @DumpsterDoofus I added the FailedQ function – M.R. Jul 24 '15 at 01:55
  • To fetch images in parallel from an online source you should use URLFetchAsynchronous and friends. Parallel* functions are not the right way to do it, because you'll spend most of your time just waiting for the server to respond/file transfer. I use URLFetchAsynchronous here. – C. E. Jul 24 '15 at 02:01
  • @Pickett But I'm confused how to make sure that function to use all kernels and cores... – M.R. Jul 24 '15 at 02:10
  • 1
    @M.R. You won't need to use several cores because that's not what's taking up time. You're not doing any heavy processing. The reason your code without any parallelizing is slow is because you fetch images synchronously; you start downloading one image, and when that's done you start on the next. All your time is spent waiting for the files to transfer, not waiting for your kernels to do computations. Now what if you could download ten images simultaneously on one kernel? That's what asynchronous fetching does. – C. E. Jul 24 '15 at 02:17
  • @Pickett I see, but how can you tell URLFetchAsynchronous how many threads to use? And don't you still have to use Import to get the actual image? I don't see any examples of getting images conditionally with URLFetchAsynchronous calls in the documentation. – M.R. Jul 24 '15 at 02:20
  • @M.R. You don't tell URLFetchAsynchronous to use a specific number of threads. Each time you call URLFetchAsynchronous you start a new job in the background. (There may be limits on how many background jobs that can run at once, I don't know what it is.) These functions are rather low level, I'm not sure what a function that loads images conditionally would look like?! Anyway, I would propose to first use URLFetchAsynchronous to get all the HTML that you require for your tests, then based on that create a list of images you want to download. Then use URLSaveAs.. or URLFetchAs.. – C. E. Jul 24 '15 at 02:56
  • Collect within subkernels, combine results at the end. – Szabolcs Jul 24 '15 at 07:10
  • @Szabolcs If I need to start downloading images conditionally after downloading the raw html, how can I start one Map before the other is finished? – M.R. Jul 27 '15 at 20:06
  • I see that you put a bounty on this question, but I feel that you would get better answers with a new question that clearly describes your situation and asks how you can download all the images, based on your conditions, as quickly as possible. – C. E. Jul 28 '15 at 01:20
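
The per-kernel collection that Szabolcs suggests can be sketched as follows, assuming grab and FailedQ are defined as in the question (ParallelMap distributes those definitions to the subkernels automatically in recent versions; the chunk size of 10000 is arbitrary, and UpTo needs version 10.1 or later):

```
(* Sketch of "collect within subkernels, combine results at the end":
   each subkernel accumulates its own hits with Sow/Reap, and the main
   kernel joins the per-chunk pieces at the end *)
LaunchKernels[];
chunks = Partition[Range[2000000, 3000000], UpTo[10000]];
pieces = ParallelMap[
   Function[chunk,
     Module[{g},
       (* Last@Reap gives {} if nothing was sown, else {{e1, e2, ...}} *)
       Last @ Reap[Do[If[! FailedQ[g = grab[i]], Sow[g]], {i, chunk}]]]],
   chunks];
results = Join @@ Flatten[pieces, 1];
```

This avoids SetSharedVariable entirely, so the subkernels never contend for a shared results list; as the comments point out, though, the real bottleneck here is the synchronous downloading, not the accumulation.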

3 Answers

2

This may not be the fastest way, but it's a way:

SetSharedFunction[ParallelSow]
ParallelSow[expr_] := Sow[expr]
Reap[ParallelDo[If[countedQ[i], ParallelSow[f[i]]], {i, 1, 10^7}]]

where countedQ[i] is some Boolean function that determines whether f[i] gets added to the accumulated list. Feel free to make improvements.

Note that building a list by repeated AppendTo costs $O(n^2)$ in total, where $n$ is the final list length, as documented in the "Possible Issues" section of AppendTo's documentation page. For small lists AppendTo is more convenient, but for larger lists Reap and Sow are asymptotically better (I think they're $O(n)$).
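
The difference is easy to see with a throwaway timing comparison (the absolute numbers are machine-dependent; what matters is how they grow as you increase n):

```
(* build a 10^5-element list both ways: AppendTo copies the whole list
   on every step, while Sow/Reap accumulates without copying *)
n = 10^5;
tAppend = First @ AbsoluteTiming[acc = {}; Do[AppendTo[acc, i], {i, n}]];
tSow    = First @ AbsoluteTiming[sown = Last @ Reap[Do[Sow[i], {i, n}]]];
{tAppend, tSow}
```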

DumpsterDoofus
0

How about

results = Cases[
  ParallelTable[If[! FailedQ[g = grab[i]], g], {i, 2000000, 3000000}],
  Except[Null]]

At some point you'll run into a memory issue with all the Nulls, but I think with only 10^6 you are ok.

george2079
0

Here's a simple approach:

ParallelTable[With[{g = grab[i]}, If[FreeQ[g, $Failed], g, Nothing]], {i, 2000000, 3000000}]

Note: Nothing requires version 10.2
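
On earlier versions, one possible sketch of the same idea is to let the If return its default Null and strip those afterwards (much like george2079's Cases approach):

```
(* If[cond, g] with no else branch returns Null when cond is False;
   DeleteCases then strips the Nulls from the assembled table *)
DeleteCases[
  ParallelTable[With[{g = grab[i]}, If[FreeQ[g, $Failed], g]],
    {i, 2000000, 3000000}],
  Null]
```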

rhennigan