
I have a few thousand image URLs that I want to download asynchronously. How can I do that while monitoring progress? I'm really asking for 5 things:

  • Show dynamic progress in a progress bar
  • Downloads must be asynchronous
  • Avoid any filename collisions
  • Save with correct file extensions (even if not present in url)
  • Show list of failed download tasks and why

Here's an example to get started:

Monitor[
    URLDownload[
         WebImageSearch["dog", "ImageHyperlinks", MaxItems -> 10], 
         "~/Downloads/"
    ]
] 

Update in response to comment

I don't believe URLDownloadSubmit takes a directory; this is the behavior I see (no progress indication):

dogs = WebImageSearch["dog", "ImageHyperlinks", MaxItems -> 10]
URLDownloadSubmit[dogs, "~/Downloads", 
 HandlerFunctions -> <|"TaskProgress" -> Print, 
   "TaskComplete" -> Print|>, 
 HandlerFunctionsKeys -> {"FractionComplete", "ByteCountDownloaded"}]

[screenshot: an error message is printed and no progress output appears]

And the filenames are wrong:

[screenshot: the downloaded files are saved under UUID names without extensions]

Related but not duplicate:

M.R.
    You can use HandlerFunctions like so: URLDownloadSubmit[dogs, "~/Desktop/test", HandlerFunctions -> <|"TaskProgress" -> Print, "TaskComplete" -> Print|>, HandlerFunctionsKeys -> {"FractionComplete", "ByteCountDownloaded"}] to print the FractionComplete and ByteCountDownloaded. It's probably a documentation bug that there's no good example of how to do it, but it is mentioned in the usage of URLDownload and URLDownloadSubmit. You can use ProgressIndicator[Dynamic[frac]] and frac = #FractionDownloaded instead of Print. – Carl Lange Aug 12 '20 at 21:26
  • @CarlLange thanks, see my update – M.R. Aug 12 '20 at 21:38
    I think the message is because "~/Desktop/test" is a file and you are trying to download multiple files to it. Try just "~/Desktop". – Carl Lange Aug 12 '20 at 22:11
  • @CarlLange Sorry for the typo, I fixed that but it doesn't look like it's doing anything still. If you could post your snippet with a progress bar I'll accept! – M.R. Aug 12 '20 at 23:06
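
For reference, here is a minimal sketch of what the comment above suggests, applied to a single URL from dogs. The concrete target path and the file name dog1.jpg are placeholders of mine; "TaskProgress" and "FractionComplete" come from the comment:

frac = 0.;
ProgressIndicator[Dynamic[frac]]

URLDownloadSubmit[
 First[dogs],
 FileNameJoin[{$HomeDirectory, "Downloads", "dog1.jpg"}], (* a file path, not a directory *)
 HandlerFunctions -> <|"TaskProgress" -> ((frac = #FractionComplete) &)|>,
 HandlerFunctionsKeys -> {"FractionComplete"}
 ]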

2 Answers


What I would do is run URLDownloadSubmit once for each URL that you want to download. That starts all of the downloads asynchronously in the background.

When downloading that many images, it's not worth keeping track of how many bytes of each image have been downloaded. It's better to just count the number of files that have finished downloading so far; that should be granular enough.

Start by getting the URLs:

urls = WebImageSearch["dog", "ImageHyperlinks", MaxItems -> 10];

Create a progress indicator:

i = 0;
ProgressIndicator[Dynamic[i], {0, Length[urls]}]

Let outputDir be your output directory. This will start downloading the files:

submitURL[url_] := URLDownloadSubmit[
  url,
  FileNameJoin[{outputDir, Last@URLParse[url, "Path"]}], (* keep the original file name from the URL *)
  HandlerFunctions -> <|
    "TaskFinished" -> (i++ &) (* count each finished download *)
    |>]

submitURL /@ urls;

I used FileNameJoin[{outputDir, Last@URLParse[url, "Path"]}] to keep the original names as requested. However, note that these names may collide with each other. The advantage of UUIDs, as in your screenshot, is that they are always unique.
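
To cover both the collision issue and the extension requirement from the question in one go, here is a possible sketch. The helper name uniqueTargetFile and the "jpg" fallback are illustrative choices of mine; URLParse, FileExtension, FileBaseName and CreateUUID are the standard functions already mentioned above and in the comments:

(* build a unique target path that keeps the URL's base name and extension *)
uniqueTargetFile[url_String, dir_String, defaultExt_String : "jpg"] :=
 Module[{name, ext},
  name = Last[URLParse[url, "Path"]];  (* last path component, e.g. "photo.jpg" *)
  ext = FileExtension[name];
  If[ext === "", ext = defaultExt];    (* some URLs carry no usable extension *)
  FileNameJoin[{dir, FileBaseName[name] <> "-" <> CreateUUID[] <> "." <> ext}]
  ]

FileFormat on the downloaded file, mentioned in the comments below, is an alternative way to recover the type when the URL itself gives no extension.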

C. E.
  • When there are missing urls, the progress indicator will never be 100%. How can you add an error handler to count and show the few bad ones? – M.R. Aug 24 '20 at 22:25
  • If you can add error handler I will accept – M.R. Aug 25 '20 at 01:29
  • Actually, that collision fix isn't perfect as FileExtension returns "" on some urls, is there a better way to inject the uuid into the filename? – M.R. Aug 25 '20 at 05:55
  • It’s nice that this is abortable but how to handle failures in a nice way? – M.R. Aug 25 '20 at 07:23
  • @M.R. It depends. Perhaps it would work to just make a list of all the files that have been downloaded and compare it with the URLs that you tried to download. That way you can make a list of all the files that you failed to download. It would be nice to have some examples of filenames where FileExtension fails, perhaps these are especially tricky cases that need special treatment no matter how it is done. – C. E. Aug 25 '20 at 08:03
  • There are a few in that set of urls – M.R. Aug 25 '20 at 14:39
  • I've added a response to your solution based on your last comment. – M.R. Aug 25 '20 at 17:23
  • @M.R. You may consider FileFormat as an alternative to FileType. For the retries, I don't see any downside, compared with the button, to retrying a fixed number of times. – C. E. Aug 25 '20 at 19:01
  • I'm cleaning up the answer now, it was quite convenient to exchange information in this way but I think that it will be confusing for future visitors. – C. E. Aug 25 '20 at 19:06
  • Wait by cleaning up did you mean improving on the responses I added or just deleting them? – M.R. Aug 26 '20 at 05:47
  • @M.R. Removing your responses to make the answer more useful -- less confusing -- to future readers. – C. E. Aug 26 '20 at 06:57
  • But I improved your answer... – M.R. Aug 26 '20 at 06:57
  • @M.R. If you want to share what you found, the best way is to post your own answer with pieces of code for problems that you solved. I see the merit in what you wrote, but it is not my answer. I would like to keep it short and simple for future visitors. – C. E. Aug 26 '20 at 06:58
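
Following up on the failure-handling discussion in the comment thread above, here is a hedged sketch of one way to count completions and collect failures per URL. The "ConnectionFailed" event and the "StatusCode" handler key are my reading of the URLDownloadSubmit handler mechanism, so check them against your version's documentation; uniqueTargetFile is the illustrative helper sketched earlier.

i = 0;
failed = <||>;   (* url -> reason *)

submitURL[url_] := URLDownloadSubmit[
  url,
  uniqueTargetFile[url, outputDir],
  HandlerFunctions -> <|
    "TaskFinished" -> Function[
      If[IntegerQ[#StatusCode] && #StatusCode >= 400, failed[url] = #StatusCode];
      i++],
    "ConnectionFailed" -> Function[failed[url] = "connection failed"; i++]
    |>,
  HandlerFunctionsKeys -> {"StatusCode"}
  ]

submitURL /@ urls;
Dynamic[{i, failed}]   (* completed count and failure list update live *)

Because failed attempts also increment i, the ProgressIndicator[Dynamic[i], {0, Length[urls]}] from the answer can still reach 100% even when some downloads fail, which is what the first comment asks about.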

Does something like this work for you?

With[
    {
    links = WebImageSearch["dog", "ImageHyperlinks", MaxItems->10],
    dir = CreateDirectory[]
    },
Monitor[
    Table[URLDownload[links[[i]], dir], {i, Length[links]}],
    ProgressIndicator[i / Length[links]]
]

]

It wasn't clear to me from your original question that the downloads were to be asynchronous. Perhaps you can use LocalSubmit for this purpose, e.g.:

With[
    {
    links=WebImageSearch["dog","ImageHyperlinks",MaxItems->10],
    dir=CreateDirectory[]
    },
p=0;
LocalSubmit[
    Table[
        URLDownload[links[[i]],dir];
        Print[i],
        {i, Length[links]}
    ],
    HandlerFunctions-><|"PrintOutputGenerated" -> Function[p = First @ #PrintOutput]|>,
    HandlerFunctionsKeys->{"PrintOutput"}
];
Dynamic @ ProgressIndicator[p/Length[links]]

]

Response:

Thanks for updating; here are 100 image URLs to test on:

urls = CloudGet["https://www.wolframcloud.com/obj/f691b40d-b12a-41b5-9017-9130bc797fd0"]; 

It seems like it is still downloading synchronously because it takes ~40 seconds, but @C.E.'s only takes ~10 seconds (and speed is the most important factor to me). On the plus side, it seems all 100 images download correctly and the filenames avoid collisions, but the With block won't Abort.

M.R.
Carl Woll
  • Well not really, because it's not async, and so will be way too slow for 10k images. – M.R. Aug 17 '20 at 22:06
  • Do you see what I mean? – M.R. Aug 19 '20 at 03:36
  • Your solution is too slow because it is synchronous, so do you see a way around this? – M.R. Aug 22 '20 at 02:49
  • @C.E. Your comment doesn't make sense. LocalSubmit uses one kernel for its code, not one kernel per file. – Carl Woll Aug 22 '20 at 15:43
  • @C.E. Do you agree that @CarlWoll's solution seems to be 5x slower? Why is that, I thought tasks in LocalSubmit would be asynchronous too? – M.R. Aug 25 '20 at 17:01
  • @M.R. I believe the speed difference is because LocalSubmit starts one kernel to do all of the downloads, while each URLDownloadSubmit uses a new kernel. So, the timing difference is related to the number of kernels you have available. – Carl Woll Aug 25 '20 at 17:06
    @CarlWoll No, that was what I first thought that LocalSubmit did. That would have given terrible performance. I assure you that URLDownloadSubmit does not use multiple kernels. – C. E. Aug 25 '20 at 19:05