
URLSaveAsynchronous can download a single URL. How can I download 100000 URLs in an efficient way?

Ideally with an overall progress bar for the whole job.

I've downloaded 90000 HTML pages of a web dictionary in Mathematica this way, which I think is interesting.

progress = 0.;
progFunction[_, "progress", {dlnow_, dltotal_, _, _}] := Quiet[progress = dlnow/dltotal]
Dynamic[ProgressIndicator[progress]]

(* urls and names are my lists of source URLs and target file names *)
cells = Table[Defer[URLSaveAsynchronous][urls[[i]], names[[i]], progFunction,
    "Progress" -> True], {i, 10000}];

Here is the working document:

CreateDocument[ExpressionCell[#, "Input"] & /@ cells]

Maybe it can be sped up?

Another aspect is how to do the whole job more comfortably in Mathematica.

Building the download tasks in the working document takes so much time that it doesn't feel like I'm downloading asynchronously at all.
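
For the overall progress bar, something along these lines is what I have in mind (a sketch only, assuming urls and names are defined as above; done, finished and countFinished are just names I made up here): submit the tasks directly and count completed downloads instead of watching each file separately.

done = 0; total = Length[urls];

(* the "progress" event can fire more than once at completion, and dltotal
   can be 0 when the server sends no Content-Length, so guard both cases *)
countFinished[task_, "progress", {dlnow_, dltotal_, _, _}] :=
  If[dltotal > 0 && dlnow == dltotal && ! TrueQ[finished[task]],
   finished[task] = True; done++]

Dynamic[ProgressIndicator[done, {0, total}]]

(* URLSaveAsynchronous returns immediately, so this loop only queues the downloads *)
tasks = MapThread[
   URLSaveAsynchronous[#1, #2, countFinished, "Progress" -> True] &,
   {urls, names}];

This still queues everything at once, though, so some throttling is probably still needed.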

HyperGroups
  • "how to download 10000 urls in an efficient way?" — you use a different tool better suited for the job. – rm -rf May 19 '13 at 05:52
  • You should consider using python. – chyanog May 19 '13 at 06:38
  • curl or wget perhaps with GNU parallel should be perfectly sufficient. – Ajasja May 19 '13 at 06:47
  • @rm-rf Well, some tools, like one I frequently use, only accept 1000 URLs at a time. – HyperGroups May 19 '13 at 07:57
  • @Ajasja I'm a Windows user. – HyperGroups May 19 '13 at 07:58
  • @chyanog I know nothing about Python. If I knew Python, I'd like the 3D software Blender much more. – HyperGroups May 19 '13 at 08:01
  • No problem. I'm on windows too. There are native ports of these tools. – Ajasja May 19 '13 at 09:10
  • I don't understand this question - surely the limiting factor on speed is the network (and the nature of the URLs) rather than the code used to initiate the download? – cormullion May 19 '13 at 09:28
  • Answer: Use Python and an efficient proxy server. – Misery May 19 '13 at 11:16
  • Consider the damn-slow-network-speed in China... – mmjang May 19 '13 at 13:28
  • @mm.Jang While I upvoted you for the damn slow network speed in China, I think the bottleneck for downloading hundreds of thousands of small files (I assume the pages the OP wants are mainly plain text) is not the network bandwidth but your local I/O efficiency. (Imagine copying 100000 1 kB files from one disk to another!) – Silvia May 19 '13 at 18:56
  • Before URLSaveAsynchronous I used Mathematica to monitor the performance of a ruby download script and change the number of threads to maximize performance. Now it should be possible to do all in Mathematica and take advantage of the advanced parser (patterns for XML) not found in other languages. I wish I had the time to work on this now. – Gustavo Delfino May 19 '13 at 18:58
  • @Silvia Yeah, actually importing an HTML file as plain text takes about the same time as URLFetch on the URLs. – HyperGroups May 21 '13 at 02:55
  • @GustavoDelfino I hope you get to work on this topic; there are many benefits to using XML, StringCases, and batch-generated URL links in Mathematica. – HyperGroups May 21 '13 at 03:01
  • @cormullion One aspect is that organizing the whole task is itself an important problem, since the job may take several hours or several days. In my previous method there was no overall progress bar, and reading the commands into Mathematica to build the tasks seemed to take so much time that it didn't feel like I was downloading asynchronously. – HyperGroups May 21 '13 at 03:05
  • @mm.Jang Well, the Baidu dictionaries cost me 10 hours and 3 GB of HTML files. I hope the Great Firewall won't wall off Mathematica.SE. – HyperGroups May 21 '13 at 03:07
  • @HyperGroups You can't ask and answer in Chinese on the Stack Exchange network so far, so I think it won't be blocked. – mmjang May 21 '13 at 08:47

1 Answer


This is a complicated task and I believe that Mathematica is not the best tool for it. If you want to do it only once, just go ahead with the method described in your question. Otherwise, if it is a frequent task, pick a better-suited tool such as Apache Nutch, which has a built-in crawler. It is not so easy to get used to, but it saves a lot of time afterwards.

Anyway, if you want to use Mathematica, here are some guidelines that increase efficiency. Writing a high-performance crawler requires a good knowledge of the TCP/IP protocol stack; specifically, you should be familiar with the details of the TCP and HTTP protocols. I am not going to write the code, since testing it requires the database of links and takes a huge amount of time.

I assume that you want to fetch the URLs using a single machine, and that the machine has enough power so that processing/storage is not the bottleneck. There are then at least two bottlenecks involved in the crawling process: one is your network throughput (the connection between your machine and the outside world) and the other is the server(s).

Fetching the URLs one by one is not a good idea: because of the latency, the throughput of your crawler would be very low. Fetching all of the URLs in parallel at once is a bad idea too. There are limits on the number of connections any machine can handle at the same time (because of the limited memory and processing power available), and servers also prevent their clients from fetching too many pages in a short period of time, to ward off denial-of-service attacks. Therefore, there must be an optimal value, say $k_t$, for the number of parallel fetches at any given time, and this number changes over time because of the varying condition of your network.
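
To make this concrete, here is a minimal sketch of keeping roughly $k_t$ downloads in flight, building on the code in the question (urls and names come from there; submitNext, taskDone, running, nextIndex and seen are names I am making up for this sketch, and detecting completion from the "progress" event assumes the server reports a Content-Length):

k = 40;  (* current concurrency limit, i.e. the k_t of the text *)
nextIndex = 1; running = 0; total = Length[urls];

(* top up the number of running tasks to the current limit k *)
submitNext[] := While[running < k && nextIndex <= total,
  With[{i = nextIndex++},
   running++;
   URLSaveAsynchronous[urls[[i]], names[[i]], taskDone, "Progress" -> True]]]

(* "progress" may fire several times at completion, so count each task once;
   when a task finishes, start the next download *)
taskDone[task_, "progress", {dlnow_, dltotal_, _, _}] :=
  If[dltotal > 0 && dlnow == dltotal && ! TrueQ[seen[task]],
   seen[task] = True; running--; submitNext[]]

submitNext[]  (* prime the pipeline with k parallel downloads *)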

Guideline 1: Do not open too many connections to a single server. Open at most 3-5 connections. Instead, use multiple servers to have more connections open. For example, if $k_t=40$, select 10 servers and open at most 4 connections to each of them. If you have more than 10 servers in your list, you can choose more servers, each with fewer connections.
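
For illustration, one way to spread the load in Mathematica is to group the jobs by host and interleave the groups before feeding them to the dispatcher above, so that consecutive submissions go to different servers. This is only a sketch: host is a hypothetical helper that assumes well-formed http(s) URLs, and pad is just a sentinel symbol.

(* hypothetical helper: extract the host part of a URL *)
host[url_String] :=
  First@StringCases[url, "//" ~~ Shortest[h__] ~~ ("/" | EndOfString) :> h, 1]

(* group {url, file} pairs by server, then interleave the groups *)
jobs = Transpose[{urls, names}];
byHost = GatherBy[jobs, host[First[#]] &];
len = Max[Length /@ byHost];
interleaved = DeleteCases[Flatten[Transpose[PadRight[#, len, pad] & /@ byHost], 1], pad];
{urls, names} = Transpose[interleaved];  (* reordered lists for the dispatcher *)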

Guideline 2: To find the optimal value of $k_t$, I propose using a closed-loop controller. Set an initial value $k_0$, say 10. Increase $k_t$ once every few minutes as long as increasing it also increases the throughput; otherwise, decrease it. Knowing how TCP controls congestion helps a lot.
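
As a rough sketch of such a controller (it reuses submitNext from the dispatcher sketch above; bytesThisInterval is assumed to be incremented inside the progress callback, and the step size and 5-minute period are arbitrary choices, not tuned values):

k = 10; lastBytes = 0; bytesThisInterval = 0;

(* simple hill-climbing controller: if throughput went up since the last
   interval, allow more parallel connections; otherwise back off *)
adjustK[] := (
  If[bytesThisInterval > lastBytes,
   k += 5; submitNext[],       (* raise the limit and top up immediately *)
   k = Max[5, k - 5]];         (* lower the limit; running tasks drain on their own *)
  lastBytes = bytesThisInterval;
  bytesThisInterval = 0)

RunScheduledTask[adjustK[], 300];  (* re-evaluate every 5 minutes *)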

Guideline 3: There is another trick used by crawlers and browsers to speed up the download of multiple files, but I am not sure whether there is a straightforward way to implement it in Mathematica (i.e., without using Java). Here is the trick: download multiple files, back to back, over a single TCP connection to the server. I think URLSaveAsynchronous does not do this by default, and you would need a bit more effort to implement it. The reason behind this technique is that establishing a connection to the server takes time and has a large overhead, so reusing the connection for several downloads amortizes that overhead.
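
If you do go through Java, here is a rough J/Link sketch of the idea (fetchKeepAlive is a made-up helper; Java's HttpURLConnection pools connections per host by default, so back-to-back fetches from the same server can reuse one socket, provided the response is read fully and only the stream, not the connection, is closed, and provided the server honors keep-alive):

Needs["JLink`"];
InstallJava[];

(* fetch one page over Java's HttpURLConnection; repeated calls for the
   same host can then share one TCP connection via Java's keep-alive pool *)
fetchKeepAlive[url_String, file_String] :=
 JavaBlock[
  Module[{conn, reader, line, lines},
   conn = JavaNew["java.net.URL", url]@openConnection[];
   reader = JavaNew["java.io.BufferedReader",
     JavaNew["java.io.InputStreamReader", conn@getInputStream[]]];
   lines = Flatten@Last@Reap[
      While[(line = reader@readLine[]) =!= Null, Sow[line]]];
   reader@close[];  (* close the stream, not the connection *)
   Export[file, StringJoin[Riffle[lines, "\n"]], "Text"]]]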

Helium
  • Nice guidelines; I have felt these deeply myself. One by one is absolutely not a good choice, and too many connections will hit the network and local I/O bottlenecks. – HyperGroups May 21 '13 at 02:58