11

I have the following problem: when I retrieve a page from Hackage, there is a long delay (about 30 seconds) before the response arrives. Further requests are fast, but if I don't connect to the site for a couple of minutes, the problem comes back.

What's interesting about this problem is:

  • it is specific to this particular site (Hackage) — I don't get a similar problem with any other site (and I visit quite a few);
  • it seems to be specific to my ISP — when I connect from other places, there's no such problem;
  • it's not related to DNS or connectivity problems — in fact, the TCP connection is established quickly; it's the HTTP response that takes too long, as can be seen from the following sample packet capture:

      1 0.000000000 192.168.1.101 -> 66.193.37.204 TCP 66 41518 > http [SYN] Seq=0 Win=13600 Len=0 MSS=1360 SACK_PERM=1 WS=16
      2 0.205708000 66.193.37.204 -> 192.168.1.101 TCP 66 http > 41518 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1440 SACK_PERM=1 WS=128
      3 0.205759000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=1 Ack=1 Win=13600 Len=0
      4 0.205846000 192.168.1.101 -> 66.193.37.204 HTTP 158 GET /packages/hackage.html HTTP/1.1 
      5 0.406461000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [ACK] Seq=1 Ack=105 Win=5888 Len=0
      6 28.433860000 66.193.37.204 -> 192.168.1.101 TCP 1494 [TCP segment of a reassembled PDU]
      7 28.433904000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=105 Ack=1441 Win=16480 Len=0
      8 28.434211000 66.193.37.204 -> 192.168.1.101 HTTP 1404 HTTP/1.1 200 OK  (text/html)
      9 28.434228000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=105 Ack=2791 Win=19360 Len=0
     10 28.434437000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [FIN, ACK] Seq=105 Ack=2791 Win=19360 Len=0
     11 28.635146000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [FIN, ACK] Seq=2791 Ack=106 Win=5888 Len=0
     12 28.635191000 192.168.1.101 -> 66.193.37.204 TCP 54 41518 > http [ACK] Seq=106 Ack=2792 Win=19360 Len=0
    

(Packet capture in pcap-ng format.) This capture shows what happens during a simple curl http://hackage.haskell.org/packages/hackage.html.
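To locate the stall in a text capture like the one above, a small script can compute the pause between consecutive packets. This is only a minimal sketch: it assumes tshark-style summary lines whose first two fields are the packet number and the relative timestamp, and the helper names `parse_times` and `largest_gap` are made up for illustration.

```python
def parse_times(capture_text):
    """Return (packet_number, timestamp) pairs from tshark-style summary lines."""
    packets = []
    for line in capture_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            packets.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # not a packet summary line, skip it
    return packets


def largest_gap(capture_text):
    """Return (gap_seconds, packet_before, packet_after) for the longest pause."""
    packets = parse_times(capture_text)
    if len(packets) < 2:
        raise ValueError("need at least two packets")
    gaps = [
        (b_time - a_time, a_no, b_no)
        for (a_no, a_time), (b_no, b_time) in zip(packets, packets[1:])
    ]
    return max(gaps)


# Three lines copied from the capture in the question.
SAMPLE = """\
  4 0.205846000 192.168.1.101 -> 66.193.37.204 HTTP 158 GET /packages/hackage.html HTTP/1.1
  5 0.406461000 66.193.37.204 -> 192.168.1.101 TCP 54 http > 41518 [ACK] Seq=1 Ack=105 Win=5888 Len=0
  6 28.433860000 66.193.37.204 -> 192.168.1.101 TCP 1494 [TCP segment of a reassembled PDU]
"""

if __name__ == "__main__":
    gap, before, after = largest_gap(SAMPLE)
    print(f"longest pause: {gap:.3f}s between packets {before} and {after}")
```

On the sample lines the longest pause is roughly 28 seconds and falls between packets 5 and 6, that is, after the server has already acknowledged the GET. That places the delay on the server side of the HTTP exchange rather than in connection setup.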

It also doesn't matter that I'm behind a router — it's the same when I connect directly. The connection type is PPPoE.

I reproduced the problem on 3 computers that run Linux and Windows.

How can I diagnose such a problem?

  • Hi, I think that you need to use a browser with developer tools enabled to see the HTTP-level dialog rather than the IP-level dialog. We need to see what is causing the delay, and you can only do this by looking at the total set of HTTP interactions for the page. Alternatively, you could use GTmetrix. – Julian Knight Feb 25 '13 at 21:32
  • Running GTmetrix on the site gave me pretty good results, with a couple of significant exceptions that might point you in the right direction. – Julian Knight Feb 25 '13 at 21:34
  • @JulianKnight: there's a link to the full capture file in the question — it has all the information – Roman Cheplyaka Feb 25 '13 at 21:55
  • Your link is a PCAP; I'm referring to something at a much higher level. Please report back using either a browser-based developer analysis or GTmetrix or both. – Julian Knight Feb 25 '13 at 22:58
  • @JulianKnight: that's right, and it contains all the information that in principle exists about that session. What kind of information do you need in addition to that? GTmetrix has a lot of noise, since it is concerned with all the CSS and JS and whatnot, while I'm showing a single retrieval of a bare HTML page (using curl). Also, GTmetrix seems to retrieve the page from its own server (it says "Vancouver, Canada"), so it's not surprising that it doesn't see the problem. Anyway, since you insist, here's the link I got: http://gtmetrix.com/reports/hackage.haskell.org/UWb2TnC5. – Roman Cheplyaka Feb 25 '13 at 23:45
  • Now you have a baseline. Depending on the browser you are using, it will have a similar toolset available - you are looking for HTTP level issues not TCP/IP issues. For example, Chrome has the Developer Tools, turn those on and look at the Network tab and reload the page. Grab the output and post to an image site so we can see what is going on. – Julian Knight Feb 26 '13 at 07:32
  • I'm seeing a whole 3 second delay while the page loads and processes the CSS file for example. This persists even on reload so the CSS seems to be very inefficient. – Julian Knight Feb 26 '13 at 07:34
  • @JulianKnight: let me repeat: CSS is irrelevant here, and we're talking about a 30-second delay for a single HTTP request. – Roman Cheplyaka Feb 26 '13 at 07:38
  • Sorry, without the higher level information, I can't help further. – Julian Knight Feb 26 '13 at 07:41
  • This dump shows your local IP address. Could you repeat the http request while being directly connected (without a NAT box in between) and post the results? – wnrph Feb 27 '13 at 20:43
  • @artistoex: here you go. It's more or less the same. – Roman Cheplyaka Feb 27 '13 at 21:13
  • Could you also post the output of a plain tcpdump? (I'd like to see what the reassembled PDU comprises) – wnrph Feb 27 '13 at 21:47
  • @artistoex: To avoid decoding TCP packets, you can do the following: in Wireshark, Menu->Analyze->Decode As..., then choose the Transport tab and click on "Do not decode" on the left side. Or I can give you tcpdump output, but you'll have to give me the exact command line (and that command line should do filtering based on the ip address — otherwise there will be a lot of noise). – Roman Cheplyaka Feb 27 '13 at 22:28
  • @artistoex: actually, it's possible to use tcpdump to read the pcap file: http://serverfault.com/questions/38626/how-can-i-read-pcap-files-in-a-friendly-format – Roman Cheplyaka Feb 27 '13 at 22:33

4 Answers

5

"30 seconds" and "after two minutes" are a dead ringer for a DNS issue to me.

If we suppose that the page you are connecting to does something like a DNS query on the connecting IP, and that query fails for some reason, you would see:

  • TCP connection almost instantaneous since the server is not doing DNS checks
  • the script runs a DNS query and gets stuck.
  • after 30 seconds the default timeout expires and the script goes on (you are now "Unknown")
  • on subsequent requests, the negative DNS result is still cached, so the lookup step completes in next to no time
  • after the negative-cache timeout expires (RFC 2308), and that is anything between 2 and 5 minutes, a new query is issued on the next connect, and the story repeats.

...and these are exactly the symptoms you are describing.
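The sequence above can be modelled in a few lines. This is a toy sketch of the hypothesised server behaviour, not anything known about Hackage: the class name `ReverseLookupCache`, the 120-second negative TTL, and the injected `resolve` and `clock` callables are all assumptions made for illustration.

```python
import time


class ReverseLookupCache:
    """Toy model of a server-side reverse lookup with a negative cache (assumed)."""

    def __init__(self, resolve, negative_ttl=120.0, clock=time.monotonic):
        self._resolve = resolve            # callable: ip -> hostname, raises OSError on failure
        self._negative_ttl = negative_ttl  # how long a failed lookup is remembered, in seconds
        self._clock = clock
        self._failed_at = {}               # ip -> time at which the last lookup failed

    def lookup(self, ip):
        failed = self._failed_at.get(ip)
        if failed is not None and self._clock() - failed < self._negative_ttl:
            return "Unknown"               # cached failure: answer immediately
        try:
            name = self._resolve(ip)       # may block until the resolver times out
        except OSError:
            self._failed_at[ip] = self._clock()
            return "Unknown"
        self._failed_at.pop(ip, None)
        return name


if __name__ == "__main__":
    fake_now = [0.0]

    def broken_resolver(ip):
        fake_now[0] += 30.0                # pretend the query hung for 30 seconds
        raise OSError("lookup timed out")

    cache = ReverseLookupCache(broken_resolver, clock=lambda: fake_now[0])
    for pause in (0.0, 10.0, 200.0):
        fake_now[0] += pause
        before = fake_now[0]
        cache.lookup("192.0.2.1")
        print(f"request took {fake_now[0] - before:.0f}s")
```

With a resolver that always times out, the first request is slow, a request shortly afterwards is fast, and a request after the negative TTL has passed is slow again, which matches the pattern described in the question.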

You could try running a reverse DNS query from another ISP (say, ISP2) on the IP address you get from ISP1. This is not conclusive proof, but I expect a high likelihood that the query will take 30 seconds to complete. That would mean that ISP1's DNS server is having problems answering queries from the outside.
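One way to time such a query is sketched below. It uses only the Python standard library; `time_reverse_lookup` and `reverse_name` are hypothetical helper names, and 192.0.2.1 in the usage note is a documentation placeholder for the address your ISP assigns you. Note that `socket.gethostbyaddr` goes through the local system resolver, so the timeout you observe may differ from whatever the remote server experiences.

```python
import ipaddress
import socket
import time


def reverse_name(ip):
    """Return the PTR name that a reverse lookup for `ip` asks about."""
    return ipaddress.ip_address(ip).reverse_pointer


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Run one reverse lookup and return (hostname or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        hostname = resolver(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        hostname = None
    return hostname, time.monotonic() - start
```

Calling `time_reverse_lookup("192.0.2.1")` from a machine on ISP2 returns the resolved name (or `None`) together with how long the lookup took; a result of `None` after a wait of many seconds would be consistent with the reverse zone being unreachable.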

Another possible cause could be ISP1's DNS being firewalled out by Hackage for some (likely mistaken) reason (in my outfit, the reason would be "a trigger-happy netadmin", and I could name names). In that case you would have a much harder time diagnosing, for any tests through ISP2 would return nothing unusual; you'd have to escalate this to Hackage.

LSerni
  • 8,455
  • This looks very plausible! Let me verify it. – Roman Cheplyaka Feb 28 '13 at 10:58
  • For the first cause, I tried going to haskell using an anonymous proxy and it was fast, which might possibly indicate that this cause is unlikely. For the second one, the same pause is then to be expected when accessing haskell from any ISP, so it is also unlikely. DNS might still be the cause, but it might be more complicated to explain. – harrymc Feb 28 '13 at 12:08
  • @harrymc: it's very simple, actually. My ISP's DNS servers that are responsible for reverse DNS are down. So, attempts to do reverse resolving time out. Try this: dig +trace -x 80.90.233.38. I'm 95% sure that this is the cause, just waiting for confirmation that hackage indeed performs reverse DNS lookups. – Roman Cheplyaka Feb 28 '13 at 16:35
0

This sounds like an MTU issue. If you search for "windows setting mtu" you should find a number of guides that show how to test this theory and lower your MTU as appropriate. (If you were using a Linux router I could give you an iptables command to do this dynamically, but I don't "do" Windows.)

davidgo
  • 70,654
  • According to the Wireshark guide, the "TCP segment of a reassembled PDU" doesn't in fact correspond to IP fragmentation but rather just indicates that the response validly contains multiple packets as you would expect from a web page. – Julian Knight Feb 25 '13 at 21:28
  • It doesn't seem to be MTU. I tested this by connecting directly via ethernet and setting mtu to 1000. The problem persisted. – Roman Cheplyaka Feb 25 '13 at 22:05
0

I repeated your packet capture; this is how it looks on my end:

capture image

Indeed, there is a minor, barely noticeable pause while the packet is reassembled, but nowhere near as long as yours. I have also verified all the IP addresses and the HTML, and everything is correct and looks extremely simple and harmless.

In short, there is no reason for this delay, as far as the Internet is concerned. The conclusion is that there is a problem with your ISP.

What you can do to narrow down the possibilities:

  1. Try connecting to another haskell.org package and see if there is a similar delay
  2. Try using another router from your place with several computers using different network adapters
  3. Try to have somebody in your area that uses the same ISP repeat the connection
  4. Try to have somebody in your area that uses another ISP repeat the connection
  5. With this information, if you still have no explanation for this delay, contact the Support of your ISP to ask what's going on.

[EDIT]

I noticed that haskell.org sends an ETag, so that explains why the first access is slow but the next ones are fast: Because for as long as the ETag is valid, the page actually comes from your browser's cache.

The weird part here is why the ISP is not slow when transmitting an ETag request. An explanation might be that for a limited time they satisfy the request from their own cache, rather than going to haskell.org.

harrymc
  • 480,290
  • 1. This is the same for all hackage pages. 2. As I said, I tried this on several computers and with several routers (and without one). 4. The problem doesn't exist if I use another ISP in my area. – Roman Cheplyaka Feb 28 '13 at 09:49
  • Now, an ISP problem indeed looks like the only plausible explanation, but what kind of problem can it be? They probably don't even know that hackage exists, so it can't be intentional. If I tell them, "hey, this one site doesn't work for me (but all the others do)", they won't listen. – Roman Cheplyaka Feb 28 '13 at 09:52
  • I added above an explanation why only the first access is slow. Point 3 still needs an answer before talking to the ISP. Their problem might be related to security software that they employ, being for some reason very slow to check the validity of haskell.org. – harrymc Feb 28 '13 at 11:48
  • Etag is irrelevant, since I use curl for testing. Anyway, the answer about reverse dns is most probably the correct one. – Roman Cheplyaka Feb 28 '13 at 11:52