
I have an ecommerce site whose links I'd like to check. There are a few factors that make the check complex:

  • filter navigation
  • a large main menu included in every page (>500 links)

So far, every checker I have tried couldn't cope with the scope of that check (out of memory, insanely slow, you name it). I've now tried linkchecker, and so far it works. However, the limiting factor seems to be the CPU.

The machine I'm currently using is a vServer, 8 cores, 12GB RAM, 64bit Ubuntu 14.04:

  • Python 2.7.6 (default, Jun 22 2015, 17:58:13)
  • [GCC 4.8.2] on linux2

I'm renting the machine just so I can linkcheck our site.

However, it seems to me that linkchecker doesn't utilize the other cores. The machine is permanently hovering between 101% and 104% CPU (in top). I understand that extracting the links is CPU intensive, and with the default of 100 threads/page checks running in parallel, it seems to me that spreading this work across multiple cores would be a very good thing.

I'm currently at about 50k links to check, with only 800 done. I'd think the whole process could benefit a lot from using multiple cores. So my question is: why is it not using all CPU cores?

Edit: Added program and OS versions

Dabu
  • Consider using a managed service like import.io for crawling and link checking. You can extract all sorts of info from it, like the HTTP status code etc. – JayMcTee Apr 19 '16 at 13:14

1 Answer


Currently, CPython is severely limited in its multithreading capability due to the famous Global Interpreter Lock (GIL).

In short, due to it, most of the time CPython will run only a single thread from the available (runnable) ones. This, in turn, explains your observed behavior.
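
To illustrate the effect (a generic sketch, not linkchecker's own code): the snippet below runs a purely CPU-bound function in 8 threads, yet under CPython top will show it pegged at roughly 100% of one core rather than 800%.

    import threading
    import time

    def burn():
        # purely CPU-bound work, no I/O that would release the GIL
        n = 0
        for i in range(5000000):
            n += i

    start = time.time()
    threads = [threading.Thread(target=burn) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Under CPython this takes about as long as running burn() 8 times
    # serially; on a GIL-free implementation the threads could overlap.
    print("8 threads took %.1fs" % (time.time() - start))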

No simple solution exists. Anyway, you can try experimenting with other Python implementations (e.g. Jython).
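
If you can split the work yourself, the multiprocessing module is another way around the GIL, since each worker is a separate process with its own interpreter and its own GIL. This is only a hand-rolled sketch with hypothetical URLs and a placeholder worker function, not something linkchecker does for you:

    import multiprocessing

    def check_chunk(urls):
        # placeholder for the real per-URL work (fetch, parse, extract links)
        return [(url, len(url)) for url in urls]

    if __name__ == "__main__":
        # hypothetical input: the URL list split into one chunk per worker
        chunks = [["http://example.com/a"], ["http://example.com/b"]]
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        results = pool.map(check_chunk, chunks)
        pool.close()
        pool.join()
        print(results)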

shodanshok