
If a large company's website has a robots.txt file with no Disallow section, does this mean I am free to write code to crawl their website?

The website in question is basically a data warehouse for the type of information I need, information which is updated on a minute-by-minute basis (so I'm going to be polling), and their robots.txt file looks like this:

User-agent: *

They are a global company, so I assume they know how a robots.txt file works. Does this mean I can crawl away, or should I contact them first?

I'm not asking from a legal perspective, but more from the point of view of a developer/security expert intentionally writing the above robots.txt file: if you do this, are you essentially saying crawling is OK?

JMK

  • This question appears to be off-topic because it is requesting legal advice, which not only may vary from jurisdiction to jurisdiction but also from case to case, and so should be obtained from a qualified legal practitioner in the appropriate jurisdiction rather than from the Internet, where the well-meaning and logical opinions you receive on the matter may leave you more ill-advised than if you hadn't asked at all. – Xander Oct 17 '14 at 21:55
  • Is this a legal advice question? I was hoping to phrase it more from the point of view of a developer/security expert intentionally writing the above robots.txt file: if you do this, are you essentially saying crawling is OK? – JMK Oct 17 '14 at 21:57
  • It's about interpreting the legality of something. That's what lawyers do. Bless them LOL – TildalWave Oct 17 '14 at 23:25
  • @JMK It's absolutely a legal question. The robots.txt file has nothing to do with whether crawling is OK or not... That is determined by the site owner (not the developer) and the laws of the jurisdiction under which they might pursue you if they decide they're not OK with it. – Xander Oct 17 '14 at 23:57

4 Answers

7

A robots.txt file does NOT imply any legal permission one way or another. Its only purpose is to limit what gets crawled, and only for crawlers that choose to respect the file's contents.
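A minimal sketch of that point, using Python's standard-library `urllib.robotparser` (the crawler name "MyCrawler", the `polite` helper, and the rules are made up for illustration): a well-behaved crawler asks the parser before each request, but nothing server-side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

def polite(user_agent: str, url: str, robots_txt: str) -> bool:
    """A well-behaved crawler consults robots.txt before each request.
    Nothing enforces this: the file is advisory, not an access control."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# A crawler that never makes this call can still fetch the page;
# respecting the answer is entirely the crawler's choice.
print(polite("MyCrawler", "https://example.com/private/x",
             "User-agent: *\nDisallow: /private/"))  # False
```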

schroeder
3

An empty or missing robots.txt file means that you are free to crawl their entire site - I would extend that rule to files that contain text, but no actual content. Remember that this isn't a long-term grant of permission - if the site owner puts up a valid robots.txt at some later date, your code should detect it and begin respecting it fairly quickly.

From robotstxt.org:

To allow all robots complete access

User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)
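For what it's worth, Python's standard-library `urllib.robotparser` treats both forms quoted above the same way (the crawler name "MyCrawler" below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The exact "allow all" file quoted above.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])
print(rp.can_fetch("MyCrawler", "https://example.com/any/page"))  # True

# An empty robots.txt parses to the same result: nothing is disallowed.
rp_empty = RobotFileParser()
rp_empty.parse([])
print(rp_empty.can_fetch("MyCrawler", "https://example.com/any/page"))  # True
```

To follow the advice about a robots.txt appearing later, the crawler can simply re-fetch and re-parse the file on a schedule (say, hourly), so a newly added file starts being honored quickly.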

Brilliand
  • robots.txt actually doesn't support explicit allows, merely explicit disallows, so I disagree that it has any bearing on the legality of doing something. It can be considered a way of disallowing specific crawlers, but each site can still choose to expect any type of visitor to respect other notices (such as e.g. TOS, copyright, ...) in text or hypertext. So all you can say by not being excluded in robots.txt is that the site owners didn't explicitly disallow your bot from crawling their site, not that you're automatically granted the right to do so. Consider e.g. a crawler's name change. ;) – TildalWave Oct 17 '14 at 23:37
  • Completely wrong. If the ToS disallows crawling, the robots.txt file is irrelevant. – Xander Oct 17 '14 at 23:59
  • It's also grossly misinterpreting how robots.txt is meant to be parsed. The two lines included in the example don't mean your crawler is allowed at all. You need to parse all of it, top down and left to right, as an incremental set of restrictions. Meaning, if the next two lines explicitly disallow your crawler, the first two that allowed it don't matter. In fact, the example two lines are by all accounts obsolete, and you'd only have them in your robots.txt file because you probably copied them from some template or they were included as such in some boilerplate. – TildalWave Oct 18 '14 at 00:14
  • Anyway, that's Wikipedia for you. Instead, rather read carefully through About /robots.txt. You'll notice it actually says exactly the opposite of what Wikipedia does. Funnily enough, even Wikipedia's own robots.txt disagrees with how it is interpreted on its own Robots exclusion standard page. – TildalWave Oct 18 '14 at 00:27
  • @TildalWave The "About /robots.txt" you link to says almost exactly the same thing under "To allow all robots complete access". I don't understand why you describe that as "exactly the opposite". – Brilliand Oct 18 '14 at 08:33
  • @TildalWave I've replaced the Wikipedia quote with a robotstxt.org quote - I see that removed a statement that wasn't really relevant to my answer anyway (all I really care about is the last line - that an empty or missing file is a global allow). After reading the w3 specification, though, I think Wikipedia is actually correct - that set of two lines does precisely mean "all robots may crawl all pages on this site", and if any robots are specified individually, that would mean "except you". – Brilliand Oct 18 '14 at 08:55
  • The difference is that Wikipedia talks about permissions. There's no such thing in robots.txt; that's my point and why Wikipedia is plain wrong. I guess its authors there have the same problems differentiating between permissions and restrictions. Not sure why - it's actually in its title: Robots Exclusion Standard. Nothing about it gives anyone any permission. Wikipedia's interpretation is like saying it's not rape if someone didn't explicitly mention you individually by name that you can't have sex with them. It can be. You didn't get consent to be sure that it isn't. – TildalWave Oct 18 '14 at 11:59
  • @Xander Do you have a reference for your claim that ToS supersedes robots.txt? I think you are wrong, as the crawler never agreed to the ToS, but I don't have a reference for this. – paj28 Oct 18 '14 at 16:01
  • @paj28 Yes I do. However, as you and I are not lawyers, neither one of our opinions matters, so I have no interest in a continued unproductive discussion in the comments. – Xander Oct 18 '14 at 18:43
  • A crawler really cannot be expected to read a ToS, because it's written in a human language. I think that makes the ToS irrelevant to this discussion. – Brilliand Jan 20 '15 at 21:39
0

Ethically, you should not crawl what they request not be crawled, but some spiders ignore this file and crawl everything they can.

Legally, I do not know if there is any implication, because it may depend on each country's laws.

I would stick to the ethical side: if someone asks me not to crawl, I will respect that.

If they say nothing, you can crawl at will, because it is the same thing any search engine spider will do.

Hugo
-1

I don't see this as a "legal" question. It's more a "moral" question: is it acceptable, from a security perspective, to crawl a website that advertises no disallow to bots, or would it commonly be considered "hacking"? (This is regardless of whether or not it is allowed by law to crawl websites without permission.)

I would say - it depends. Crawlers can come in for a "good" or a "bad" purpose, and "good" crawlers can further be either well behaved or badly behaved.

This gives us 3 types of crawlers:

"Bad" crawlers, those are Always considered to be "bad behaving".

"Good" crawlers, who are "bad behaving".

"Good" crawlers, who are "good behaving".


Whether a crawler is "good" or "bad" in general depends on its purpose, i.e. your intent. If the crawler intends to "leech", "parasite", or collect data from the website for the purpose of summarizing data from multiple websites onto your own website, or even worse, to collect email addresses or URLs from a website for other uses, I would say it's a "bad" crawler. Then robots.txt won't matter. The same applies if you crawl for security holes (for your own pleasure) or for the purpose of offline viewing. Then you should always ask for permission before crawling.

If you instead do a good thing, normally a service to the public, then I would say it's a "good" crawler. Let's say you build a special search engine for certain file types, a search engine that allows a user to do a local search on only one website in real time (similar to site: in Google), or a service aimed at webmasters, such as an online service to test the security of a website or a "link checker" crawler that checks for dead links on all pages.

In the first case (a special search engine), I would say following the robots.txt protocol is a good thing. In the second case (services aimed at webmasters), I would say following robots.txt is a good thing with a small exception: you should disregard any User-agent: * lines and explicitly require the webmaster to give permission to your bot, like so:

user-agent: LinkChecker
disallow: 

robots.txt is an excellent way to ensure the webmaster gives you permission before you do any crawling that should be limited to webmasters only.
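This opt-in idea can be sketched as a small check that deliberately ignores `User-agent: *` and treats the bot as permitted only when the webmaster names it explicitly. The bot name "LinkChecker" and the `explicitly_permitted` helper are made up for illustration, and a real crawler would still need to process that group's Disallow rules afterwards:

```python
def explicitly_permitted(robots_txt: str, bot_name: str) -> bool:
    """Return True only if a 'User-agent:' line names this bot exactly.
    'User-agent: *' is deliberately not honored: the webmaster must opt in."""
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()          # drop comments/whitespace
        field, _, value = line.partition(":")
        if (field.strip().lower() == "user-agent"
                and value.strip().lower() == bot_name.lower()):
            return True
    return False

print(explicitly_permitted("User-agent: LinkChecker\nDisallow:\n", "LinkChecker"))  # True
print(explicitly_permitted("User-agent: *\nDisallow:\n", "LinkChecker"))            # False
```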

sebastian nielsen