I don't see this as a "legal" question; it's more a "moral" one: is it acceptable, from a security perspective, to "crawl" a website that advertises no disallow rules for bots, or would the community consider it "hacking"? (This is regardless of whether crawling websites without permission is allowed by law or not.)
I would say: it depends.
Crawlers can exist for a "good" or a "bad" purpose, and "good" crawlers can in turn be "well-behaved" or "badly behaved".
This gives us three types of crawlers:
"Bad" crawlers, those are Always considered to be "bad behaving".
"Good" crawlers, who are "bad behaving".
"Good" crawlers, who are "good behaving".
Whether a crawler is "good" or "bad" in general depends on its purpose.
That is, your intent. If the crawler's intent is to "leech", to act as a "parasite", to collect data from a website in order to summarize data from multiple websites on your own site, or, even worse, to harvest email addresses or URLs for other uses, I would say it's a "bad" crawler. Then robots.txt won't matter.
The same goes if you crawl for security holes (for your own pleasure) or for the purpose of offline viewing. Then you should always ask for permission before crawling.
If you instead do something good, normally a service to the public, I would say it's a "good" crawler. Let's say you build a special search engine for certain file types, a search engine that lets a user do a real-time local search of a single website (similar to site: in Google), or a service aimed at webmasters.
For the webmaster-oriented kind, let's say you run an online service that tests the security of a website, or a "link checker" crawler that checks all pages for dead links.
In the first case (the special search engine), I would say following the robots.txt protocol is a good thing.
In the second case, following robots.txt is still a good thing, with one small exception: you should disregard any user-agent: * records and explicitly require the webmaster to give permission to your bot, like this:
User-agent: LinkChecker
Disallow:
robots.txt is an excellent way to ensure that a webmaster has given you permission before you do any crawling that should be reserved for webmasters only.
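To make the two cases concrete, here is a minimal Python sketch (the bot name "LinkChecker" is taken from the example above, and the site URL is just a placeholder): may_fetch() is the ordinary robots.txt compliance you would use in the first case, while has_explicit_permission() returns True only when the webmaster has named the bot explicitly, ignoring any user-agent: * record, as in the second case.

import urllib.request
import urllib.robotparser

SITE = "https://example.com"   # placeholder site
BOT_NAME = "LinkChecker"       # hypothetical user-agent token

def may_fetch(url):
    # First case: ordinary robots.txt compliance, honouring "*" records too.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()
    return rp.can_fetch(BOT_NAME, url)

def has_explicit_permission():
    # Second case: only crawl if robots.txt names this bot explicitly.
    # A "user-agent: *" record is deliberately ignored, so the webmaster
    # has to opt in to this particular crawler.
    try:
        with urllib.request.urlopen(SITE + "/robots.txt") as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False   # no robots.txt at all means no explicit permission
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip().lower()
            if agent == BOT_NAME.lower():
                return True
    return False

if has_explicit_permission():
    print("Webmaster opted in; the link checker may crawl.")
else:
    print("No explicit record for this bot; do not crawl.")

Treating a missing robots.txt as "no permission" is the conservative choice for a webmaster-only tool; a public search engine would normally do the opposite.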