
I've been working on a task to find threats in a social network called Skyrock. As part of it, I've retrieved a significant number of URLs by crawling various user profiles and parsing the data that users share publicly on the network. Now I want to check whether any of these URLs is malicious.

I know there are a lot of online URL scanners available to check a link for malice, but I don't want to use any of those. Instead, I want to use information which I am able to retrieve about a URL via some third-party API, for example: location (lat, long), URL redirects, redirect count, DNS entries, etc. Can any of these properties be used to check if a particular URL is malicious? What other information about a link do I need to decide whether it's malicious?
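To make the kind of features I mean concrete, here is a minimal sketch of extracting the redirect chain and DNS entries for a URL in Python, using the `requests` library and the standard `socket` module (the feature names and the bare-bones error handling are my own illustration, not a scoring model):

```python
import socket
from urllib.parse import urlparse

import requests  # third-party: pip install requests


def extract_url_features(url, timeout=5):
    """Collect a few reputation-relevant properties of a URL."""
    features = {}

    # Follow the redirect chain and record its hops and length.
    resp = requests.get(url, allow_redirects=True, timeout=timeout)
    features["redirect_count"] = len(resp.history)
    features["redirect_chain"] = [r.url for r in resp.history] + [resp.url]
    features["final_status"] = resp.status_code

    # Resolve the final hostname to its A records (a crude stand-in
    # for the "DNS entries" mentioned above).
    host = urlparse(resp.url).hostname
    try:
        _, _, addresses = socket.gethostbyname_ex(host)
        features["ip_addresses"] = addresses
    except socket.gaierror:
        features["ip_addresses"] = []  # failed resolution is itself a weak signal

    return features


print(extract_url_features("http://example.com"))
```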

Note: A malicious link, in this case, could be a link to a phishing site, a spam link, or a link that makes you download malware. I just want to know which properties of a link are useful in figuring out whether it falls into one of these categories.

Rahil Arora

1 Answer


Online URL scanners typically use three methods to check whether a URL is malicious:

  1. Blacklists of known bad URLs. Someone reported that very URL as malicious and entered it into a database. These cases are trivial for the scanner to check, but keeping the database up to date requires constant effort. Considering that many malware-serving websites exist for only hours before they get removed by the hosting provider or get "un-hacked" by their real admin, such entries become stale very quickly (a minimal lookup sketch follows this list).
  2. Known malware samples. The scanner downloads the linked website and searches it for signature strings from a database of known web-based exploits. The half-life of such database entries is much longer than that of URLs, because some widely-used stock exploits are deployed over and over again on thousands of URLs. This method does not help against zero-day exploits, and it can be fooled by self-written exploits for known vulnerabilities or by running stock exploits through an obfuscator.
  3. Heuristics. By analyzing the HTML code and executing dynamic content in a sandbox, the scanner can find suspicious behavior and report it. Just as with normal virus scanners, the potential for false positives is high.
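As a minimal sketch of the blacklist lookup in method 1 (the blacklist entries and the normalization scheme here are illustrative assumptions, not a real threat feed):

```python
from urllib.parse import urlsplit

# Hypothetical local blacklist; in practice this would be fed from a
# threat-intelligence source and refreshed continuously.
BLACKLIST = {
    "evil.example/login",
    "malware.example/dl.exe",
}


def normalize(url):
    """Reduce a URL to a host/path key (a deliberately simplified scheme)."""
    parts = urlsplit(url.lower())
    return (parts.hostname or "") + parts.path.rstrip("/")


def is_blacklisted(url):
    return normalize(url) in BLACKLIST


print(is_blacklisted("https://EVIL.example/login/"))  # True
print(is_blacklisted("https://benign.example/home"))  # False
```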

Note that methods 2 and 3 aren't foolproof. The malicious website could detect that it is being fetched by the URL scanner and serve different content than it would serve to a regular visitor.
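A naive way to probe for that kind of cloaking is to fetch the same URL under different client identities and compare the responses. A sketch (the persona headers are assumptions, and dynamic pages will trigger false positives, so treat a mismatch as a hint, not proof):

```python
import hashlib

import requests  # third-party: pip install requests

# Two personas: a browser-like visitor and an undisguised scanner.
# The header values are illustrative, not taken from any real scanner.
PERSONAS = {
    "browser": {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    "scanner": {"User-Agent": "url-scanner-bot/1.0"},
}


def looks_cloaked(url, timeout=5):
    """Fetch the URL under both personas and compare the response bodies."""
    digests = {}
    for name, headers in PERSONAS.items():
        body = requests.get(url, headers=headers, timeout=timeout).content
        digests[name] = hashlib.sha256(body).hexdigest()
    return digests["browser"] != digests["scanner"]


print(looks_cloaked("http://example.com"))
```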

Philipp