How to crawl using wget to download ONLY HTML files (ignore images, css, js)

Question

Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. imagery, CSS, JS, etc.). I only want the HTML files.

Google searches are completely useless.

Here's a command I've tried:

wget --limit-rate=200k --no-clobber --convert-links --random-wait -r -E -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36" -A html --domain=www.example.com http://www.example.com

Our site is hybrid flat-PHP and CMS. So, HTML "files" could be /path/to/page, /path/to/page/, /path/to/page.php, or /path/to/page.html.

I've even included -R js,css but it still downloads the files, THEN rejects them (pointless waste of bandwidth, CPU, and server load!).

what's the command you've tried so far? If the naming of files is consistent, you should be able to use the -R flag. Alternatively, you could use the --ignore-tags flag and ignore script and img tags. — ernie, Jan 31 '14 at 17:12
I've tried using --accept=html, but it downloads CSS files THEN deletes them. I want to prevent them from ever downloading. A headers request is fine, though -- E.g. I notice Length: 558 [text/css] on the files I don't want. If I could stop the request if the header doesn't return text/html, I'd be elated. — Nathan J.B., Jan 31 '14 at 17:36

score 24 · Accepted Answer · answered Jan 31 '14 at 18:00

@ernie's comment about --ignore-tags led me down the right path! When I looked up --ignore-tags in the man page, I noticed --follow-tags.

Setting --follow-tags=a allowed me to skip img, link, script, etc.
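For anyone who wants to see that option in context, here is a minimal sketch of a full invocation, wrapped in a shell function. The function name crawl_follow_a and its two arguments are made up for illustration; the flags are carried over from the command in the question, except that -A html and the User-Agent string are left out for brevity and --wait=1 is my own addition (an assumption, since --random-wait only varies a base delay set by --wait).

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): crawl one site,
# following only links that appear in <a> tags.
crawl_follow_a() {
  url="$1"      # start URL, e.g. http://www.example.com
  domain="$2"   # host to stay on, e.g. www.example.com
  # --follow-tags=a    : only <a> tags are considered when looking for links
  # --adjust-extension : long form of -E from the question
  # --wait=1           : assumption; base delay that --random-wait varies
  wget --recursive \
       --follow-tags=a \
       --adjust-extension \
       --convert-links \
       --limit-rate=200k \
       --wait=1 \
       --random-wait \
       --domains="$domain" \
       -e robots=off \
       "$url"
}

# Usage (performs real network requests):
#   crawl_follow_a http://www.example.com www.example.com
```

Because img, link, and script tags are never followed, their targets are not requested at all, which is what avoids the download-then-delete behaviour in my case.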

It's probably too limited for some people looking for the same answer, but it actually works well in my case (it's okay if I miss a couple pages).

If anyone finds a way to scan ALL tags while rejecting unwanted files before they are downloaded (based on the filename or the Content-Type header) rather than after, I will very happily accept their answer!
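To illustrate the kind of header check I mean, here is a rough sketch for a single URL. It is not integrated into wget's recursive crawl, so it does not solve the problem above; it only shows that the Content-Type can be read from a headers-only request using --spider and --server-response. The function names head_content_type and is_html_content_type are hypothetical, and treating application/xhtml+xml as HTML is my assumption.

```shell
#!/bin/sh
# Hypothetical helpers (illustration only, single URL, no recursion).

# Print the Content-Type of the last response wget saw for URL.
# --spider avoids saving the body; --server-response prints the
# response headers on stderr, so stderr is merged into the pipe.
head_content_type() {
  wget --spider --server-response "$1" 2>&1 \
    | awk 'tolower($1) == "content-type:" { $1 = ""; sub(/^ +/, ""); ct = $0 }
           END { print ct }'
}

# Succeed when a Content-Type value names an HTML type.
is_html_content_type() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    text/html*|application/xhtml+xml*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (performs a real network request):
#   if is_html_content_type "$(head_content_type http://www.example.com/)"; then
#     echo "looks like HTML"
#   fi
```

When a URL redirects, several header blocks are printed; the awk script keeps the last Content-Type it sees, which should belong to the final response.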

--follow-tags=a might not be strict enough because it will still permit downloading other file types (e.g., images) if they are linked in <a> tags. — Quinn Comendant, Mar 30 '23 at 22:32

score 13 · Answer 2 · Spir · 2017-04-11T16:07:41.877

What about adding these options?

--reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso'
--ignore-tags=img,link,script 
--header="Accept: text/html"
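Put together with a recursive crawl, those options might look like the sketch below. The function name crawl_with_rejects is hypothetical, and --recursive and --adjust-extension are taken from the -r and -E flags in the question rather than from this answer. Note that the Accept header is only a hint to the server, which is free to ignore it.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): recursive crawl that skips
# asset tags and rejects common non-HTML file names.
crawl_with_rejects() {
  url="$1"   # start URL, e.g. http://www.example.com
  # --reject      : comma-separated file name patterns to skip
  # --ignore-tags : do not follow links found in these tags
  # --header      : ask the server for HTML; it may not honour this
  wget --recursive \
       --adjust-extension \
       --reject '*.js,*.css,*.ico,*.txt,*.gif,*.jpg,*.jpeg,*.png,*.mp3,*.pdf,*.tgz,*.flv,*.avi,*.mpeg,*.iso' \
       --ignore-tags=img,link,script \
       --header="Accept: text/html" \
       "$url"
}

# Usage (performs real network requests):
#   crawl_with_rejects http://www.example.com
```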

edited Apr 11 '17 at 16:07

answered Apr 11 '17 at 15:44


I like to use both --follow-tags=a with --reject. – Quinn Comendant Mar 31 '23 at 00:52
