How to download a website from the archive.org Wayback Machine?

Question

I want to get all the files for a given website at archive.org. Reasons might include:

the original author did not archived his own website and it is now offline, I want to make a public cache from it
I am the original author of some website and lost some content. I want to recover it
...

How do I do that ?

Taking into consideration that the archive.org wayback machine is very special: webpage links are not pointing to the archive itself, but to a web page that might no longer be there. JavaScript is used client-side to update the links, but a trick like a recursive wget won't work.

I've came accross the same issue and I've coded a gem. To install: gem install wayback_machine_downloader. Run wayback_machine_downloader with the base url of the website you want to retrieve as a parameter: wayback_machine_downloader http://example.comMore information: https://github.com/hartator/wayback_machine_downloader — Hartator, Aug 10 '15 at 06:32
A step by step help for windows users (win8.1 64bit for me) new to Ruby, here is what I did to make it works : 1) I installed http://rubyinstaller.org/downloads/ then run the "rubyinstaller-2.2.3-x64.exe" 2) downloaded the zip file https://github.com/hartator/wayback-machine-downloader/archive/master.zip 3) unzip the zip in my computer 4) search in windows start menu for "Start command prompt with Ruby" (to be continued) — Erb, Oct 02 '15 at 07:40
Answers to https://superuser.com/questions/14403/how-can-i-download-an-entire-website also apply to this question — David d C e Freitas, Feb 05 '22 at 16:15
a few more options are described here - https://superuser.com/questions/772347/download-website-from-the-wayback-machine — xypha, Feb 09 '23 at 02:45

score 126 · Answer 1 · edited Apr 05 '22 at 06:04

126

I tried different ways to download a site and finally I found the wayback machine downloader - which was built by Hartator (so all credits go to him, please), but I simply did not notice his comment to the question. To save you time, I decided to add the wayback_machine_downloader gem as a separate answer here.

The site at http://www.archiveteam.org/index.php?title=Restoring lists these ways to download from archive.org:

Wayback Machine Downloader, small tool in Ruby to download any website from the Wayback Machine. Free and open-source. My choice!
Warrick - Main site seems down.
Wayback downloaders - a service that will download your site from the Wayback Machine and even add a plugin for WordPress. Not free.

edited Apr 05 '22 at 06:04

Muhammad Usama

3
3

answered Aug 14 '15 at 18:19

Tobias Reithmeier

1,398

2

i also wrote a "wayback downloader", in php, downloading the resources, adjusting links, etc: https://gist.github.com/divinity76/85c01de416c541578342580997fa6acf – hanshenrik Oct 18 '17 at 18:08
@ComicSans, On the page you've linked, what is an Archive Team grab?? – Pacerier Mar 15 '18 at 14:17
1

@Pacerier it means (sets of) WARC files produced by Archive Team (and usually fed into Internet Archive's wayback machine), see http://archive.org/details/archiveteam – Nemo Jan 20 '19 at 14:47
Posting for recent users: Wayback Machine Downloader only downloads .HTML files and not anything else. – non_linear Oct 09 '23 at 01:48
1

As a note: archive.org has added a rate limit which breaks most, if not all solutions to this post. There are 2 pull requests to fix wayback_machine_downloader but there has been no work on that repo from the maintainer in around a year or so. For the shell script solutions, you must add at least a 4 second delay between consecutive requests to avoid getting rate limited and your connections refused. – sww1235 Nov 17 '23 at 02:49

score 32 · Answer 2 · edited Jul 10 '16 at 04:39

32

This can be done using a bash shell script combined with wget.

The idea is to use some of the URL features of the wayback machine:

http://web.archive.org/web/*/http://domain/* will list all saved pages from http://domain/ recursively. It can be used to construct an index of pages to download and avoid heuristics to detect links in webpages. For each link, there is also the date of the first version and the last version.
http://web.archive.org/web/YYYYMMDDhhmmss*/http://domain/page will list all version of http://domain/page for year YYYY. Within that page, specific links to versions can be found (with exact timestamp)
http://web.archive.org/web/YYYYMMDDhhmmssid_/http://domain/page will return the unmodified page http://domain/page at the given timestamp. Notice the id_ token.

These are the basics to build a script to download everything from a given domain.

edited Jul 10 '16 at 04:39

bb216b3acfd8f72cbc8f899d4d6963

127

answered Oct 20 '14 at 10:16

user36520

2,975
3
22
19

9

You should really use the API instead https://archive.org/help/wayback_api.php Wikipedia help pages are for editors, not for the general public. So that page is focused on the graphical interface, which is both superseded and inadequate for this task. – Nemo Jan 21 '15 at 22:41
2

It'd probably be easier to just say take the URL (like http://web.archive.org/web/19981202230410/http://www.google.com/) and add id_ to the end of the "date numbers". Then, you would get something like http://web.archive.org/web/19981202230410id_/http://www.google.com/. – bb216b3acfd8f72cbc8f899d4d6963 Jul 09 '16 at 21:57
1

A python script can also be found here: https://gist.github.com/ingamedeo/50df5def5bce7edfbd4c6b71aa385328#file-webarchive-py – Amedeo Jun 22 '18 at 20:24
@haykam images on page seem to be broken – Nakilon Aug 22 '20 at 03:58
@haykam, it works on the google.com example but does not work for the specific page I'm trying to save. Either because the page had absolute links in it or because webarchive changed the way it works. All the css/js/png/etc. links point to the dead website instead of archived copies. – Nakilon Aug 22 '20 at 04:27

score 16 · Answer 3 · answered Jul 21 '19 at 18:56

16

You can do this easily with wget.

wget -rc --accept-regex '.*ROOT.*' START

Where ROOT is the root URL of the website and START is the starting URL. For example:

wget -rc --accept-regex '.*http://www.math.niu.edu/~rusin/known-math/.*' http://web.archive.org/web/20150415082949fw_/http://www.math.niu.edu/~rusin/known-math/

Note that you should bypass the Web archive's wrapping frame for START URL. In most browsers, you can right-click on the page and select "Show Only This Frame".

answered Jul 21 '19 at 18:56

jcoffland

379
3
8

1

This was greatly useful and super simple! Thanks! I noticed that even though the START URL was a specific Wayback version, it pulled every date of the archive. This may be circumvented by adjusting the ROOT URL, however. – Neil Monroe Mar 31 '20 at 15:32
Update to my previous comment: The resources in the site may be spread across various archive dates, so the command did not pull all the versions of the archive. You will need to then merge these back into a single folder and clean up the HTML. – Neil Monroe Mar 31 '20 at 16:43
3

this actually worked for me, although I removed the --accept-regex part, otherwise not the whole page was downloaded – FurloSK Apr 09 '21 at 07:58

score 5 · Answer 4 · answered Jan 21 '15 at 22:38

5

There is a tool specifically designed for this purpose, Warrick: https://code.google.com/p/warrick/

It's based on the Memento protocol.

answered Jan 21 '15 at 22:38

Nemo

1,134

5

As far as I managed to use this (in May 2017), it just recovers what archive.is holds, and pretty much ignores what is at archive.org; it also tries to get documents and images from the Google/Yahoo caches but utterly fails. Warrick has been cloned several times on GitHub since Google Code shut down, maybe there are some better versions there. – Gwyneth Llewelyn May 31 '17 at 16:41

score 3 · Answer 5 · answered Mar 31 '23 at 08:30

3

Wayback Machine Downloader works great. Grabbed the pages - .html, .js, .css and all the assets. Simply ran the index.html locally.

With Ruby installed, simply do:

gem install wayback_machine_downloader
wayback_machine_downloader http://example.com -c 5 # -c 5 adds concurrency for much faster downloads

If your connection fails half-way through a large download, you can even run it again and it will regrab any missing pages

answered Mar 31 '23 at 08:30

Dr-Bracket

144
1
1
9

This has been said before. – Toto Mar 31 '23 at 08:39
2

@Toto Sure, but this is a recent answer that has some explanation on how to use the command. – Geremia Jun 08 '23 at 22:53

score 1 · Answer 6 · answered Jan 15 '22 at 17:59

I was able to do this using Windows Powershell.

go to wayback machine and type your domain
click URLS
copy/paste all the urls into a text file (like VS CODE). you might repeat this because wayback only shows 50 at a time
using search and replace in VS CODE change all the lines to look like this

Invoke-RestMethod -uri "https://web.archive.org/web/20200918112956id_/http://example.com/images/foobar.jpg" -outfile "images/foobar.jpg"

using REGEX search/repl is helpful, for instance change pattern example.com/(.*) to example.com/$1" -outfile "$1"

The number 20200918112956 is DateTime. It doesn't matter very much what you put here, because WayBack will automatically redirect to a valid entry.

Save the text file as GETIT.ps1 in a directory like c:\stuff
create all the directories you need such as c:\stuff\images
open powershell, cd c:\stuff and execute the script.
you might need to disable security, see link

How to download a website from the archive.org Wayback Machine?

6 Answers6

Linked