Tuesday, 1 April 2014

Bulk download from the archive.org cache - wget



--random-wait - randomize the delay between requests (0.5 to 1.5 times the --wait value) to go slower.

-r - recurse into all the subfolders/pages
-p - get all files needed to display the page (images, stylesheets, etc.)
-e robots=off - ignore robots.txt rules
-E - save HTML content with an .html extension
-np - don't recurse up to parent folders
-U "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14" - user agent string (here, Opera 12)
--no-check-certificate - do not check SSL certificates for https connections


 ----------------------archive.org.bat----------------------
REM Random wait, recursive, get all files needed to display the page,
REM ignore robots.txt rules, save HTML content with an .html extension,
REM don't recurse up to parent folders, Opera user agent,
REM do not check SSL certificates (for https).

wget --random-wait -r -p -e robots=off -E -np -U "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14" --no-check-certificate "%~1"

 ---------------------- ---------------------- ----------------------
 Run: archive.org.bat  https://web.archive.org/web/20140307102328/http://www.cachedwebsite.org
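
For Linux or macOS, the batch file above could be sketched as a small shell script. This is only a rough equivalent, not a tested replacement: the file name archive.org.sh, the function name archive_fetch and the DRY_RUN switch are made up for illustration, and --wait=1 is an added assumption (it gives --random-wait a base delay to vary around; the original command does not set it).

```shell
#!/bin/sh
# archive.org.sh - hypothetical Unix counterpart of archive.org.bat.
# Usage: ./archive.org.sh "URL"
# Set DRY_RUN=1 to print the wget arguments instead of downloading.

UA="Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"

archive_fetch() {
    # Refuse to run without a URL.
    if [ -z "$1" ]; then
        echo "usage: archive_fetch URL" >&2
        return 1
    fi

    # Same options as the batch file. --wait=1 is an assumption added here,
    # not part of the original command.
    set -- wget --wait=1 --random-wait -r -p -e robots=off -E -np \
        -U "$UA" --no-check-certificate "$1"

    if [ -n "$DRY_RUN" ]; then
        # Print one argument per line so the quoting is easy to check.
        printf '%s\n' "$@"
    else
        "$@"
    fi
}

# Fetch only when a URL was passed on the command line.
if [ "$#" -gt 0 ]; then
    archive_fetch "$1"
fi
```

Run: DRY_RUN=1 ./archive.org.sh "https://web.archive.org/web/20140307102328/http://www.cachedwebsite.org" to check the command first, then run it again without DRY_RUN to start the download. Quoting the URL keeps characters such as & from being interpreted by the shell.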


NOTE: wget is suggested on archive.org's own blog! There's nothing wrong with using it, as long as you take a bit of care and don't rush.