Get the list of URLs
There is an easy way to do this: fetch the site's sitemap.xml, which lists the pages. Here's a script that will download the sitemap, parse it, and output the list of URLs to a file.
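A minimal sketch of such a script in shell, assuming curl is available (the sitemap URL below is a placeholder):

```shell
# extract_urls: read sitemap XML on stdin, print one URL per line.
# Sitemap entries look like <loc>https://example.com/page</loc>.
extract_urls() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's|</loc>||'
}

# Fetch the sitemap and save the URL list for wget to consume later:
# curl -s 'https://example.com/sitemap.xml' | extract_urls > URLS.txt
```

The naive grep works because `<loc>` values in a sitemap can't contain a raw `<`; a stricter version would use a real XML parser.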
Download the URLs
The best way to do this is wget. The nice thing about wget is that it keeps the connection open between requests, so you probably won't get 403'd into oblivion. Here's what I use to download the list:
wget --no-clobber --content-disposition --trust-server-names -i URLS.txt
You can run this command multiple times; --no-clobber keeps wget from re-downloading or overwriting files it already has. The --content-disposition --trust-server-names pair names each file based on what the server reports (the Content-Disposition header, or the final URL after redirects) rather than the raw request URL.
Compress the files
Since I don’t like wasting space on my drive, I compress the downloaded files. I found that LZMA compression is really fast and gives high ratios, and it’s built into Python 3 (the lzma module), which makes later use a snap. You can do a couple of things to compress them:
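For example (a sketch rather than the exact original commands; it assumes the xz tool for the LZMA step, and runs in a scratch directory for safety):

```shell
# Demo in a scratch directory; in practice run these in your download folder.
cd "$(mktemp -d)"
printf 'page one\n' > a.html
printf 'page two\n' > b.html

# Option 1: compress each file in turn with LZMA (xz).
xz -9 -- *.html

# Decompress again so option 2 has something to work on.
xz -d -- *.xz

# Option 2: the same thing in parallel, up to 8 xz processes at once.
find . -name '*.html' -print0 | xargs -0 -n1 -P8 xz -9
```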
The second one is nice, as it will use 8 processor cores. These are also useful commands for moving lists of files.
You can also easily find leftovers if you get stopped in the middle.
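One way to spot them (an assumption about the intended command, again sketched in a scratch directory): anything that never got a .xz extension is a leftover from an interrupted run.

```shell
cd "$(mktemp -d)"
printf 'done\n' > a.html
xz -9 a.html                    # finished: now a.html.xz
printf 'leftover\n' > b.html    # interrupted before compression

# Every file that isn't a .xz yet still needs compressing:
find . -type f ! -name '*.xz'
```

Piping that find output back into the xargs command above resumes the compression where it left off.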