Archiving Sites

One job I have been taking on over the last few weeks is archiving some old project sites that run on a CMS. We tend to render a copy into static HTML to keep them alive but no longer updateable. Normally I would use wget -r <site here> to flatten the site, occasionally adding the --no-check-certificate option when a certificate causes problems. Most of the time that is all it takes: the site is flattened and ready for the code and database to be archived alongside it.
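In practice the invocation is just the recursive flag, with the certificate check disabled only when needed (example.org below stands in for the real site):

    # Recursively fetch the site into a local static copy
    wget -r https://example.org/

    # Same thing, but skip TLS certificate validation when the cert is a problem
    wget -r --no-check-certificate https://example.org/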

Recently, one site had a blog that served its posts at URLs without a .html extension. Wget would not save those pages in a form the static copy could serve, so I wrote a simple Python scraper to fetch the affected URLs and recreate each one as a directory containing an index.html file, preserving the original URL structure.
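The script itself isn't reproduced here, but a minimal sketch of the idea, assuming the extensionless blog URLs have already been collected into a list (the URLs and the archive directory name below are hypothetical), might look like this:

    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hypothetical list of extensionless blog URLs gathered beforehand
    urls = [
        "https://example.org/blog/first-post",
        "https://example.org/blog/second-post",
    ]

    for url in urls:
        path = urlparse(url).path.strip("/")     # e.g. "blog/first-post"
        out_dir = os.path.join("archive", path)  # mirror the URL path as a directory
        os.makedirs(out_dir, exist_ok=True)
        with urlopen(url) as resp:               # fetch the rendered page
            html = resp.read()
        out_file = os.path.join(out_dir, "index.html")
        with open(out_file, "wb") as f:          # save it as the directory index
            f.write(html)

Because each post ends up as <path>/index.html, a static web server keeps resolving the original extensionless URLs without needing any rewrite rules.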
