Archiving Sites

One job I have been taking on over the last few weeks is archiving some old project sites that run on a CMS. We tend to render a copy into static HTML to keep them alive but no longer updateable. Normally I would use wget -r <site here> to flatten the site, occasionally adding the --no-check-certificate option when a certificate causes problems. Most of the time that is all it takes: the site is flattened and ready for the code and database to be archived alongside it.
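In practice the invocation is just the recursive flag, with the certificate check disabled only when needed (example.org below stands in for the real site):

    # Recursively fetch the site into a local static copy
    wget -r https://example.org/

    # Same thing, but skip TLS certificate validation when the cert is a problem
    wget -r --no-check-certificate https://example.org/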

Recently, one site had a blog that served its posts at URLs without a .html extension. Wget would not save those pages in a form the static copy could serve, so I wrote a simple Python scraper to fetch the affected URLs and recreate each one as a directory containing an index.html file, preserving the original URL structure.
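The script itself isn't reproduced here, but a minimal sketch of the idea, assuming the extensionless blog URLs have already been collected into a list (the URLs and the archive directory name below are hypothetical), might look like this:

    import os
    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hypothetical list of extensionless blog URLs gathered beforehand
    urls = [
        "https://example.org/blog/first-post",
        "https://example.org/blog/second-post",
    ]

    for url in urls:
        path = urlparse(url).path.strip("/")     # e.g. "blog/first-post"
        out_dir = os.path.join("archive", path)  # mirror the URL path as a directory
        os.makedirs(out_dir, exist_ok=True)
        with urlopen(url) as resp:               # fetch the rendered page
            html = resp.read()
        out_file = os.path.join(out_dir, "index.html")
        with open(out_file, "wb") as f:          # save it as the directory index
            f.write(html)

Because each post ends up as <path>/index.html, a static web server keeps resolving the original extensionless URLs without needing any rewrite rules.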
