Archiving WordPress Sites as Static HTML

I love SiteSucker, though I do wish there was something that could work more programmatically on the web (it being a desktop app has its limitations there). It does handle scripts and relative URLs very, very well. Some example URLs:

http://vsteconference.org/2015/
http://vsteconference.org/2014/
http://vsteconference.org/2013/
http://vsteconference.org/2012/

That used to be a multisite, and I got tired of maintaining themes and plugins for really old sites that would no longer be receiving updates, so this was a nice compromise. It wouldn’t work with search and contact forms, but that’s to be expected, I think. I do think it even crawls tag/category pages as well. I’m a big fan.

1 Like

This is great! I’m beginning to think I’m psychic (or psycho). I no sooner start thinking "oh, I’m gonna need a way to do xxxxx (like convert older WP sites into static HTML)" and, lo, @cogdog, @jimgroom, @timmmmyboy, and the rest of the Reclaim community solve it for me. Thanks.

1 Like

Tim is always ahead of the game. That WordPress plugin choked on a bigger site, so I broke out my copy of SiteSucker. This WordPress site, used from 2005-2011, is now all static HTML.

I wrote up some notes. The biggest thing is to do some pre-work on the site to remove forms that won’t work (remove search forms, turn off all comments). I would also recommend reading more into the SiteSucker settings; I should have kept mine limited to a directory, and it started walking my whole domain outside the WP install.

It does a fantastic job: all absolute URLs are made relative, so it could also be hung at a new domain.

Hmm. It looks like SiteSucker is Mac/iOS only? Any Linux (or, in a pinch, Windows) equivalents you recommend?

No direct experience. There’s probably some command-line stuff. The only one for PC I’ve heard of is WebWhacker ($49), an offline browser for Windows.

I’m pretty sure SiteSucker is just a fancy GUI wrapper for wget. On Linux, if you’re comfortable in the terminal, I would try wget following the guide at Make Offline Mirror of a Site using `wget` – Guy Rutenberg and see if that works. There are lots of nifty flags with that command to control how it archives a site.
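The ones I’d look at first, spelled out in long form (the URL below is just a placeholder):

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

--mirror turns on recursion and timestamping, --convert-links rewrites links for local viewing, --adjust-extension adds .html extensions where needed, --page-requisites pulls in CSS/JS/images, and --no-parent keeps it from crawling above the directory you start in.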

1 Like

Which now as I type that makes me think this could actually be done server side given all of our servers have access to wget. Hmmm…

Ok so wget is actually really good and opens up some interesting possibilities here. I tried the command from that linked guide

wget -mkEpnp https://blog.timowens.io

while logged in via SSH to my Reclaim Hosting account, and it generated a folder named blog.timowens.io with everything inside it. It took 5 minutes and 43 seconds to download 55MB of stuff and convert the links in all of the pages to local relative ones. I uploaded it to Amazon S3, which can do static file hosting, so folks could see the result.

I could see combining wget with s3cmd and automating this whole thing which would be really interesting. If nothing else it would also just be crazy cool to be able to have a cPanel app that could take a URL as input and throw a folder in your hosting account with a static archive.
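A rough sketch of what that automation could look like (the bucket name is made up, and it assumes s3cmd has already been set up with credentials via s3cmd --configure):

# rough sketch - bucket name is a placeholder
SITE=blog.timowens.io
wget -mkEpnp https://$SITE
s3cmd sync --acl-public --delete-removed $SITE/ s3://some-archive-bucket/$SITE/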

2 Likes

I wish to have that.

1 Like

That is so awesome! How cool, so something like this could possibly be automated on Reclaim?

Seems like it could be a plugin . . . for my scenario that’d be crazy slick.

It actually seems pretty straightforward (in my head anyway). Just need one user credential page for the S3 stuff.

Making this even more interesting, I tested the wget functionality with a local hosts file and was able to archive a site on a server for which the domain expired almost a year ago. Damn that’s cool.
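For anyone who wants to try the same trick: add a line to /etc/hosts pointing the dead domain at the old server’s IP so wget resolves it there (the IP and domain below are made up):

# in /etc/hosts (illustrative values)
192.0.2.10   old-expired-site.com

# then mirror it as usual
wget -mkEpnp http://old-expired-site.com/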

I’ve started building the plugin this could become, just a dummy interface for now. I’ll probably start with a basic “Give me a URL and the folder location you want to save to” and then once that’s working we can look at fancier options like scheduled archive, S3 and other remote archives, etc.

3 Likes

Keeps getting better and better.

If you want to make sure people are archiving only their hosted sites, I wonder if you could do something like the way Google does site verification: generate a file with some kind of hashed name/code in it that has to be uploaded at the root level of the site. The script could then do some kind of verification to make sure the person is only archiving a site they manage.
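Something like this could work for the check itself (entirely hypothetical file naming and flow):

# hypothetical verification sketch
TOKEN=$(openssl rand -hex 16)
echo "Place a file named reclaim-verify-$TOKEN.html at the root of the site to archive."
# later, before archiving:
curl --fail --silent "https://example.com/reclaim-verify-$TOKEN.html" >/dev/null \
  && echo "verified, proceed with archive" \
  || echo "verification file not found"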

1 Like

I was interested in Boris Mann’s point on Twitter that this method loses metadata.

Would love it if he expounded a bit on that; he’s the one who was pointing to the new hackstack before it became all the rage :slight_smile:

Hey all. If you take a database-backed site like WordPress or Drupal and archive it to “flat” HTML, you take a one-way trip to losing all metadata.

What I mean by that is information about posts and pages: date created, author, tags, categories, etc.

Especially for large archives, it means you can’t easily remix the site content again.

I’ve been using Jekyll, a static site generator, for this same purpose. Exporting to Jekyll means individual posts or pages are exported into HTML / markdown, with YAML front matter that contains this metadata.

That last bit was a bit of gibberish if you haven’t played with Jekyll yet. There is a block of text at the top of each text file that has author, tags, etc.
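To make it concrete, the top of one exported post might look something like this (all of the values here are placeholders, not from a real export):

---
title: "An Example Post"
date: 2011-05-04 09:30:00
author: someauthor
categories:
  - announcements
tags:
  - archive
  - wordpress
---
The body of the post, converted to Markdown, follows the front matter.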

The downside to exporting to Jekyll is that it doesn’t preserve the theme (because it’s saving the content, not the presentation layer). And that it means learning a little bit of Jekyll.

There is a WP plugin for Jekyll exports: Jekyll Exporter – WordPress plugin | WordPress.org

Haven’t tried it. Here at Reclaim, you might run a global instance of Jekyll in order to generate the flat HTML.
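The build itself is a small step; roughly something like this (paths are placeholders, and it assumes Ruby/RubyGems is already available on the server):

gem install jekyll bundler
jekyll build --source ./exported-site --destination ./public_html/archive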

More complicated? Yes. I’m a big fan of GitHub Pages, where every site automatically runs Jekyll and you get free hosting, including custom domain names.

Hope that helps explain what I mean.

Appreciate the clarification, and that’s a fair point: switching to a different tool like Jekyll is definitely more flexible if you want to be able to reuse the content again in another context versus simply archiving it. (I also like @cogdog’s idea of simply keeping a dormant copy of the database or SQL backup in case you want to revert.) But if the goal is to actually archive, I’m not sure I agree the metadata is lost. Look at Investing in Community as an example. Viewing the source and looking at the post itself, all the tags and other information are completely intact. You’re right that I can’t turn this into anything else, but as an archival method it still seems to me like a really nice option. I suppose with any archiving methodology, though, the rule of thumb is to have a variety of formats to support longevity.

Yep, wget has been my go-to UNIX utility for years. You can run it on Mac, Windows, Linux, pretty much anything. I use it to crawl websites and create a local mirror of them, getting all related files. Since web apps hide the server-side code and just deliver the HTML to the browser, that’s all wget sees, so you end up with a local HTML archive.

We can argue and say the metadata is “easily parseable” or “documented.” The Markdown files are more like a DB backup.

Those markdown files are easier to re-hydrate than HTML, which you have to write a custom parser for.

Small sites: not a big deal. Big sites: big deal.

I had not thought of the metadata issues. Like Tim suggested, the two small-to-puny sites I did are purely to preserve the sites as they were presented. The chances of them ever being needed again are on the order of me going on a trip to Mars.

But a slap-on-the-head thing I should have done before decommissioning the WordPress front end would have been to export the site; at least I would have the posts and metadata as XML. Someone (aka not me) could probably figure out a way to do some kind of metadata export as JSON?
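If WP-CLI happens to be on the server, something like this might be a starting point (the field list is just an example selection, untested on my end):

# WXR export, the same XML you get from Tools > Export in the dashboard
wp export

# post metadata dumped as JSON
wp post list --post_type=post --fields=ID,post_title,post_date,post_author --format=json > post-metadata.json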

It’s pretty exciting to see how a thread like this rolls into a big ball of possibility.

I think I’m with you @timmmmyboy on this. Use cases I’m thinking of for this are:

1. A WP site is created to facilitate students writing as part of a course assignment (write-in-public type thing) for course x in semester y, probably with a sub-domain or sub-blog on a multisite. For various reasons, we want/need to create a different WP site for the similar assignment for course x in semester y+1. Once semester y is finished, I want students and others to be able to always view the site on the web and have links to it continue to work. But there will not be any editing or changing of anything on that site. I don’t want to keep it in a live WP install since that’s work to maintain and protect. If the metadata can be found, that’s OK. I don’t need to ever reconstitute the site as a live WP install. OR

2. A student has a blog on a school’s WP multisite (think Rampages). After the student graduates and is gone but hasn’t/doesn’t want to migrate it to their own Domain somewhere, they would like to keep a static version on the public web but not allow editing.

1 Like