Things I Hate About the Internet: When the Internet Archive Can’t Archive
When I find something posted on the Web that seems important, one of the first things I do is check whether it’s in the Internet Archive’s Wayback Machine. Fortunately, sometimes it’s already there. Unfortunately, sometimes the actual content isn’t saved.
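For anyone who wants to script that first check, here is a minimal sketch in Python using the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`). The function names `build_query`, `closest_snapshot`, and `check` are my own, chosen for illustration. Keep in mind that a positive result only says a snapshot exists; as the rest of this post explains, it says nothing about whether the snapshot contains the content you cared about.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url, timestamp=None):
    """Build the availability query URL for a page.

    timestamp is optional and uses the YYYYMMDDhhmmss form (a prefix
    such as "2024" also works); it asks for the snapshot nearest to it.
    """
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(page_url):
    """Query the endpoint over the network (hypothetical helper)."""
    with urlopen(build_query(page_url), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling `check("example.com")` should return a `web.archive.org` URL if a snapshot exists and `None` otherwise. You still need to open that URL and look at it yourself.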
The early days of the Web were simple. If you wanted to save a page, all you had to do was download it, find any external resources it needed (like images), and download those as well. Nowadays, though, you’ll often find that this approach only gets you so far. Instead of the actual content, you might get:
A login page: Of course, websites that require a login aren’t new, but there seems to be a trend of “locking down” social-media sites that were previously public. Twitter is a prime example of this: It used to be very easy to archive, but after Musk’s takeover it’s hit-or-miss at best. Some sites, like Facebook, aren’t strictly login-only but will require a login when you submit them to the Internet Archive. Both of these sites are used to publish statements of cultural significance, so it’s quite disheartening to think that they can’t be archived easily. (Update [5 June 2024]: I’d like to highlight Google Docs as well, because it’s another place where people publish culturally significant statements, and also because there’s a simple fix. Google Docs doesn’t require a login when you submit a document to the Internet Archive, but it often shows an error message that prevents you from reading the document. If you share something on Google Docs, consider using the “publish to Web” feature, which will make it much easier to archive. Note that this is not the same as simply sharing a link to a public document. The “publish to Web” URL will end in “/pubhtml” rather than “/edit”.)
A blank or partial page with code to load the actual content: I remember when “progressive enhancement” was a thing and “people who turn off JavaScript in their browser” was a demographic that websites actually took into account. This idea seems to be dying off though, and the ability to archive pages is suffering as a result. The Internet Archive says it can run JavaScript to archive pages, but this doesn’t always work. Bluesky is a great example of this: You can view posts without logging in, but if you try to archive a public post all you get is a sort of “loading” screen (although the text of the post is in the source code at least). Its blog does this weird thing where you have to click in order for the content to show up. (Incidentally, archive.today handles both of these a lot better, so consider making it your first choice for archiving Bluesky posts.)
A media player without the media: This is a specific case of the previous item. If you try to archive the page for a YouTube video, for example, all you’ll get is the player, which uses JavaScript to load the actual video data. The Wayback Machine does have some YouTube videos archived, but those are few and far between, and they clearly come from some extra process (although the process itself isn’t clear to me). I will point out that the Internet Archive’s video collections also have some YouTube videos, but these are separate from the Wayback Machine and not quite as simple to search for.
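The "blank or partial page" case can be checked roughly before you submit anything. The sketch below, in Python, takes the raw HTML of a page (what you get without running any JavaScript) and a phrase you expect to see, and reports whether the phrase is in the visible text, only somewhere in the source (a meta tag or a script, say), or missing entirely. The class `VisibleText` and the function `classify` are names I made up for this example, and the approach is deliberately simple: it ignores CSS that hides elements and anything a real browser would do.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside script, style, and template tags."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements we are inside
        self.chunks = []  # pieces of text found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def classify(html, phrase):
    """Return "visible", "source-only", or "missing" for phrase in html."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so line breaks do not hide a match.
    visible = " ".join(" ".join(parser.chunks).split())
    if phrase in visible:
        return "visible"
    if phrase in html:
        return "source-only"
    return "missing"
```

A "source-only" result matches the kind of situation described above for Bluesky, where the text is in the source code but a snapshot taken without working JavaScript shows only a loading screen. A "missing" result suggests the content arrives in a later request, as with video data.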
Now, to be fair, a lot of the Web is still the kind that you can archive easily. But it’s not a coincidence that all of the examples I gave are from major social-media sites; many of them are particularly unsuitable for archiving. It’s frustrating that much of what happens online takes place on these platforms, without any good way to preserve it.