Link Archiver

(Pardon my verbification.)

Here’s an idea for any website, though it could be particularly applicable to weblogs. I’m a reading junkie; I can’t get enough. When I come across a blog I like I often go back through its archives, which is a great way to get a feel for a site. It’s fascinating to see how some blogs have evolved over the years, how posting styles change, and what people were thinking around the time of important events.

There is one common thread in every archive I browse: I constantly run across dead links. Long-dead links. Dead permalinks, even. I have read that the average lifespan of a web page is 100 days, and I think that may be generous. What good is the wonderful archiving of modern weblog software if those archives become irrelevant less than a year after they’re written?

I think the answer lies in some kind of automatic archiving of all linked content. When you publish a new post, an intelligent spider tied to your blog engine could go grab the content of each page you link to and store it locally. Once a week the spider checks every link on your weblog, and if a resource no longer exists it updates the link in the entry to point to the locally archived version. The local archive would carry a disclaimer and a link to the original location of the resource, much like Google’s cache. The link in the entry could also be marked in some way, perhaps with a different CSS class or rel value than normal links. The engine could also alert you, so you’d know to be wary of that website or publisher in the future.
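For the curious, here is a minimal sketch of what such a spider might look like, assuming a simple HTML post body and a local cache directory; every name, path, and attribute below is a placeholder, not part of any existing blog engine.

```python
# A rough sketch, not a finished plugin: archive every page a post links to at
# publish time, then re-check those links weekly and fall back to the local
# copy when the original disappears. All paths and attribute names are made up.
import hashlib
import os
import re
import urllib.error
import urllib.request

ARCHIVE_DIR = "link-archive"  # hypothetical local cache directory
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def archive_path(url):
    """Local filename for the cached copy of a URL."""
    return os.path.join(ARCHIVE_DIR, hashlib.sha1(url.encode()).hexdigest() + ".html")


def archive_links(post_html):
    """Run at publish time: grab and store every page the post links to."""
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    for url in LINK_RE.findall(post_html):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                with open(archive_path(url), "wb") as cached:
                    cached.write(resp.read())
        except (urllib.error.URLError, OSError):
            pass  # couldn't fetch it; a later pass can try again


def check_links(post_html):
    """Run weekly: point links whose targets have died at the local archive."""
    def replace(match):
        url = match.group(1)
        try:
            with urllib.request.urlopen(url, timeout=10):
                return match.group(0)  # still alive, leave the link alone
        except (urllib.error.URLError, OSError):
            cached = archive_path(url)
            if not os.path.exists(cached):
                return match.group(0)  # dead and never archived; nothing to do
            # rel="archived" so a stylesheet or reader can tell it apart
            return f'href="/{cached}" rel="archived" data-original="{url}"'
    return LINK_RE.sub(replace, post_html)
```

In a real engine this would hang off the publish hook and a weekly cron job, and the cached copy would get the disclaimer page described above, but the shape of the thing is the same.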

How hard would this be? I know there are copyright issues that I’m ignoring, but I don’t foresee that being something that would hold this back. I doubt copyright holders who can’t keep their URIs cool are going to devote many resources to tracking down blogs that archive their missing content. Besides, this could be covered.

I could see this done as a centralized service, something like Technorati meets Furl, but that would really defeat the purpose. Decentralization is the path of the future.

10 thoughts on “Link Archiver”

  1. Like your own personal web.archive.org, one that indexed resources at the time you linked to them. The Internet Archive is great, but it’s slow (in indexing and retrieving), and if you tied the solution to them you’d be putting all your eggs in one basket. Granted, it’s a good basket, but a distributed solution would be better.

  2. Much like Waypath, only distributed. Storing every bit of every post would become costly very quickly.
    Here is something I have been thinking about. If I have a blogroll (or something similar), I could archive the first 20 lines (or more, depending on the user) of every post from the blogs on my blogroll and make that a searchable part of my blog. With a common interface into this archive (such as a published XMLRPC or REST method), searches could be tied together in p2p fashion (or even among blogs in the blogrolls).

  3. Pingback: Mindful Musings
  4. Indeed, as you have alluded to, this is a feature we have thought about for Furl. True, it is a centralized service which has its pros (i.e. you don’t have to manage it) and cons (i.e. you don’t get to manage it), but I don’t think the issue would be data storage (disk space is cheap). Would love to chat more about your vision as it is something I have put thought into as well.

  5. I don’t think centralization would defeat the purpose. Simply having two copies of the content would greatly increase the chances of it surviving longer. And for people like me, having a third party control the backup copy would actually increase the probability that it will always be available 😀

  6. This idea is sort of available to Movable Type users, via the Cache plugin. It stores a local snapshot of any links you flag in your entries, but doesn’t perform the weekly spidering that you talk about.

  7. Mark G, there’s no reason to restrict this to blogrolled sites. I could be missing your point though.

    Mike, I’d love to chat.

    Chris, for most people having someone else manage it is probably the best bet. However if there were a hosted service, I would want an up-to-date copy of all my data on my server. It’s the libertarian in me. 😉

    David, I checked out the MT Cache plugin and it essentially worked how I envisioned a simple version of this working, in that it called wget and modified links using a text filter. However, I think the idea could be expanded much further, most usefully with automatic detection of when the cache should kick in, not relying on programs like wget to do the dirty work, and cataloging the archives somehow.

  8. The Internet Wayback Machine is open source now… perhaps you could have this script update the archive at a predetermined time each week… ensuring you have the most recent copy…

  9. I like the fact that I can ask Google to remove their cached versions of my old and deleted pages. The idea of *any* little webpage out there caching my content and completely removing it from my control is rather terrifying. There’d have to be a way to deny access to one’s site via robots.txt or similar methods.
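One way a self-hosted archiver could respect that wish is to check robots.txt before caching anything, the way well-behaved crawlers do. A rough sketch using Python’s standard robotparser (the user-agent string is a made-up placeholder):

```python
# Sketch: refuse to cache any URL whose site disallows our (hypothetical)
# user agent in robots.txt, so publishers can opt out the usual way.
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "LinkArchiver"  # hypothetical bot name


def allowed_to_archive(url):
    parts = urlparse(url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        robots.read()
    except OSError:
        return True  # robots.txt unreachable; a more cautious archiver might say no here
    return robots.can_fetch(USER_AGENT, url)
```

The archive_links sketch above could call this before fetching anything.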
