Linkrot: what to do?
In our older threads (“older” on the internet means about 3 months), the links to external material start to break. Especially important material is sometimes imported to our server. This republication raises copyright issues. Graphical snippets may fall under fair use provisions. For the complete articles in the NEW section, we have had good success in obtaining copyright permissions, although in most of those cases we have had some previous relationship with the copyright holder (and even then it has taken several letters to settle on the terms of the permission). Only once has external posted material been challenged (Boeing claimed their Columbia slide was proprietary information). But do we go through the many many links in our threads to get permissions so the material can be posted directly on our server? There are now about 500 surviving threads in the 2 years of this board, with perhaps 500 links wandering around. Or do broken links lead to discarding the thread?
Are there some scholars of linkrot and its cures that can give some advice? Examples of practices at similar sites would be helpful. Some other boards seem to operate on the idea of “don’t ask, don’t tell,” or “don’t ask permission but beg forgiveness.” But are there better ways?
It’s a real Goofus and Gallant dilemma you have here: forgiveness or permission?
A couple of thoughts which may or may not provide an ultimate solution, or even one that passes legal muster:
Keep the threads, please, broken links notwithstanding. This forum is gathering a fair bit of critical intellectual mass as a resource on a wide range of fronts (there have been significant threads touching on matters aesthetic, practical, commercial, political and more) and, as you mentioned in another thread, many of the older threads remain valid and active even after your monthly merit reviews. It would be a shame to lose them. (I’m sure I’m not the only one who has received “off the board” emails asking for extra information or an additional comment regarding an older thread.) If a concept is thoroughly discussed, and an illuminating link fails, an interested person will still get the idea and may be able to find that link on a new site or a similar link somewhere else. One thing you might think about is tagging known inactive links as such as they present themselves. People will still see where the link led and can perhaps find it again. If the link is the very fiber of the thread, then …
you might be able to distill a given link into a pdf document you can then post in the thread as a link, not an image. You can lock these files down so they can’t be printed or copied; I’m not sure if this will then elevate the posting into the fair use category and relieve you of the permissions burden.
I realize these notions add to the work of captain and crew, but they may be worthwhile for some important threads. Perhaps there is some way we, humble beneficiaries all, can help.
And with a wink: schools seem to have a bit more leeway in addressing the fair use of documents. Maybe you can get Yale to call this forum Intro to Info 101 or something. Goodness knows we’ve all been going to school here for the past two years.
I have no solution, but I admire the problem.
I chuckle every time the local NPR station mentions that a story will be “archived for one week” on their website. Changes the meaning of “archive”, no?
I believe that threads continue to be useful even though some of the links have expired. Several current threads have links to “premium” or “by subscription only” content (e.g., NY Times), and yet the discussion in those threads remains valuable even if the links are not accessible.
I have no solution either, at least not in the short term for this board.
Here are some of my thoughts on the issue.
Replicating the data to the local server only temporarily solves the problem anyhow.
Replication seriously undermines the authority and validity of the original document. How can a viewer be sure that your copy isn’t a copy of a copy of a copy, with errors accidentally introduced along the way? How can we be sure your copy is the most up-to-date version of the document?
The Internet is decentralized by nature, always changing, files always moving. The problem is that the pointers to those files cannot change; they must remain absolute. Any change in file location requires manually rewriting the absolute links. A fundamental shift away from centralized, server-based file pointing needs to take place to solve the issue.
My opinion is that the current centralized method, or absolute file linking to decentralized systems, cannot solve the problem of academic and intellectual property rights violations.
Although a lot of work would be needed to perfect the model, an example of an extremely successful model of decentralized, serverless file storage and retrieval is the Gnutella network, a large-scale peer-to-peer file sharing system in the same family as the better-known Kazaa and Napster (which ran on their own protocols). BUT don’t let the reputation scare you!
Combined with a file author validation schema, Gnutella-style networking would allow for redundant, non-centralized file storage. A validation solution could be implemented in the metadata at the beginning of the files being stored. A validation system could be propagated at file request time. The file can be “owned” by a licensed author, but because the file storage system is decentralized, posting a file for consumption would mean merely adding the file to a decentralized public domain system. If one wished not to share files in such a manner and had copyright concerns, one simply wouldn’t enter the file into the public Gnutella space. Instead one would offer the file in a more protected manner. In such a system, clear lines would be discernible as to whether or not a particular file was legal for distribution and remote linking. Version validation in the metadata of the files would maintain the author’s authority over the file and ensure file integrity. By viewing the metadata, the user or receiver of the file would know exactly what version he or she has received. Recently, the RIAA has been using file metadata to prove the true originating source of mp3 files found on music pirate servers.
The same metadata validation scheme the RIAA is using could also root out files placed on the system by users who are not the original author; such files could then be declared invalid copies.
A paper on some of the issues and technical workings of Gnutella can be found here: http://people.cs.uchicago.edu/~matei/PAPERS/gnutella-rc.pdf
A Google search reveals tons of information.
“Tons of information,” of course, is not helpful. The man asks for best practices, not low resolution search results.
I would link to the corresponding Google cache or Wayback Machine entry as the first resort; failing that, I’d seek permission and capture to HTML, PDF, or image. (As a single document capture that both preserves and perseveres, PDF seems to be the prevailing favorite.)
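As a rough illustration of the Wayback Machine fallback, a dead link can be rewritten to point at an archived snapshot. The sketch below assumes the archive's public URL pattern of `https://web.archive.org/web/<timestamp>/<original URL>`; the function name and the example date are hypothetical, and nothing here checks that a snapshot actually exists.

```python
from datetime import date

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, posted_on):
    """Build a Wayback Machine URL aimed at the capture nearest to posted_on."""
    # Assumption: a date-only stamp (YYYYMMDD) is resolved by the archive
    # to the closest capture it holds, if it holds one at all.
    stamp = posted_on.strftime("%Y%m%d")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical posting date for a link that later broke.
    print(wayback_url("http://www.archive.org/", date(2004, 6, 1)))
```

Using the date the link was originally posted, rather than today's date, should bring up the version of the page the contributor was actually looking at.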
Managing to keep up with each link’s survival, even on a site this small, is tedious work, and I don’t know of any software out there that accurately ferrets out which links yet live, much less replaces the dead ones.
One of the main reasons it took as long as 20-30 years from the invention of hypertext to the growth of the WWW is the issue of linkrot. Early hypertext pioneers could not even imagine building a hypertext system that permitted links to lead nowhere (what we now call link rot, or “404”).
One of Tim Berners-Lee’s major breakthroughs when inventing the Web was simply to ignore this constraint. The Web took off because of its distributed linking model, but the same decentralized model also made link rot possible. If webmasters had taken more inspiration from librarians, and picked up fewer bad habits from limitations in software, the whole problem of link rot would be smaller. Tim Berners-Lee has discussed the underlying reasons for linkrot and advocated persistent URLs and a more future-proof addressing scheme on the web.
What to do?
Jakob Nielsen discussed how to fight linkrot in an article in 1998. The long-term solution to this problem is to add a layer of abstraction between a web page and its location. That’s the strategy behind peer-to-peer networks and the W3C’s URI standard. Addressing every web document by its filename and location in the file system is not a good long-term solution. (Database-generated URLs are even worse…) Today’s situation is much like having to know the exact location of a book on the library shelves, instead of being able to find it by its Dewey number.
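The layer of abstraction described above can be pictured as a small lookup table sitting between a stable identifier and the document's current address. This is only a toy sketch: the `Resolver` class, the identifier format, and the example URLs are made up for illustration and do not correspond to any particular persistent-URL service.

```python
class Resolver:
    """Maps stable identifiers to whatever location a document currently has."""

    def __init__(self):
        self._locations = {}  # doc_id -> current URL

    def register(self, doc_id, url):
        # Called once, when the document is first published.
        self._locations[doc_id] = url

    def move(self, doc_id, new_url):
        # When the file moves, only this table changes;
        # pages that cite doc_id do not need to be edited.
        if doc_id not in self._locations:
            raise KeyError(f"unknown document id: {doc_id}")
        self._locations[doc_id] = new_url

    def resolve(self, doc_id):
        return self._locations[doc_id]


if __name__ == "__main__":
    resolver = Resolver()
    resolver.register("doc:report-2004", "http://example.org/old/report.html")
    resolver.move("doc:report-2004", "http://example.org/archive/report.html")
    print(resolver.resolve("doc:report-2004"))
```

The analogy to the Dewey number is that a thread would cite `doc:report-2004`, never the shelf location itself.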
Perhaps this will help:
Juicy Studio Link Analyser
Their introduction:
A couple of quick thoughts.
First, save a private archive copy of everything you link to. When the link breaks, this will help you decide if you care, provide search text for Google (to find relocated docs), and give you contact clues. And you will have the content if you do someday decide on an “optimistic” copyright strategy, or wish to pull out fair use samples.
Second, at least some of the links in the Ask ET section are created by the poster, as part of the posting. Inline image links in particular. These at least might reasonably be cached by default. One might perhaps also include a button on the submission form vaguely like “The linked material is mine, and is part of the post (permission to cache)”.
Miscellaneously… Automate watching the links, as some sites will provide relocation information for a time. This can also be used to archive document upgrades. Some kinds of links you know ahead of time are likely to break (e.g., personal and course URLs at academic institutions, articles at small newspapers), and you can immediately send a form letter: “…doc useful…link seems likely to someday break…to maintain the integrity of our site…may we have permission, should the link someday break, to temporarily use a cached copy until you find a new location for the content?” Most individuals say yes. Discarding threads because of broken links may sometimes cause difficulties: I will visit Ask ET looking for threads I vaguely remember reading months or years ago. But labeling broken links as “(broken)” is helpful. Finally, the “Wayback Machine” archive may or may not be of some use (http://www.archive.org/).
Mitchell Charity
The first step in reducing linkrot would be to implement some system whereby it can
be identified. An extension to the existing thread posting mechanism should be able
to scan posts for links and store the links and their thread IDs somewhere; then a
separate program could scan those links once a day to see which servers respond
with a “page has moved” error, which with a “page not found”, and so on, along with
the good links. HTTP has many useful return codes.
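A minimal sketch of the two pieces described above: pulling links out of posts together with their thread IDs, and sorting the HTTP status codes that come back from the daily scan. The function names and the regular expression are assumptions for illustration; the actual fetching is left out, so the status codes are assumed to come from whatever HTTP client the site already uses.

```python
import re
from collections import defaultdict

# Deliberately simple URL pattern; a real forum would parse its own markup instead.
URL_PATTERN = re.compile(r"""https?://[^\s<>"')]+""")


def extract_links(posts):
    """Return {url: set of thread IDs} from an iterable of (thread_id, text) pairs."""
    index = defaultdict(set)
    for thread_id, text in posts:
        for url in URL_PATTERN.findall(text):
            # Drop sentence punctuation that got swept up with the URL.
            index[url.rstrip(".,;")].add(thread_id)
    return dict(index)


def classify(status):
    """Bucket an HTTP status code the way a daily report might."""
    if 200 <= status < 300:
        return "ok"
    if status in (301, 308):
        return "moved permanently"
    if status in (302, 303, 307):
        return "moved temporarily"
    if status in (404, 410):
        return "not found"
    return "other"


if __name__ == "__main__":
    posts = [
        (12, "See http://www.archive.org/ and http://example.com/a.pdf."),
        (15, "Again: http://www.archive.org/"),
    ]
    for url, threads in extract_links(posts).items():
        print(url, sorted(threads))
    print(classify(301), "/", classify(404))
```

Keeping the thread IDs alongside each URL means one dead link can be reported against every thread that cites it.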
Automating that checking process would, first of all, give you an idea of link half-life.
Secondly, since many sites leave a redirect in place of relocated links, the “page has
moved” trap would allow this forum to update the links automatically. I would think
that adjusting pointers is within the realm of editorial integrity.
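On the half-life idea: if one is willing to assume that links die at a roughly constant rate (exponential decay, which is my assumption and not something measured here), a single survey gives an estimate. With N0 links at the start and N still alive after t days, the half-life is t × ln 2 / ln(N0 / N). The numbers in the example are invented purely to show the arithmetic.

```python
import math


def link_half_life(total_links, surviving_links, elapsed_days):
    """Estimate link half-life in days, assuming exponential decay."""
    if not 0 < surviving_links < total_links:
        raise ValueError("need at least one dead and one surviving link")
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    decay_rate = math.log(total_links / surviving_links) / elapsed_days
    return math.log(2) / decay_rate


if __name__ == "__main__":
    # Hypothetical survey: 400 links checked, 100 still alive after 730 days.
    print(round(link_half_life(400, 100, 730)))
```

A quarter surviving after two years means two halvings, so the estimate in this made-up case comes out at about one year.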
There would be some additional delay during posting, and the site would require a bit
more maintenance.
I’d be curious to see how effective this strategy is — it would be the remaining
percentage of links that would require the most effort. Perhaps an investment in
archive storage for the most promising documents, as defined by the editors, would
become valuable?
Here is a link to some current news about the ‘Peridot Project’ that addresses some of the issues discussed above.
The BBC News article states: ‘Peridot is a green gemstone which, legend has it, was used in ancient cultures to help people find what they had lost.’ If only…..
http://news.bbc.co.uk/1/hi/technology/3666660.stm
http://www.copyright.gov/1201/index.html#hearings
Adding to what Scott Zetlan described, there are some tools which help
significantly, such as MOMspider by Roy Fielding.
From experience in managing a large, active email list over several
years, links often tend to “shift” rather than “rot”. If so, their
servers may provide information about redirected URLs … for a while. When you have reports from something like MOMspider running, you can catch and adjust links as they move. Of course, that won’t solve the problem of content changing, but some of the content caches which Marshall Votta mentioned are rather amazing/embarrassing in terms of what they manage to preserve.
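To make the "catch and adjust links as they move" step concrete, here is a small sketch that takes the results of a scan and rewrites only the permanently redirected URLs in a post. It is not how MOMspider works internally; the `scan_results` format (URL mapped to a status code and the `Location` header) and the function name are assumptions for illustration.

```python
import re

# A match must not be followed by more URL characters, so that a moved
# ".../a.html" does not also rewrite the distinct URL ".../a.html.old".
_NOT_MORE_URL = r"(?![\w/%?=&#~+-]|\.\w)"


def apply_redirects(text, scan_results):
    """Return (updated_text, changes) after applying permanent redirects.

    scan_results maps url -> (status_code, location_or_None).
    """
    updated, changes = text, []
    for url, (status, location) in scan_results.items():
        # Temporary redirects (302, 307) are left alone on purpose.
        if status not in (301, 308) or not location:
            continue
        pattern = re.compile(re.escape(url) + _NOT_MORE_URL)
        updated, count = pattern.subn(lambda _m, loc=location: loc, updated)
        if count:
            changes.append((url, location))
    return updated, changes


if __name__ == "__main__":
    post = "See http://example.com/a.html and http://example.com/a.html.old."
    results = {
        "http://example.com/a.html": (301, "http://example.com/docs/a.html"),
        "http://example.com/gone.html": (404, None),
    }
    print(apply_redirects(post, results))
```

Returning the list of changes alongside the text leaves a record an editor can review before anything is saved back to the thread.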
Google lists a directory of link validator tools akin to MOMspider.
Speaking of content that changes, I’d recommend adding the following
snippet of HTML to the “Ask E.T.” web pages, in the head section.
This will make your RSS feed operate with the new “Live Bookmarks”
feature in Mozilla Firefox browsers, so that readers will be able to
see new answers being added.
Any chance that a Kindly Contributor could search through some of the major threads
here, check the links, and update the broken links? We could then move the updated links
up to the original contribution. I have in mind a nice gift for such a Kindly Contributor.
ET
There may be ways to repair broken links to NYTimes stories; here is the link to Wired News:
http://www.wired.com/news/print/0,1294,64110,00.html
This story deals with the lack of links to NYTimes stories in current events searches (also a problem with The Wall Street Journal) on Google and then describes some backdoor ways to maintain persistent links. The problem arises because NYT and WSJ require registration and payment for access to archives.
But I do not know whether these backdoor methods still work. I hope that our most Kindly
Contributors can check this out when encountering broken NYTimes links. Perhaps the
method can be generalized to apply to other news archives.
After the linkrot survey is complete, we’ll integrate the corrected links into the
original threads for the convenience of our readers (who produced about 40,000 sessions
and 90,000 page views last week on the site as a whole, with about 65% to the forum).
Again I’m so thankful to Martin and Niels for their detective work. Others are welcome to
help; there’s a lot to do.
Microsoft has some accessories for IE, one being a link list and image list tool. With it you can right click and get all of the links or all of the images on a page. It’s a handy tool to help in the search for link rot.
http://www.microsoft.com/windows/ie/previous/webaccess/default.mspx
Sean
Here’s the back door for bloggers at the New York Times. Users of this site could also use this service. It seems to be a very good deal: Aaron Swartz wrote it specifically for the New York Times.
I’ve written a small Perl program that can scan the entire contents of the forum, checking for bad links as it goes. It could conceivably be used to modify dead URLs into a link pointing to archive.org, or even add a small icon or text next to the link showing that its status is in question. It’s a small thing, really, but I wanted to contribute something to the cause.
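For readers curious what the marking step might look like, here is a sketch in Python (not the Perl program mentioned above, whose details are not shown in this thread). It appends a "(broken)" note and an archive.org pointer after each URL known to be dead. The function name is hypothetical, and the timestamp-free `https://web.archive.org/web/<url>` form is assumed to lead to the most recent capture.

```python
import re

# Match a URL but leave trailing sentence punctuation outside the match.
_URL = re.compile(r"https?://[^\s<>()]+[^\s<>().,;]")


def flag_dead_links(text, dead_urls, archive_prefix="https://web.archive.org/web/"):
    """Annotate every URL in text that appears in dead_urls."""

    def annotate(match):
        url = match.group(0)
        if url not in dead_urls:
            return url
        return f"{url} (broken; try {archive_prefix}{url})"

    return _URL.sub(annotate, text)


if __name__ == "__main__":
    post = "Slides at http://example.com/slides.html, paper at http://example.com/paper.pdf."
    print(flag_dead_links(post, {"http://example.com/slides.html"}))
```

The original URL is kept visible so that readers can still see where the link once led. Note that running the function twice over the same text would annotate twice, so it should be applied to the stored original rather than to already-marked output.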
This work is enormously helpful to the viewers of the forum. Thank you.
ET
Some dead links are quite creative, like the California Highway Patrol stopping distances. The webmaster effectively deleted two pages by undermining his own links: he renamed the pages from “http://…/blahblah.html” to “http://…/blahblah.html.old” but those pages are still on the server. Even Google missed it. Others, like the New York Times, are highly variable in their degree of displacement from the original location. So far it appears media outlets have the highest failure rate, followed by government and museums. These are three (probably the top three) I would expect to employ people called “archivists” whose job is presumably to ensure the availability of information.
Freshmeat lists 66 projects in its link checking category.
In the next few weeks, we’ll be incorporating the updated links back up into the original
contributions. Thank you Kindly Contributors!
ET
P.S. Can Martin or Niels please compile a list of all those who contributed, along with mailing
addresses, so that I can send them a little thank-you gift? Please send the list to Elaine
Morse at Graphics Press who is handling the link-fixing project.
I don’t wish to monopolise this thread, but I have just come across a similar issue: email
rot. I have just sent a note to a selection of people I have corresponded with on this site
and nine of the email addresses have failed. An insignificant number compared to the
total, but if this site keeps going for several years then many more will die. They will be
impossible to recover and a trail back to the originators of some wonderful ideas will have
been lost.
Haven’t a clue what to do about it, but it must be a common problem.
Looking back over this thread, I realized no one mentioned the “Wayback Machine,” which can be found here.
Scott: archive.org has been mentioned earlier.
Keeping track of content obeys at least two parameters, solidity of the method and its ease of use.
* Linking is the easiest, and the least viable because of linkrot (which, I think, should be considered inevitable).
** Saving screenshots with links has been automated by some social bookmarking websites, possibly furl or blogmarks.
*** The safest and most exhaustive method right now is to keep a Safari .webarchive file somewhere as a backup.
**** Ideally, browser extensions could automate the saving of the following information: Web address, Web archive, timestamp. And possibly share it?
Method 2** is enough for saving text/pictures but fails for animated content. Method 3*** is rather efficient but less practical. Method 1* stays the favourite for reasons of convenience and because of copyright issues in countries where private backups are threatened by harsh IP laws.
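The record that such a browser extension might save per link (web address, web archive, timestamp) could be as small as the following. This is a sketch only: the `Snapshot` fields, the file name, and the JSON layout are my assumptions, chosen so that records could be shared as plain text.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class Snapshot:
    url: str           # web address as cited in the post
    archive_path: str  # where the saved copy lives (file path or archive URL)
    saved_at: str      # ISO 8601 timestamp, UTC


def make_snapshot(url, archive_path, now=None):
    """Create a record stamped with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return Snapshot(url, archive_path, now.isoformat(timespec="seconds"))


def to_json(snapshot):
    return json.dumps(asdict(snapshot), sort_keys=True)


if __name__ == "__main__":
    # Hypothetical file name for a saved .webarchive copy.
    print(to_json(make_snapshot("http://example.com/", "backups/example.webarchive")))
```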
I think Marshall Votta’s answer about linking to the Internet Archive copy
in the Wayback Machine or Google’s cached copy is a great
idea, though it may be challenging to find archived copies of topical information.
I also like the various suggestions about walking the links using any of the various link validation tools, but more
often than not when a link breaks it’s a catastrophic break: rather than the information getting moved to a new section
of the Web site the information disappears altogether. Doctoral students graduate, topical information becomes
off-topic or just plain old news, and yes, archives run out of space. So, if we really want the information to linger, we’re left
with becoming our own librarians (with a non-sarcastic thank you going out to the Internet Archive injected here).
What may be more useful than finding out when our links break would be to have some kind of feedback to know
when they’re starting to age out. I can imagine using a snippet of JavaScript to change the alpha on links as they age,
maybe leveraging a non-standard date attribute on the anchor tag, so that as either the link creation date or the target
page ages, the link starts to fade. Links that are fresh look fresh. Links that become hard to read are a sign that they
may become very hard to read if someone doesn’t save a copy quickly.
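One way the fading could be computed is sketched below. It is done here in Python on the server side rather than in browser JavaScript, and the two-year fade period, the 0.25 floor, and the `data-created` attribute name are all arbitrary assumptions of mine.

```python
from datetime import date
from html import escape


def link_opacity(created, today, fade_days=730, floor=0.25):
    """Fade linearly from 1.0 (fresh) down to `floor` once a link is fade_days old."""
    age_days = max((today - created).days, 0)
    fraction = min(age_days / fade_days, 1.0)
    return round(1.0 - fraction * (1.0 - floor), 2)


def styled_anchor(url, label, created, today):
    """Render an anchor tag whose opacity reflects the age of the link."""
    opacity = link_opacity(created, today)
    return (f'<a href="{escape(url, quote=True)}" '
            f'data-created="{created.isoformat()}" '
            f'style="opacity: {opacity}">{escape(label)}</a>')


if __name__ == "__main__":
    print(styled_anchor("http://example.com/", "example",
                        date(2004, 1, 1), date(2004, 9, 1)))
```

The floor keeps even very old links legible enough to click, which seems preferable to letting them vanish entirely.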
This isn’t a viable solution currently, but more an idea about how this problem might be solved more permanently in the future:
In his Aug 2006 lecture at Google, Van Jacobson describes a revision of our current networking model on the internet.
In brief, it would mean moving away from a system based on location to a system based on data. When data is uniquely identified, its source can be distributed, which means the system is more robust and problems like link rot (or even lack of connectivity) are reduced.
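To show what "uniquely identified data" could mean in the simplest case, the sketch below names each document by a hash of its own bytes, so any copy from any source can be checked against its name. This is not Van Jacobson's design; the `ContentStore` class and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


class ContentStore:
    """Toy store in which a document's name is the SHA-256 of its content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        name = hashlib.sha256(data).hexdigest()
        self._blobs[name] = data
        return name

    def get(self, name):
        data = self._blobs[name]
        # Whoever supplied the copy, the name itself lets us verify it.
        if hashlib.sha256(data).hexdigest() != name:
            raise ValueError("stored copy does not match its name")
        return data


if __name__ == "__main__":
    store = ContentStore()
    name = store.put(b"Cool URIs don't change.")
    print(name)
    print(store.get(name))
```

Because the name depends only on the content, two independent mirrors holding the same bytes would hand out the same name, which is what allows the source to be distributed.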
In 1998 Tim Berners-Lee, who set up the first web server, attempted to address the problem of linkrot by educating site admins on how to avoid it: Cool URIs don’t change.
Dr. Tufte,
Your name was brought up recently by a physicist who was commenting on Stanford Professor Larry Lessig’s style of using Apple’s Keynote (a tool nearly identical to PowerPoint) to deliver presentations. Lessig’s style is quite minimalist but it seems to be very effective in capturing people’s attention.
The physicist used the Lessig style and he claims he got great results with his audience. The link to this blog post on Lessig’s blog is here:
http://lessig.org/blog/2008/04/a_physicist_on_the_lessig_styl.html
I would be really interested to hear what you make of the Lessig style. Could a hybrid Lessig-Tufte approach to giving presentations, in fact, be a match made in heaven?
Thanks,
Eddie