
Linkrot: what to do?

August 27, 2003  |  Edward Tufte
27 Comment(s)

In our older threads (“older” on the internet means about 3 months), the links to external material start to break. Especially important material is sometimes imported to our server. This republication raises copyright issues. Graphical snippets may fall under fair use provisions. For the complete articles in the NEW section, we have had good success in obtaining copyright permissions, although in most of those cases we have had some previous relationship with the copyright holder (and even then it has taken several letters to settle on the terms of the permission). Only once has external posted material been challenged (Boeing claimed their Columbia slide was proprietary information). But do we go through the many many links in our threads to get permissions so the material can be posted directly on our server? There are now about 500 surviving threads in the 2 years of this board, with perhaps 500 links wandering around. Or do broken links lead to discarding the thread?

Are there some scholars of linkrot and its cures that can give some advice? Examples of practices at similar sites would be helpful. Some other boards seem to operate on the idea of “don’t ask, don’t tell,” or “don’t ask permission but beg forgiveness.” But are there better ways?

Topics: E.T.
Comments
  • Steve Sprague says:

    It’s a real Goofus and Gallant dilemma you have here: forgiveness or permission?

    A couple of thoughts which may or may not provide an ultimate solution, or even one that passes legal muster:

    Keep the threads, please, broken links notwithstanding. This forum is gathering a fair bit of critical intellectual mass as a resource on a wide range of fronts (there have been significant threads touching on matters aesthetic, practical, commercial, political and more) and, as you mentioned in another thread, many of the older threads remain valid and active even after your monthly merit reviews. It would be a shame to lose them. (I’m sure I’m not the only one who has received “off the board” emails asking for extra information or an additional comment regarding an older thread.) If a concept is thoroughly discussed, and an illuminating link fails, an interested person will still get the idea and may be able to find that link on a new site or a similar link somewhere else. One thing you might think about is tagging known inactive links as such as they present themselves. People will still see where the link led and can perhaps find it again. If the link is the very fiber of the thread, then …

    you might be able to distill a given link into a pdf document you can then post in the thread as a link, not an image. You can lock these files down so they can’t be printed or copied; I’m not sure if this will then elevate the posting into the fair use category and relieve you of the permissions burden.

    I realize these notions add to the work of captain and crew, but they may be worthwhile for some important threads. Perhaps there is some way we, humble beneficiaries all, can help.

    And with a wink: schools seem to have a bit more leeway in addressing the fair use of documents. Maybe you can get Yale to call this forum Intro to Info 101 or something. Goodness knows we’ve all been going to school here for the past two years.

  • Jim Landis says:

    I have no solution, but I admire the problem.

    I chuckle every time the local NPR station mentions that a story will be “archived for one week” on their website. Changes the meaning of “archive”, no?

    I believe that threads continue to be useful even though some of the links have expired. Several current threads have links to “premium” or “by subscription only” content, (e.g. NY Times), and yet the discussion in those threads remains valuable even if the links are not accessible.

  • Jeffrey Berg says:

    I have no solution either, at least not in the short term for this board.

    Here are some of my thoughts on the issue.

    Replicating the data to the local server only temporarily solves the problem anyhow.

    Replication seriously undermines the authority and validity of the original document. How can a viewer be sure that your copy isn’t a copy of a copy of a copy, and accidentally error-prone? How can we be sure your copy is the most up-to-date version of the document?

    The Internet is decentralized by nature, always changing, files always moving. The problem is that the pointers to those files cannot change; they must remain absolute. Any change in file location requires the manual rewriting of the absolute links. A fundamental shift away from centralized, server-based file pointing needs to take place to solve the issue.

    My opinion is that the current centralized method, or absolute file linking to decentralized systems, cannot solve the problem of academic and intellectual property rights violations.

    Although a lot of work would be needed to perfect the model, an example of an extremely successful model of decentralized serverless file storage and retrieval is the Gnutella network, a large-scale peer-to-peer file-sharing system of the same general kind as the better-known Kazaa and Napster. BUT don’t let the reputation scare you!

    Combined with a file-author validation scheme, Gnutella-style networking would allow for redundant, non-centralized file storage. A validation solution could be implemented in the metadata at the beginning of the files being stored. A validation system could be propagated at file request time. The file can be “owned” by a licensed author, but because the file storage system is decentralized, posting a file for consumption would mean merely adding the file to a decentralized public domain system. If one wished not to share files in such a manner and had copyright concerns, one simply wouldn’t enter the file into the public Gnutella space. Instead one would offer the file in a more protected manner. In such a system, clear lines would be discernible as to whether or not a particular file was legal for distribution and remote linking. Version validation in the metadata of the files would maintain the author’s authority over the file and ensure file integrity. By viewing the metadata, the user or receiver of the file would know exactly what version he or she has received. Recently, the RIAA has been using file metadata to prove the true originating source of mp3 files found on music pirate servers.

    The same validation scheme the RIAA is using could also root out files placed on the system by users who are not the original author; those files could then be declared invalid copies.

    A paper on some of the issues and technical workings of Gnutella can be found here: http://people.cs.uchicago.edu/~matei/PAPERS/gnutella-rc.pdf

    A Google search reveals tons of information.

  • Marshall Votta says:

    “Tons of information,” of course, is not helpful. The man asks for best practices, not low resolution search results.

    I would link to the corresponding Google cache or Wayback Machine entry as the first resort; failing that, I’d seek permission and capture to HTML, PDF, or image. (As a single document capture that both preserves and perseveres, PDF seems to be the prevailing favorite.)

    Managing to keep up with each link’s survival, even on a site this small, is tedious work, but I don’t know of any software out there that accurately ferrets out which links yet live, much less replaces them.
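
Marshall Votta's "first resort" of pointing at a Wayback Machine entry can be sketched in a few lines of Python. This is only an illustration: the function name `wayback_url` is made up here, and the sketch assumes the archive's usual address pattern of a 14-digit timestamp followed by the original URL. Nothing guarantees that a capture exists for a given page; the archive serves whatever capture it holds nearest to the requested time.

```python
from datetime import datetime

# Assumed address pattern: prefix + YYYYMMDDhhmmss + "/" + original URL.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url, around):
    """Build a Wayback Machine address for a capture near the moment `around`."""
    stamp = around.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_url}"


if __name__ == "__main__":
    # Ask for a capture close to the date this thread was opened.
    print(wayback_url("http://www.archive.org/", datetime(2003, 8, 27)))
```

Using the date of the forum post as `around` is a design choice worth considering: it asks for the page as the poster probably saw it, not the page as it looks today.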

  • Harald Groven says:

    One of the main reasons it took as long as 20-30 years from the invention of hypertext to the growth of the WWW is the issue of linkrot. Early hypertext pioneers could not even imagine making a hypertext system that permitted links to lead nowhere (now referred to as link rot or “404”).

    One of Tim Berners-Lee’s major breakthroughs when inventing the Web was simply to ignore this constraint. The Web took off because of its distributed linking model, but the same decentralized model also made link rot possible. If webmasters had received more inspiration from librarians, and picked up fewer bad habits from limitations in software, the whole problem of link rot would be reduced. Tim Berners-Lee has discussed the underlying reasons for linkrot and advocated persistent URLs and a more future-proof addressing scheme on the web.

    What to do?

    Jakob Nielsen discussed how to fight linkrot in an article in 1998. The long-term solution to this problem is to add a layer of abstraction between a web page and its location. That’s the strategy behind peer-to-peer networks and the W3C’s URI standard. Addressing every web document by its filename and location in the file system is not a good long-term solution. (Database-generated URLs are even worse…) Today’s situation is much like having to know the exact location of a book on the library shelves, instead of being able to find it by its Dewey number.
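
The "layer of abstraction" Harald Groven describes can be pictured as a lookup table that sits between a stable identifier and the document's current address. The Python sketch below is a toy under stated assumptions: the `Resolver` class, the identifier format, and the example addresses are all hypothetical, and real persistent-identifier systems are far more involved.

```python
class Resolver:
    """Maps stable identifiers to wherever the document currently lives."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        # Registering an identifier again records a move; links that cite
        # the identifier do not need to be rewritten.
        self._locations[identifier] = url

    def resolve(self, identifier):
        # Returns None when the identifier is unknown.
        return self._locations.get(identifier)


if __name__ == "__main__":
    resolver = Resolver()
    resolver.register("forum/linkrot", "http://old.example/threads/linkrot.html")
    resolver.register("forum/linkrot", "http://new.example/notebooks/linkrot")
    print(resolver.resolve("forum/linkrot"))
```

The point of the indirection is that a move becomes one table update instead of an edit to every page that cites the document, much as a catalogue number survives a reshelving.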

  • Bob Hunter says:

    Perhaps this will help:
    Juicy Studio Link Analyser

    Their introduction:

    Dead links can be frustrating for your visitors. The Juicy Studio Link Analyser will test the links on a page, and report links that resolve successfully, links that time out, and links that cannot be found. Some pages contain a lot of content. Linked pages with heavy content are likely to result in a timeout report. There are several reasons why a page cannot be found, but the most likely reason is that it has been removed. Sometimes, pages can’t be found because of problems with the Web server. Visiting the site in question may confirm this, with a further visit later to determine if the disturbance is temporary.

  • Mitchell N Charity says:

    A couple of quick thoughts.

    First, save a private archive copy of everything you link to. When the link breaks, this will help you decide if you care, provide search text for google (to find relocated docs), and contact clues. And you will have the content if you do someday decide on an “optimistic” copyright strategy, or wish to pull out fair use samples.

    Second, at least some of the links in the Ask ET section are created by the poster, as part of the posting. Inline image links in particular. These at least might reasonably be cached by default. One might perhaps also include a button on the submission form vaguely like “The linked material is mine, and is part of the post (permission to cache)”.

    Miscellaneously… Automate watching the links, as some sites will provide relocation information for a time. This can also be used to archive document upgrades. Some kinds of links you know ahead of time are likely to break (eg, personal and course urls at academic institutions, articles at small newspapers), and you can immediately send a form letter “…doc useful…link seems likely to someday break…to maintain the integrity of our site…may we have permission, should the link someday break, to temporarily use a cached copy until you find a new location for the content?”. Most individuals say yes. Discarding threads because of broken links may sometimes cause difficulties — I will visit Ask ET looking for threads I vaguely remember reading months/years ago. But labeling broken links as “(broken)” is helpful. Finally, the “Wayback Machine” archive may or may not be of some use (http://www.archive.org/).

    Mitchell Charity
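
Mitchell Charity's first suggestion, a private archive copy of everything linked, might look roughly like the Python sketch below. The `archive_copy` name, the directory layout, and the fields of the JSON record are assumptions made for illustration; fetching the page is left out so that the example stays self-contained.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_copy(url, content, archive_dir, fetched_at=None):
    """Store fetched bytes plus a small JSON record; return the record."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # File names are derived from the URL so that a later copy of the same
    # link overwrites the earlier one (a hypothetical policy).
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    (archive_dir / f"{key}.bin").write_bytes(content)

    record = {
        "url": url,                                   # where the copy came from
        "fetched_at": fetched_at.isoformat(),         # when it was saved
        "sha256": hashlib.sha256(content).hexdigest(),  # detects later changes
        "file": f"{key}.bin",
    }
    (archive_dir / f"{key}.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8"
    )
    return record
```

Keeping the original address and the retrieval time beside the copy supplies the "contact clues" and search text mentioned above, and the content hash makes it cheap to notice when a live page has changed since it was saved.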

  • Scott Zetlan says:

    The first step in reducing linkrot would be to implement some system whereby it can
    be identified. An extension to the existing thread posting mechanism should be able
    to scan posts for links and store the links and their thread IDs somewhere; then a
    separate program could scan those links once a day to see which servers respond
    with a “page has moved” error, which with a “page not found”, and so on, along with
    the good links. HTTP has many useful return codes.

    Automating that checking process would, first of all, give you an idea of link half-life.
    Secondly, since many sites leave a redirect in place of relocated links, the “page has
    moved” trap would allow this forum to update the links automatically. I would think
    that adjusting pointers is within the realm of editorial integrity.

    There would be some additional delay during posting, and the site would require a bit
    more maintenance.

    I’d be curious to see how effective this strategy is — it would be the remaining
    percentage of links that would require the most effort. Perhaps an investment in
    archive storage for the most promising documents, as defined by the editors, would
    become valuable?
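
The two-part system Scott Zetlan outlines (collect the links from each post, then probe them on a schedule and sort the answers by HTTP return code) can be sketched in Python with the standard library alone. Everything named here is hypothetical: `index_links`, `classify_status`, `check_link`, and the state labels are invented for the example, and a production checker would also need retries, rate limiting, and a fallback for servers that reject HEAD requests.

```python
import http.client
from html.parser import HTMLParser
from urllib.parse import urlsplit

LINK_ATTRS = {"a": "href", "img": "src"}  # tags worth scanning, and where the URL sits


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from a fragment of post HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def index_links(posts):
    """posts: iterable of (thread_id, html). Returns {url: set of thread ids}."""
    index = {}
    for thread_id, html in posts:
        collector = LinkCollector()
        collector.feed(html)
        for url in collector.links:
            index.setdefault(url, set()).add(thread_id)
    return index


def classify_status(status):
    """Map an HTTP status code to a coarse link state."""
    if 200 <= status < 300:
        return "ok"
    if status in (301, 308):
        return "moved"      # permanent redirect: a candidate for automatic update
    if status in (302, 303, 307):
        return "temporary"  # temporary redirect: keep the stored link
    if status in (404, 410):
        return "gone"
    return "unknown"        # server errors, login walls, anything else


def check_link(url, timeout=10):
    """HEAD the URL without following redirects; return (state, new_location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return classify_status(response.status), response.getheader("Location")
    except (OSError, http.client.HTTPException):
        return "unreachable", None
    finally:
        conn.close()
```

Running `check_link` once a day over the keys of `index_links(...)` and logging the results would give the survival data needed to estimate link half-life, while the "moved" state together with the returned location is the "page has moved" trap described above.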

  • Mark Reilly says:

    Here is a link to some current news about the ‘Peridot Project’ that addresses some of the issues discussed above.

    The BBC News article states: ‘Peridot is a green gemstone which, legend has it, was used in ancient cultures to help people find what they had lost.’ If only…..

    http://news.bbc.co.uk/1/hi/technology/3666660.stm

  • Paco Nathan says:

    Adding to what Scott Zetlan described, there are some tools which help significantly, such as MOMspider by Roy Fielding.

    From experience in managing a large, active email list over several years, links often tend to “shift” rather than “rot”. If so, their servers may provide information about redirected URLs … for a while. When you have reports from something like MOMspider running, you can catch and adjust links as they move. Of course, that won’t solve the problem of content changing, but some of the content caches which Marshall Votta mentioned are rather amazing/embarrassing in terms of what they manage to preserve.

    Google lists a directory of link validator tools akin to MOMspider.

    Speaking of content that changes, I’d recommend adding the following
    snippet of HTML to the “Ask E.T.” web pages, in the head section.
    This will make your RSS feed operate with the new “Live Bookmarks”
    feature in Mozilla Firefox browsers, so that readers will be able to
    see new answers being added.

    
    <link
     rel="alternate"
     type="application/rss+xml"
     title="Ask E.T. New Answers"
     href="https://www.edwardtufte.com/bboard/asket.xml"
    />
    
    

  • Edward Tufte says:

    Any chance that a Kindly Contributor could search through some of the major threads
    here, check the links, and update the broken links? We could then move the updated links
    up to the original contribution. I have in mind a nice gift for such a Kindly Contributor.

    ET

  • Edward Tufte says:

    There may be ways to repair broken links to NYTimes stories; here is the link to Wired News:

    http://www.wired.com/news/print/0,1294,64110,00.html

    This story deals with the lack of links to NYTimes stories in current events searches (also a problem with The Wall Street Journal) on Google and then describes some backdoor ways to maintain persistent links. The problem arises because NYT and WSJ require registration and payment for access to archives.

    But I do not know whether these backdoor methods still work. I hope that our most Kindly
    Contributors can check this out when encountering broken NYTimes links. Perhaps the
    method can be generalized to apply to other news archives.

    After the linkrot survey is complete, we’ll integrate the corrected links into the
    original threads for the convenience of our readers (who produced about 40,000 sessions
    and 90,000 page views last week on the site as a whole, with about 65% to the forum).

    Again I’m so thankful to Martin and Niels for their detective work. Others are welcome to
    help; there’s a lot to do.

  • Sean Gerety says:

    Microsoft has some accessories for IE, one being a link list and image list tool. With it you can right click and get all of the links or all of the images on a page. It’s a handy tool to help in the search for link rot.

    http://www.microsoft.com/windows/ie/previous/webaccess/default.mspx

    Sean

  • Niels Olson says:

    Here’s the back door for bloggers at the New York Times. Users of this site could also use this service. It seems to be a very good deal; Aaron Swartz wrote it specifically for the New York Times.

  • Patrick Gaskill says:

    I’ve written a small Perl program that can scan the entire contents of the forum, checking for bad links as it goes. It could conceivably be used to modify dead URLs into a link pointing to archive.org, or even add a small icon or text next to the link showing that its status is in question. It’s a small thing, really, but I wanted to contribute something to the cause.
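
Patrick Gaskill's program was written in Perl and is not reproduced in this thread. The Python sketch below only illustrates the two outcomes he mentions: repointing a dead link, or flagging it with a short text label. The function name, the shape of the `dead` mapping, and the "(broken)" label (borrowed from Mitchell Charity's suggestion) are assumptions, and a regular expression is a shortcut that only suits simple, well-formed anchors with double-quoted attributes.

```python
import re
from html import escape

# Matches a whole anchor element; group 2 is the href value.
ANCHOR = re.compile(
    r'(<a\s[^>]*?href=")([^"]*)("[^>]*>.*?</a>)',
    re.IGNORECASE | re.DOTALL,
)


def annotate_dead_links(html, dead):
    """Rewrite or flag dead links.

    dead maps a dead URL to its replacement, or to None when no
    replacement is known and the link should just be labeled.
    """

    def fix(match):
        url = match.group(2)
        if url not in dead:
            return match.group(0)               # live link: leave untouched
        replacement = dead[url]
        if replacement is None:
            return match.group(0) + " (broken)"  # keep the old address visible
        return match.group(1) + escape(replacement, quote=True) + match.group(3)

    return ANCHOR.sub(fix, html)


if __name__ == "__main__":
    page = '<a href="http://a.example/x">A</a> and <a href="http://b.example/y">B</a>'
    print(annotate_dead_links(page, {"http://a.example/x": None,
                                     "http://b.example/y": "http://b.example/new"}))
```

Leaving the dead address in place next to the label follows the earlier advice in this thread: readers can still see where the link led and try to find the material elsewhere.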

  • Edward Tufte says:

    This work is enormously helpful to the viewers of the forum. Thank you.

    ET

  • Niels Olson says:

    Some dead links are quite creative, like the California Highway Patrol stopping distances. The webmaster effectively deleted two pages by undermining his own links: he renamed the pages from “http://…/blahblah.html” to “http://…/blahblah.html.old” but those pages are still on the server. Even Google missed it. Others, like the New York Times, are highly variable in their degree of displacement from the original location. So far it appears media outlets have the highest failure rate, followed by government and museums. These are three kinds of institution (probably the top three) that I would expect to employ people called “archivists” whose job is presumably to ensure the availability of information.

  • Niels Olson says:

    Freshmeat lists 66 projects in its link checking category.

  • Edward Tufte says:

    In the next few weeks, we’ll be incorporating the updated links back up into the original
    contributions. Thank you Kindly Contributors!

    ET

    P.S. Can Martin or Niels please compile a list of all those who contributed, along with mailing
    addresses, so that I can send them a little thank-you gift? Please send the list to Elaine
    Morse at Graphics Press who is handling the link-fixing project.

  • Martin Ternouth says:

    I don’t wish to monopolise this thread, but I have just come across a similar issue: email
    rot. I have just sent a note to a selection of people I have corresponded with on this site
    and nine of the email addresses have failed. An insignificant number compared to the
    total, but if this site keeps going for several years then many more will die. They will be
    impossible to recover and a trail back to the originators of some wonderful ideas will have
    been lost.

    Haven’t a clue what to do about it, but it must be a common problem.

  • Scott Zetlan says:

    Looking back over this thread, I realized no one mentioned the “Wayback Machine,” which can be found here.

  • Francois says:

    Scott: archive.org has been mentioned earlier.

    Keeping track of content involves at least two parameters: the solidity of the method and its ease of use.

    * Linking is the easiest, and the least viable because of linkrot (which, I think, should be considered inevitable).

    ** Saving screenshots with links has been automated by some social bookmarking websites, possibly furl or blogmarks.

    *** The safest and most exhaustive method right now is to keep a Safari .webarchive file somewhere as a backup.

    **** Ideally, browser extensions could automate the saving of the following information: Web address, Web archive, timestamp. And possibly share it?

    Method 2** is enough for saving text/pictures but fails for animated content. Method 3*** is rather efficient but less practical. Method 1* stays the favourite for reasons of convenience, and because of copyright issues in countries in which private backups are threatened by harsh IP laws.

  • Jack Johnson says:

    I think Marshall Votta’s answer about linking to the Internet Archive copy in the Wayback Machine or Google’s cached copy is a great idea, though it may be challenging to find archived copies of topical information.

    I also like the various suggestions about walking the links using any of the various link validation tools, but more often than not when a link breaks it’s a catastrophic break: rather than the information getting moved to a new section of the Web site the information disappears altogether. Doctoral students graduate, topical information becomes off-topic or just plain old news, and yes, archives run out of space. So, if we really want the information to linger, we’re left with becoming our own librarians (with a non-sarcastic thank you going out to the Internet Archive injected here).

    What may be more useful than finding out when our links break would be to have some kind of feedback to know when they’re starting to age out. I can imagine using a snippet of Javascript to change the alpha on links as they age, maybe leveraging a non-standard date attribute on the anchor tag, and as either the link creation date or the target page age the link starts to fade. Links that are fresh look fresh. Links that become hard to read are a sign that they may become very hard to read if someone doesn’t save a copy quick.
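
Jack Johnson imagines doing the fading with JavaScript in the browser. The same arithmetic can be sketched on the server side in Python, which is what follows. The exponential decay, the one-year half-life, and the opacity floor are all assumptions chosen for illustration, as are the names `link_opacity` and `faded_anchor`.

```python
from datetime import date


def link_opacity(created, today, half_life_days=365.0, floor=0.3):
    """Opacity halves every half_life_days, but never drops below floor."""
    age_days = max((today - created).days, 0)
    return max(floor, 0.5 ** (age_days / half_life_days))


def faded_anchor(url, text, created, today):
    """Render an anchor whose inline style reflects the age of the link."""
    opacity = link_opacity(created, today)
    return f'<a href="{url}" style="opacity: {opacity:.2f}">{text}</a>'


if __name__ == "__main__":
    # A link posted when this thread opened, viewed one year later.
    print(faded_anchor("http://www.archive.org/", "Wayback Machine",
                       date(2003, 8, 27), date(2004, 8, 26)))
```

The floor keeps very old links legible, so the fade works as a warning and not as a way of hiding the link.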

  • Eric Wright says:

    This isn’t a viable solution currently, but more an idea about how this problem might be solved more permanently in the future:

    In his Aug 2006 lecture at Google, Van Jacobson describes a revision of our current networking model on the internet.
    In brief, it would mean moving away from a system based on location to a system based on data. When data is uniquely identified, its source can be distributed, which means the system is more robust and problems like link rot (or even lack of connectivity) are reduced.
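
One way to picture "a system based on data" is to name a document by a hash of its bytes, so that any holder of a copy can serve it and the receiver can verify what arrives. The Python sketch below illustrates that general idea only; naming by SHA-256 is an assumption made here, not a description of the design presented in the lecture, and the function names are hypothetical.

```python
import hashlib


def content_id(data):
    """Identify a document by the SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


def fetch_by_content(cid, sources):
    """Try each source in turn; accept the first copy whose hash matches.

    sources is an iterable of callables taking the identifier and
    returning bytes, or None when that source has no copy.
    """
    for source in sources:
        try:
            data = source(cid)
        except Exception:
            continue  # a failing source is skipped, not fatal
        if data is not None and content_id(data) == cid:
            return data
    return None


if __name__ == "__main__":
    document = b"Linkrot: what to do?"
    cid = content_id(document)
    offline = lambda _cid: None          # a host that has gone away
    altered = lambda _cid: b"edited"     # a copy that no longer matches
    mirror = lambda _cid: document       # an intact copy held elsewhere
    print(fetch_by_content(cid, [offline, altered, mirror]) == document)
```

Because the identifier depends only on the content, a vanished host is no longer fatal as long as some copy survives, and an altered copy is rejected, which also speaks to the earlier worry in this thread about copies of copies.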

  • Niels Olson says:

    In 1998 Tim Berners-Lee, who set up the first web server, attempted to address the problem of linkrot by educating site admins on how to avoid it: Cool URIs don’t change.

  • Eddie says:

    Dr. Tufte,

    Your name was brought up by a physicist who was commenting recently on Stanford Professor Larry Lessig’s style of using Apple’s Keynote (a tool nearly identical to PowerPoint) to deliver presentations. Lessig’s style is quite minimalist but it seems to be very effective in capturing people’s attention.

    The physicist used the Lessig style and he claims he got great results with his audience. The link to this blog post on Lessig’s blog is here:

    http://lessig.org/blog/2008/04/a_physicist_on_the_lessig_styl.html

    I would be very interested in what you make of the Lessig style and whether a hybrid Lessig – Tufte approach to giving presentations could, in fact, be a match made in heaven.

    Thanks,

    Eddie
