Oops and bitrot

By David Somers

I was doing some administration on the server that hosts this research blog and I noticed a few errors in the web server logs. Normally these are benign, but this time I noticed that they were for some recent journal posts that I made, so something was going wrong.

The two posts that were having “issues” were Seeing Antoni Tàpies, Collection, 1966-1976 and Seeing Harun Farocki, Empathy.

What is strange is that I test my website on a development server, where everything was fine, but on the production server the dreaded 404 error appeared whenever anybody attempted to read these posts. My website is normally pretty bullet-proof and I expect it to work. Internal 404 errors should not happen because the site is generated by Jekyll, which takes care of such things.

After a bit of investigation it seems the production web server I am using does not like pages with accents in the name, and those two posts had a file with an accent in the name. So a quick-and-dirty “fix” is to rename the files sans accents, regenerate the site, deploy it, et voilà, it works as expected.

Note that I referred to the issue happening on the production web server and not the development one. This is because the two run different web servers, and the issue only affected production. Yes, I should really deploy via an intermediate UAT server running the same environment and processes as production to check for such things. Note also that the “fix” is not really a fix: something is going wrong when Jekyll generates slugs, which should automatically cope with accented characters. After a bit of further investigation, I found that there is an upstream patch to cope with this case.
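To illustrate what "coping with accented characters" could look like, here is a minimal sketch of ASCII-only slug generation. It is not Jekyll's implementation and not the upstream patch; the function name `ascii_slug` is hypothetical, and the approach (Unicode decomposition, then dropping whatever is not ASCII) is just one common way to do it.

```python
import re
import unicodedata


def ascii_slug(title: str) -> str:
    """Return a lowercase, hyphen-separated, ASCII-only slug (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    # Dropping non-ASCII code points removes the combining marks, and also any
    # character that has no ASCII decomposition at all.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(ascii_slug("Seeing Antoni Tàpies, Collection, 1966-1976"))
# prints: seeing-antoni-tapies-collection-1966-1976
```

A slug built this way contains only characters that any web server should serve without complaint, at the cost of losing the accent in the URL.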

To scan through the server log and get a nice list of all the inbound 404s, I invoked the following command-line incantation:

awk -F\" '{split($3,ar," "); if(ar[1] == "404")print $2}' mava.log | sort | uniq -c | sort -n
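For anyone who prefers a script to a one-liner, the same idea can be sketched in Python. This assumes, as the awk command does, that each log line quotes the request and puts the status code immediately after it (as in the common and combined log formats); the function name `count_404s` and the sample lines are made up for illustration.

```python
from collections import Counter


def count_404s(lines):
    """Count request lines whose response status is 404."""
    counts = Counter()
    for line in lines:
        # Splitting on double quotes mirrors awk -F\": the second field is the
        # request line and the third field starts with the status code.
        fields = line.split('"')
        if len(fields) < 3:
            continue  # line is not in the expected format
        after_request = fields[2].split()
        if after_request and after_request[0] == "404":
            counts[fields[1]] += 1
    return counts


# Made-up sample lines in combined log format.
sample = [
    '203.0.113.5 - - [01/Jan/2017:10:00:00 +0000] "GET /missing/ HTTP/1.1" 404 162 "-" "bot"',
    '203.0.113.6 - - [01/Jan/2017:10:00:01 +0000] "GET /missing/ HTTP/1.1" 404 162 "-" "bot"',
    '203.0.113.7 - - [01/Jan/2017:10:00:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "browser"',
    '203.0.113.8 - - [01/Jan/2017:10:00:03 +0000] "GET /wp-login.php HTTP/1.1" 404 162 "-" "bot"',
]

for request, hits in count_404s(sample).most_common():
    print(hits, request)
```

To run it against a real log, pass an open file instead of the sample list, e.g. `count_404s(open("mava.log", encoding="utf-8", errors="replace"))`.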

As well as the two links I discovered above, there are a few others. Most are due to robots probing the site with intentions good and bad: the good ones are looking for universal links; the bad ones are looking for vulnerabilities, typically checking whether certain JavaScript libraries are installed in order to exploit them later.

Being slightly paranoid about broken links, I then ran a link checker (the most excellent Integrity) on the website itself and discovered a few broken external links. There were 18 broken links in total, including the two above. They were due to the following factors:

  • fat fingers, i.e. mistyped URLs
  • out-of-date links, i.e. 301: the resource had moved and I could easily update to the redirected URI
  • dead links, i.e. the resource was not found and I couldn’t locate where it had gone to
  • temporary errors, i.e. 503 (service unavailable)
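The causes above map fairly naturally onto HTTP status codes, so a first-pass triage can be automated. The sketch below is only an illustration: the function name `triage` is hypothetical, and the codes beyond 301 and 503 (308, 404, 410, 429, 502, 504) are my own assumptions about what belongs in each bucket. A status code alone cannot tell a mistyped URL from a dead one, so those share a bucket.

```python
def triage(status: int) -> str:
    """Map an HTTP status code to a rough cause (hypothetical helper)."""
    if status in (301, 308):
        # Permanent redirects: the resource has moved for good.
        return "out-of-date: update to the redirect target"
    if status in (404, 410):
        # Not found or gone: could be a typo or a dead resource.
        return "mistyped or dead: check the URL by hand"
    if status in (429, 502, 503, 504):
        # Rate limiting or server trouble: likely to clear up by itself.
        return "temporary: retry later"
    if 200 <= status < 400:
        return "ok"
    return "unknown: inspect manually"


for code in (200, 301, 404, 503):
    print(code, triage(code))
```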

With the benefit of hindsight, what I should have done is check my external and internal links more often. While apps such as Integrity are great, it is better to incorporate the checks automatically into the build and deploy toolchain instead of manually running an app.
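As a rough idea of what such an automated check might look like, here is a minimal sketch that walks a generated site directory and reports root-relative links with no file behind them. It assumes the generator writes static HTML into one output directory and that a directory URL is served from its `index.html`; the names `LinkCollector` and `broken_internal_links` are hypothetical. Because it only looks at files on disk, it would not by itself catch a web server that mishandles accented file names.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def broken_internal_links(site_dir):
    """Return (page, href) pairs for root-relative links that have no target file."""
    site = Path(site_dir)
    broken = []
    for page in sorted(site.rglob("*.html")):
        collector = LinkCollector()
        collector.feed(page.read_text(encoding="utf-8"))
        for href in collector.hrefs:
            # Only root-relative links such as /journal/post/ are checked;
            # external, relative, and protocol-relative links are skipped.
            if not href.startswith("/") or href.startswith("//"):
                continue
            # Drop any query string or fragment and decode percent-escapes.
            path = unquote(urlsplit(href).path).lstrip("/")
            target = site / path
            if target.is_dir():
                target = target / "index.html"
            if not target.is_file():
                broken.append((str(page.relative_to(site)), href))
    return broken
```

A build script could call `broken_internal_links("_site")` after the site is generated and refuse to deploy if the returned list is not empty. External links need network access and are better checked on a schedule than on every build.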

After fixing as many links as possible, the research blog has fewer broken links. In time some may work again because the errors were temporary (503), but others are gone forever and I can’t locate where the material has gone. And this has me thinking about “oops and bitrot”. Clearly things that are “oops” can be fixed. But the “bitrot” is more serious.

In the case of Seeing Is it Heavy or Is it Light? at Assembly Point, I had a link to http://ziggygrudzinskas.com/post/138642822910/decompression-chamber-2016-acrylic-dispersion and the material is gone. However, that website has been archived by The Internet Archive, the page in question was successfully captured, and it can be retrieved here.
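Looking for an archived copy can also be scripted. The sketch below uses the Wayback Machine availability endpoint at `https://archive.org/wayback/available`, which returns JSON describing the closest snapshot, if any. The helper names are hypothetical, and the response handling assumes the documented `archived_snapshots` / `closest` layout; treat it as a starting point, not a robust client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability query URL for the page we want to look up."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url})}"


def closest_snapshot(payload: dict):
    """Return the snapshot URL from a decoded response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url: str, timeout: float = 10.0):
    """Ask the Wayback Machine for the closest archived copy (needs network access)."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `find_archived_copy(...)` with a dead link returns either a snapshot URL to link to instead, or `None` when nothing was archived.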

On one hand it is scary that a link that is barely eight months old has gone dead. On the other hand it’s great that an archive of the dead link was found on The Internet Archive. However, it can’t be assumed that this will always be the case. Clearly there are issues related to the curation and maintenance not only of this research blog but also of the external sites that are referenced.
