The up-to-date game

Maintaining a site with close to 18,000 pictures, 11,000+ video links and close to 3,000 articles means having a lot of stuff to keep up to date.

Published on Sep 29. 2021 - 4 years ago
Updated or edited 10 months ago

About GFF

Nerd alert

This is NOT about fly fishing or fly tying, but about site development and nerdy stuff.

Martin Joergensen

The dreaded 404-page

One of my pet peeves when building websites (which I do for a living) is that most site owners and developers don’t care in the least about “link rot”.
Link rot is a term used for links that either disappear or change in such a way that they can’t be reached anymore. This means that when people like me link to them in an article, the poor user clicking on the link will be met with an error.
I have written about this phenomenon before in the article “Preserving old URLs” where I described how I have tried to keep all old GFF URLs alive, and tried to make sure that incoming links from other sites and search engines never get stuck with the dreaded 404 error, indicating that the page hasn’t been found.

You lose visitors, you lose respect and you lose business when a potential customer hits a page that says: sorry, not found!
It’s actually not that difficult to catch these errors in an intelligent way, but I have yet to see other developers who systematically make sure that their clients’ sites don’t break completely when upgrading or moving to a new system.

Now, I’m at both ends of the field here: I’m both the developer who maintains the site (and its URLs) and the editor who links to other sites and tries to keep these links intact and valuable to my readers.
The first part is under my full control, the second not so much… in spite of me spending an inordinate amount of time chasing broken links and trying to update them. I have no idea how many outgoing links this site has, but a quick search for http in the database table containing body texts shows me that at least a tenth of the 37,361 texts have links in them. A major part of those have more than one.

A fairly simple SQL command actually counts all the occurrences of http in the body field in Drupal 7:

SELECT SUM(ROUND((LENGTH(body_value) - LENGTH(REPLACE(body_value, "http", ""))) / LENGTH("http"))) FROM field_data_body;

It’s pretty simple: it takes the length of the original text, removes the http’s from it and gets the length again, and then divides the difference by the length of http. Voilà!
In my case this results in the number 16,554. So there are most likely more than 16,000 outgoing links in my content! I can even replace http with https and find that 7,142 of these links point to https pages.
Source: https://stackoverflow.com/questions/12344795/count-the-number-of-occurrences-of-a-string-in-a-varchar-field
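
The same arithmetic can be sketched outside the database. The Python below is only an illustration of the length-difference trick; the sample texts and the function name count_occurrences are made up for the example and are not taken from the GFF database. It also shows why the http count includes the https links: every "https" contains "http".

```python
def count_occurrences(text: str, needle: str) -> int:
    """Count occurrences of needle using the length-difference trick."""
    if not needle:
        raise ValueError("needle must not be empty")
    # Remove every occurrence, then see how much shorter the text became.
    removed = text.replace(needle, "")
    return (len(text) - len(removed)) // len(needle)


# Hypothetical body texts standing in for rows of the body field.
bodies = [
    'See <a href="http://example.com/a">this</a> and '
    '<a href="https://example.org/b">that</a>.',
    "No links in this text.",
    "Plain URL: https://example.net/c",
]

total_http = sum(count_occurrences(body, "http") for body in bodies)
total_https = sum(count_occurrences(body, "https") for body in bodies)
print(total_http, total_https)
```

Note that this counts the string http wherever it appears, so the result is an estimate of the number of links rather than an exact figure.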

Link checker log

OK, enough of that and back to the main case: I have a ton of broken links in my content, and only discover them when I by chance click one that fails or when a reader tells me that it does. That’s a very unsystematic and inefficient way of finding errors and not one that’s even remotely likely to spot them all.
Luckily there’s help to be found from our friends the machines – and the people who develop programs for these machines. By installing the Drupal 7 module Link checker I get the systematic approach that I want. This module looks at text and link fields in nodes, tries to access the links it finds one by one, and reports back if they fail. I get notified in one of several ways: as a warning when I edit the node, as a long list of all the failed links, or by looking in the log where the module reports its results.
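
To make the idea concrete, here is a minimal Python sketch of what such a check does in principle. It is not the Link checker module's code: the function names, the simplified URL pattern and the sample node texts are all assumptions made for illustration. The status lookup is passed in as a function so the example runs without touching the network.

```python
import http.client
import re
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Tuple

# Simplified assumption: a URL ends at whitespace, a quote or an angle bracket.
# Trailing punctuation in running text would need extra handling.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def extract_links(body: str) -> List[str]:
    """Return the unique http(s) URLs in a text, in order of appearance."""
    return list(dict.fromkeys(URL_PATTERN.findall(body)))


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a HEAD request, or 0 if unreachable.

    urllib follows redirects on its own, so this reports the final status.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (OSError, http.client.HTTPException):
        return 0


def find_failed_links(
    bodies: Dict[int, str],
    get_status: Callable[[str], int] = fetch_status,
) -> List[Tuple[int, str, int]]:
    """Check every link in every text and collect the ones that fail."""
    failures = []
    for node_id, body in bodies.items():
        for url in extract_links(body):
            status = get_status(url)
            if status == 0 or status >= 400:
                failures.append((node_id, url, status))
    return failures


# Hypothetical node texts and canned statuses instead of real requests.
bodies = {
    101: 'Read <a href="http://example.com/gone">this</a> and '
         '<a href="https://example.org/fine">that</a>.',
    102: "No links here.",
}
fake_statuses = {"http://example.com/gone": 404, "https://example.org/fine": 200}
failed = find_failed_links(bodies, get_status=lambda url: fake_statuses.get(url, 0))
print(failed)
```

Leaving out the get_status argument would make the sketch send real HEAD requests through fetch_status.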

Link checker options

Link checker can be enabled per content type, and can be set up to run on cron, methodically going through all the selected content types and reporting the errors. It can even be set to correct them if a “301 Moved Permanently” is found. In this case the failing site has done the right thing: telling all browsers and user agents that a page has moved, and showing them where to look for the new page (GFF does the same thing with outdated links). This enables Link checker to simply replace the failing link with the new and correct one, which saves me from going through the almost endless list (980 failed links right now) and doing the same thing manually.
Of the 980 failed links, more than 730 returned a 301 status, which is a redirect rather than a real error, and this means that Link checker can most likely fix the problem automatically. Yay!
This is a slow process. I have set the checker to look at links once a month, and it takes a small bunch at a time, so a full check will take a while. But it gives me a fair chance of fixing the errors that other sites have introduced by moving pages to new URLs. The only ones lost are the ones where they are not properly redirecting the users. These I have to look after manually, and edit - or delete if they are completely gone.
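
As an illustration of what such an automatic correction amounts to, the Python sketch below swaps an old URL for its redirect target and leaves everything else in the text alone. It is an assumption about how this could be done, not how Link checker does it; the function names are hypothetical, and the [img:12|left] code is an invented placeholder for custom markup, not the real GFF syntax.

```python
import re
from typing import Dict


def replace_url(body: str, old_url: str, new_url: str) -> str:
    """Replace old_url with new_url only where old_url is a complete URL.

    The lookahead refuses a match when the URL continues, so replacing
    http://example.com/a leaves http://example.com/ab untouched.
    """
    pattern = re.compile(re.escape(old_url) + r"""(?![^\s"'<>])""")
    # A function as replacement avoids backslash surprises in new_url.
    return pattern.sub(lambda _match: new_url, body)


def apply_redirects(body: str, redirects: Dict[str, str]) -> str:
    """Apply a map of permanently moved URLs to one text."""
    for old_url, new_url in redirects.items():
        body = replace_url(body, old_url, new_url)
    return body


# Hypothetical text with invented custom markup and two similar URLs.
body = (
    '[img:12|left] <a href="http://example.com/a">A</a> and '
    '<a href="http://example.com/ab">AB</a>'
)
redirects = {"http://example.com/a": "https://example.com/new-a"}
updated = apply_redirects(body, redirects)
print(updated)
```

The point of the narrow pattern is that nothing but the exact moved URL is rewritten.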

WARNING!

Be warned: I set up Link checker to automatically update nodes per the settings mentioned above, and slowly and systematically it severely messed up more than 150 articles on the site before I found out what was going on!
I seriously urge you to test this function before using it on a production site, and make sure it doesn't do anything similar to your content.
As I have mentioned elsewhere, I use some custom markup on most GFF pages, which helps the system place images in the proper and desired manner. Link checker obviously doesn't like these codes, or for some reason decides to change them as well as the links, which is definitely not a thing it should do. Essentially, Link checker should touch nothing but the faulty URLs, but it does. The result in my case was about 150 articles that would either not load at all or only load partially and leave the user with an HTTP 500 Internal Server Error. This error was thrown by my image code script, but it was triggered by the corrupted image markup that Link checker saved in the nodes. I have no idea why this happened. The image markup is semantically correct and should pass through any code handling smoothly (and unchanged), but that wasn't the case. Luckily I have revisions on all content and could revert all the articles to the latest functional version, but it was a long and tedious piece of manual work!
So: activate revisions and test Link checker's automatic function thoroughly before releasing it on live data!
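
One way to test this on a copy of the data is to compare each text before and after the automatic update and confirm that nothing except URLs has changed. The Python below is a minimal sketch of such a check under a simplified assumption about what a URL looks like; the function name and the [img:12|left] markup are invented for the example.

```python
import re

# Simplified assumption: a URL ends at whitespace, a quote or an angle bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def only_urls_changed(before: str, after: str) -> bool:
    """Return True if the two texts are identical once URLs are masked."""
    placeholder = "<URL>"
    return URL_PATTERN.sub(placeholder, before) == URL_PATTERN.sub(placeholder, after)


# Hypothetical revisions of one text; the image code is invented markup.
before = '[img:12|left] <a href="http://example.com/old">Shop</a>'
good_update = '[img:12|left] <a href="https://example.com/new">Shop</a>'
bad_update = '[img:12|left <a href="https://example.com/new">Shop</a>'

print(only_urls_changed(before, good_update))
print(only_urls_changed(before, bad_update))
```

Any text where the check returns False is a candidate for reverting to the previous revision before it reaches the live site.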

The revert game

Looking for dead videos

But… that’s only the links. I also have more than 11,000 videos embedded in the video channel, and those also wither and disappear. Many videos get taken down, are hidden from the public or moved to a new location. All this means that the video won’t show on GFF, and the user will see… nothing.
To avoid this, I have made my own little Video checker, which will look at a number of random videos at a time, try to fetch them from YouTube or Vimeo, and if they fail, simply unpublish them on the site and mark them as “unfound” in a field made for the purpose. One less video to see, sure, but also one less failing video, and hopefully fewer frustrated users.
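
The logic of such a checker can be sketched in a few lines of Python. This is not the actual GFF script: the Video class, its field names and the function check_random_videos are assumptions for illustration, and the lookup against YouTube or Vimeo is left as a pluggable function, faked here so the example runs offline.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Video:
    """Hypothetical stand-in for a video node on the site."""
    url: str
    published: bool = True
    unfound: bool = False


def check_random_videos(
    videos: List[Video],
    is_available: Callable[[str], bool],
    sample_size: int = 5,
    rng: Optional[random.Random] = None,
) -> List[Video]:
    """Check a random batch of published videos and unpublish the dead ones."""
    rng = rng or random.Random()
    candidates = [video for video in videos if video.published]
    batch = rng.sample(candidates, min(sample_size, len(candidates)))
    failed = []
    for video in batch:
        if not is_available(video.url):
            # Hide the video and flag it so it can be reviewed later.
            video.published = False
            video.unfound = True
            failed.append(video)
    return failed


# Made-up video URLs; the availability check is a canned lookup.
videos = [
    Video("https://video.example/watch/1"),
    Video("https://video.example/watch/2"),
    Video("https://video.example/watch/3"),
]
still_online = {"https://video.example/watch/1", "https://video.example/watch/3"}
dead = check_random_videos(videos, lambda url: url in still_online, sample_size=3)
print([video.url for video in dead])
```

Checking a small random batch per run spreads the requests out over time, at the cost of a dead video possibly staying visible for a while before it is picked.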

My last update task is one that is my own fault. This site has existed since 1994, and back then showing images online was a different game compared to now. Images were much smaller and the quality was lower. And disk space was a scarce resource, and for that reason I didn’t upload large images, but meticulously – and manually – scaled and compressed all images to the size needed.
Nowadays the CMS takes care of all that. I upload all images in a nice, large size, and Drupal’s image handling as well as my own routines will scale them down to the proper size needed. And the original will still be available, meaning that if a new size is needed, a new copy can be made in the right size from that.
But there are a whole lot of articles from before that era, which all have (too) small images. I have made a script that finds some of them, and when I locate one, I try to find the original in my archive and replace it with a sufficiently larger and “smart” image, which Drupal can scale. In other cases there are no originals to be found, and I’ll sometimes have to upscale the small copy, usually a no-no in image handling, but sometimes necessary in order to get the image to fill the space set aside for it.
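
A script like that can be as simple as filtering on image width. The Python below is a rough sketch under assumptions: the ImageRecord fields, the sample data and the 800 pixel threshold are all hypothetical and not taken from the GFF setup.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: anything narrower counts as "too small".
MIN_WIDTH = 800


@dataclass
class ImageRecord:
    """Hypothetical record of an image used in an article."""
    node_id: int
    path: str
    width: int
    height: int


def find_small_images(
    images: List[ImageRecord], min_width: int = MIN_WIDTH
) -> List[ImageRecord]:
    """Return the images narrower than min_width, smallest first."""
    small = [image for image in images if image.width < min_width]
    return sorted(small, key=lambda image: image.width)


# Made-up sample data.
images = [
    ImageRecord(12, "old/streamer.jpg", 240, 160),
    ImageRecord(13, "new/pike-fly.jpg", 2400, 1600),
    ImageRecord(14, "old/reel.jpg", 400, 300),
]
for image in find_small_images(images):
    print(image.node_id, image.path, image.width)
```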