Link Lint (Broken HTTP links) using Parallel processing

This script is designed to check many different websites for broken links at once and uses a multi-process approach to speed up checking times.

#!/bin/bash
# This script was created by Michael Kubler
# On the 26th of May, 2010
# for ANAT, the Australian Network for Art and Technology
# Title : Link Lint website link checker cronjob
# Description : This should be run as a weekly cronjob checking the anat websites for invalid files or broken links.
# Crontab listing : 20 10 1 * * ~/.cron/LinkLint_site.sh
# Requires : LinkLint
# Optional : The LinkLint information as a site domain
DIR='/home/anat/linklint'
echo "This document is created by the LinkLint cronjob script on $HOST and contains a listing of all the errors on the listed websites" > $DIR/1_all_errors.txt
EMAIL='[email protected]' # Where to send the email
SITES_ARRAY=(w00t.anat.org.au surfacetension.anat.org.au) #THE IMPORTANT BIT! A listing of the sites to check
SITE_DOMAIN='http://office.anat.org.au/linklint' # An address to access the other Link Lint info. Leave the trailing slash. If you don't want the information hosted by Apache to the world you could try including a network address, or local file address if run on your computer.
 
echo '- Running Linklint'
for ix in ${!SITES_ARRAY[*]}
do
 SITE=${SITES_ARRAY[$ix]}
 ((linklint -host $SITE -http -net -doc $DIR/$SITE/ -cache $DIR/$SITE/ /@) &> /dev/null) &
 ## Above is the most important line. It spawns all the LinkLint processes. The & at the end  causes it to be 
 echo Started linklint on $SITE
done
echo
echo '- Displaying number of still running LinkLint processes'
echo ' NB : This may take some time...'
grepfor='linklint'
## This section gives you something to look at when you are waiting for it to process ##
# Note if pgrep isn't working you can try the following to get a listing of the number of running processes.
# ps augx | grep /usr/bin/linklint | grep -v grep | wc -l
while [ `pgrep $grepfor | wc -l` != 0 ];
do
 echo -n `pgrep $grepfor | wc -l`
 echo -n ', '
 sleep 15
 ## Checks every 15 seconds to see how many link lint processes are still running. ##
done
echo '' # Used to add a new line
echo '- Done Link Linting'
## Collating error files
for ix in ${!SITES_ARRAY[*]}
do
 SITE=${SITES_ARRAY[$ix]}
 echo "#########  $SITE  #############" >> $DIR/1_all_errors.txt
 if [ -f $DIR/$SITE/error.txt ]
 then
 ## There's an error (or few), so display the listing of errors.
 echo "$SITE_DOMAIN/$SITE/index.html" >> $DIR/1_all_errors.txt
 cat $DIR/$SITE/error.txt >> $DIR/1_all_errors.txt
 fi
done
echo '- Done collating errors'
## Email a list of all the errors.
CURRENTDATE=$(date '+%d-%m-%Y') #The date (e.g 23-02-2008)
mail -s "[LinkLint] sites errors - $CURRENTDATE" $EMAIL < $DIR/1_all_errors.txt
echo "- Done sending email to '$EMAIL' with the subject '[LinkLint] sites errors - $CURRENTDATE'"
echo "If any sites have errored you can check them individually by running :"
echo "> linklint -host $SITE -http -net -doc $DIR/$SITE/ -cache $DIR/$SITE/ /@"
echo '- Script complete'
exit 0


How to use :

  1. Start off by installing LinkLint. In Ubuntu it’s as simple as ‘sudo apt-get install linklint‘. You’ll have to use Yum if you wear a Red Hat, (or similar distro), or compile from source if your a Gentoo user… But really if you compile everything from source you wouldn’t be reading this now would you?
  2. Set the scripts variables :
    DIR
    : The main directory to store all the cache and error files (a sub-folder is created for each site)
    EMAIL
    : Pretty basic, put the email address you want to receive the primary errors to
    SITES_ARRAY: This is a space separated list of sites. The example above is only a small number of the sites we run (and check).
    grepfor : Note that if the LinkLint process isn’t running as it’s name, or pgrep (process grep) can’t find it, then you might need to tweak this setting.
  3. Run it and ensure it works
    You’ll need to ensure you have give the script executable permissions e.g
    sudo chmod u+x linklint_script.sh
    Then you can run it
    ./linklint_script.sh
  4. You should get some output on the command line like the following
    Started linklint on w00t.anat.org.au
    Started linklint on www.anat.org.au
    Started linklint on synapse.net.au
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, – Done LinkLinting
    – Done collating errors 

     

     

     

  5. Check your email, you should have received an email with a list of the domains and if there is any sites with errors it will display a brief copy, and you should be able to click the links (or if your email reader doesn’t auto link text emails, you’ll have to copy->paste) so that you can see the more descriptive cross referenced errors and the like.

I suggest having a little play with LinkLint as you will likely need to run it on a single site at various occasions. This is especially true when the site has a lot of issues and you want to know if your changes fixed them. An example was a WordPress website which managed to direct itself to itself in a way that recursively added a link to a random, non-existent image then would attempt to re-scan the entire site again but with the image link as it’s new root, adding an extra copy each recursion until it broke the ~4,000 character http URL limit, and didn’t make Apache (or Webalizer) very happy. I suspect that the issue was only noticeable to reasonably simple website crawlers, but given the problem it likely also caused SEO issues with Google, Yahoo and the like.

It should also be mentioned that this is a reasonable way to put your server under minor load, but over a wide range (compared to running Apache benchmark which puts your server under high load but usually to a single point). The downside is that it can be hard to manage your log files.

Webalizer : If you are using a log file analyser such as Webalizer (or AWStats), then you will probably need to tell it to ignore the User Agent “LinkLint-spider/x.x.x”. The x’s indicate the version number, which is ‘2.3.5’ at the time of writing.

As a little extra, here is a graph of the number of running Linklint instances.

 

Graph created using Flot

This shows that one instance took over 16 minutes to complete. The site in question was Filter.org.au and is a decent sized wordpress website with lots of little links and images, such as a comments version of the same page and the like.

 

History

ANAT has a LOT of sub domains (over 58) which it has accumulated over the years.
Some are ad-hoc HTML pages, some are custom coded PHP pages, a number of them are WordPress sites.
As the resident tech, trying to ensure all of the sites are functioning correctly can be hard, especially after migrating to a new server… Something we have done over 3 times in the last 12 months.

Bring in Link Lint, one of the many Linux based HTML link checkers. It will scan the selected website for a variety of link issues and present the information in different ways. It’s also reasonably fast for what it’s doing… but not when you are scanning lots of domains.

In standard tech fashion Dale and I got sick of waiting for a basic serial script to run. It was a simple for loop which would run LinkLint on a site, then once it had been completed would run it on the next site.

Some bash scripting fu later and it now creates a new LinkLint process for each site, dramatically speeding the script up, so we now are only limited by the slowest website, not the total time of all websites.

The main downside is that this puts a much greater load on the server. So if you are trying to run 300 sites from a solar powered graphics calculator which you’ve custom modded to run a very very minimal LAMP stack, then this will probably over power the server.

About michaelkubler

Michael Kubler was the Technical officer for ANAT during 2010. He is a Linux system admin, PHP programmer, and innovative geek who wants to change the world using disruptive innovation.
This entry was posted in Code. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *