Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 4 (July 1996)

Everybody wants a web-hit counter. I'm not sure why, but everybody seems to want one, at least judging from comp.infosystems.www.authoring.cgi (and friends).

Well, I don't think web counters themselves are all that interesting, but one problem that every web counter must face is, in fact, very interesting: the count must not be updated by more than one access at a time, so a web-counter script must deal with simultaneous CGI programs accessing the same data.

So, let's take a look at that problem. How do we keep two CGI programs, both wanting to turn ``1742'' hits into ``1743'' hits, from doing things at the same time, which would either cause one of the hits to be missed (gasp!) or perhaps mess up the count entirely (double gasp!)? Well, we need to cause one (or more) of them to wait while a solitary CGI updates the count appropriately.

One of the easiest ways to do this is with the built-in Perl flock() operator. This operator tells the UNIX kernel that we want to arrange some sort of shared or exclusive access (in this case, exclusive) to a particular open file. If a file is ``flocked'' exclusively, and a second process comes along wanting the same, it gets blocked until the process with the flock releases it (by closing the file, calling flock() again with the proper parameter, or simply exiting).
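
For instance, here's a minimal sketch of the hit-counter update described above, using the same flock() idiom that shows up later in Listing 1. The filename ``counter.data'' is purely hypothetical; nothing in the listing uses it.

        ## a minimal sketch, not part of Listing 1:
        open COUNT, "+>>counter.data" or die "cannot open counter: $!";
        flock COUNT, 2;                 # 2 == LOCK_EX; wait for exclusive access
        seek COUNT, 0, 0;               # rewind so we can read the current count
        my $hits = <COUNT>;             # undef if the file is brand new
        $hits = 0 unless defined $hits;
        $hits++;                        # 1742 becomes 1743
        seek COUNT, 0, 0;               # rewind again before rewriting
        truncate COUNT, 0;              # throw away the old count
        print COUNT "$hits\n";          # write the new count
        close COUNT;                    # this also releases the lock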

If your system doesn't support flock() but does have lockf(), the latest Perl (5.002 as of this writing) automatically emulates flock with lockf. If you don't have either, you'll have to resort to more primitive lock techniques, but the concepts here are the same.
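
One common primitive technique is a separate lockfile created with sysopen() and the O_EXCL flag, along these lines (again, ``counter.lock'' is just a made-up name for this sketch):

        ## a rough sketch of a lockfile, for systems without flock() or lockf():
        use Fcntl;                      # provides O_WRONLY, O_CREAT, O_EXCL
        until (sysopen LOCK, "counter.lock", O_WRONLY|O_CREAT|O_EXCL) {
          sleep 1;                      # someone else holds it; try again shortly
        }
        ## ... update the shared data here ...
        close LOCK;
        unlink "counter.lock";          # release the "lock"

The downside is that a program which dies before the unlink() leaves a stale lockfile behind, which is one more reason to prefer flock() when you have it.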

For this column, I didn't want to write ``yet another web counter'', so I took a slight detour. I created a CGI script that handles a ``most recent visit log'', keeping track of the 10 most recent distinct hostnames that have visited the page.

I call this program ``mrv'', for ``most recent visitors''. The source to ``mrv'' is contained in Listing 1 [below].

Lines 1 and 2 are the beginning of all good CGI programs, enabling taint mode (-T), warnings (-w), and ``use strict''.

Line 7 defines a constant $MAX, here 10. This constant selects how many distinct hosts will be remembered in the ``mrv'' file for a particular web page.

Line 10 unbuffers the output, and line 11 sets the execution PATH for any child processes. I didn't use any child processes in the final version, but during testing, I sometimes inserted

        system "/bin/pwd";

to ensure that I was in the right directory. I've left it here to remind you to do the same if you write similar stuff.

Line 14 outputs a standard HTTP header, declaring the remaining output to be of MIME type ``text/plain''. For a CGI script such as this, it's important to get the header right so that it can be included properly.

Nearly all of the remaining program is established within an eval block, beginning in line 17. The purpose of this eval block is to trap any possible reason for exit, and translate it into a nicer looking message, as well as avoid the dreaded ``something went wrong'' error message from the web server. If the block exits normally, $@ (checked in line 75) will be empty. If anything goes wrong (like an internal error, or ``die'' is executed), the block will return immediately, with $@ set to the error message.
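
The overall shape of that trap looks something like this sketch (the variable $some_problem here is a made-up stand-in, not something from the listing):

        ## a sketch of the eval-block error trap (not the listing itself):
        my $some_problem = 0;           # stand-in for a real failure condition
        eval {
          ## ... the real work goes here ...
          die "something descriptive\n" if $some_problem;
        };
        print "[error: $@]" if $@;      # $@ is empty if the block finished normally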

Lines 20 and 21 copy the environment variables DOCUMENT_URI and REMOTE_HOST into local variables, which are easier to type. DOCUMENT_URI should be the URI of the document from which this CGI is being included, minus any protocol and host prefix (at least it was on Teleport's Apache server). REMOTE_HOST should be the hostname (or host number, if the name doesn't translate) that initiated the request.

Line 24 splits up the URI into a directory part and a file part, using the regular expression match operator. Because the * within a regular expression is ``greedy'', the left-most set of parentheses will surround the longest possible string of characters that still allows the entire regular expression to match. This ensures that the enclosed slash is the rightmost slash. Also note that I am using a comma to delimit the pattern, rather than the traditional slash. The ``s'' suffix causes the ``.'' character to match newlines in addition to its usual everything-except-newline -- essentially the entire character set.
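
For example, with a made-up URI (this is just an illustration, not part of the listing):

        ## a hypothetical illustration of the match in line 24:
        my ($dir, $file) = "/~merlyn/column4.html" =~ m,(.*/)(.*),s;
        ## $dir is now "/~merlyn/", and $file is "column4.html"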

The value in $dir will be something like /~merlyn/something for entries below my HTML dir. However, this represents the UNIX path /home/merlyn/public_html/something at Teleport, and I need to get to the actual directory of the page being served. Lines 27 to 31 massage a string of the first form into a string of the second form. Obviously, this section is very Teleport-specific, so you'll have to hack it appropriately for your site.
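
To make that concrete, here's a tiny made-up illustration (not part of the listing) of what lines 27 through 31 do to a typical value:

        ## if $dir is "/~merlyn/columns/", then $1 captures "merlyn",
        ## $2 captures "columns/", and $dir becomes
        ## "/home/merlyn/public_html/columns/"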

The value in $dir at line 34 is the UNIX path containing the HTML file from which this CGI is being included. The next step is to change directory to there. We could have also simply constructed absolute pathnames prefixed with $dir, but for me it was easier to work with an appropriate current directory.

Lines 37 and 38 set up $mrv, which will hold the name of the ``mrv'' file. This file is named the same as the HTML file from which we are being included, but with ``.mrv'' appended to the end. For safety, we restrict the filename characters to alphanumerics and period, throwing away the rest.

Line 39 opens the ``mrv'' file, using the read-write append mode. If the file cannot be opened (or created), an appropriate die gets us out of the eval block again. Append mode is handy because it creates the file if it doesn't exist, but doesn't blow it away if it does.

Line 40 begins the critical region. By invoking ``flock'' with an exclusive lock value, this process will stop until it can get exclusive access to the file opened on the MRV filehandle. This exclusive access is advisory: it applies only amongst programs that also use flock() in the same way--any process can still read the file (or even write to it) without obtaining a lock. But only one program at a time will be allowed to flock() the file in this manner, so we are guaranteed to be the only process holding a flock() immediately after this line.

The flock() will automatically be released upon close, or upon any exit of the program. Thus, the time for which this process has the file exclusively is from here down to line 61. Any other process that comes along while we are between these two lines will be waiting at its own line 40 until we hit line 61.

Once we have the file exclusively, it's just a little file manipulation, being careful not to close the filehandle or remove the file. Line 45 moves the file pointer back to the beginning of the file, and line 46 reads the contents into an array variable called @content. This array will contain one entry for each line in the file, and the file contains one line for each host address that has most recently visited this page.

Lines 49 and 50 adjust @content so that it contains entries that do not match $rhost--the host that originally requested the page. Line 51 attaches this host to the end of the (now possibly reduced) list.

Lines 52 and 53 save the rightmost 10 elements of @content (or whatever we've set $MAX to). I'm not sure the ``if'' part was necessary, because the third parameter to splice() would simply go negative in that case, but I put it in there to be obvious about what I'm doing. There are probably 35 other ways to do this in Perl, so don't sweat it if you wouldn't write it this way.
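
Here's a made-up demonstration of that trimming, with $MAX shrunk to 3 so the effect is easy to see (none of these hostnames come from the listing):

        ## illustration of lines 52 and 53 only:
        my $MAX = 3;
        my @content = ("a.com\n", "b.com\n", "c.com\n", "d.com\n", "e.com\n");
        splice(@content, 0, @content - $MAX) if @content > $MAX;
        print @content;                 # only c.com, d.com, and e.com remain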

Lines 56 through 58 rewrite the updated value of @content back out to the file. Here, it's important to again seek to the beginning of the file (line 56), and delete the existing contents (line 57) before writing the data (line 58). Note that we cannot simply say:

        open MRV, ">$mrv";

in this case, because it would cause us to lose the flock() on the filehandle, and that flock() is precious. When most people report intermittent behavior on simultaneous access, it's because they haven't considered stuff like this, so be careful.

Finally, line 61 closes the filehandle, which also frees the flock(). We're done with the file, and we have the information we need, so why not?

Lines 65 to 70 transform the @content array into something nice and English-like. If the array has only one element, it's used as-is. For two elements, a single ``and'' is inserted between them. For more than two elements, the list is joined with comma-spaces, followed by a final ``and'' and the last element.
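
For some made-up host lists, that formatting comes out roughly like this:

        ##   ("c.com")                     =>  "c.com"
        ##   ("b.com", "c.com")            =>  "b.com and c.com"
        ##   ("a.com", "b.com", "c.com")   =>  "a.com, b.com, and c.com"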

Line 75 is the error-catcher. If the eval block exits normally, $@ is empty here, and the ``if'' test fails. If $@ has something in it, the eval block exited abnormally. The print encloses the error message in brackets with an identifying tag so that the user can see what happened. If you'd rather say nothing on an error, you can print the error message to STDERR instead and exit without further output. The error message will then show up in the web server's logs.

Now, to set this up, you'll have to put this thing into the CGI-bin area on your web server. Let's say you install it as ``/cgi-bin/mrv'', for example.

You'll then need to reference it from an HTML file, using the correct server-side-include directive. This also implies that it must be done from a file that is SSI-parsed, which may require certain conditions on the file. For example, you may have to adjust an ``.htaccess'' file to do that, or name the file something like ``whatever.shtml'', or turn on the ``execute bit''. Check with your webmaster about how to do that. (If you are the webmaster, and you still don't know, it's gonna be a little tough. Try asking around.)

For an Apache server, the easiest SSI directive looks something like this:

        <p>The most recent visits to this page came from
        <!--#include virtual="/cgi-bin/mrv" -->.

Remember that the output from mrv is something like ``host1, host2, and host3'', and write the text around it accordingly.

You'll also need to create the mrv database, which is a file located in the same directory as the web page, named the same as the page but with ``.mrv'' appended to the end of the filename. The file must also be readable and writable by the user-id of the web server (typically ``nobody'' == -2), so the easiest approach is just to make it world-writable. For example, for a file called ``fred.html'', this'll get us started:

        touch fred.html.mrv
        chmod o+rw fred.html.mrv

So, try it out! Next month, I'll look at a more sophisticated database interaction, combining this technique of exclusive locking with a CGI script that generates forms for a browser. (No, it won't be a ``guestbook'', gaaaack!)

Listing 1

        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     
        =4=     ## mrv: most recent visitors (by host)
        =5=     
        =6=     ## constants:
        =7=     my $MAX = 10;                   # how many visitors to keep
        =8=     
        =9=     ## system stuff
        =10=    $|++;
        =11=    $ENV{PATH} = "/usr/ucb:/bin:/usr/bin";
        =12=    
        =13=    ## HTML stuff
        =14=    print "Content-type: text/plain\n\n";
        =15=    
        =16=    ## the main program (in eval so we can trap problems)
        =17=    eval {
        =18=    
        =19=      ## get the CGI data
        =20=      my $uri = $ENV{DOCUMENT_URI};
        =21=      my $rhost = $ENV{REMOTE_HOST};
        =22=    
        =23=      ## split the URI up, so we know where the file was
        =24=      my ($dir,$file) = $uri =~ m,(.*/)(.*),s;
        =25=    
        =26=      ## massage the directory to get the containing dir
        =27=      if ($dir =~ m,^/~(\w+)/(.*)$,s) {
        =28=        $dir = "/home/$1/public_html/$2"; # teleport specific
        =29=      } else {
        =30=        die "cannot translate dir";
        =31=      }
        =32=    
        =33=      ## go there
        =34=      chdir $dir or die "cannot cd to $dir: $!\n";
        =35=    
        =36=      ## set up mrv file
        =37=      $file =~ /^([a-z0-9.]+)/i or die "Bad filename: $file";
        =38=      my $mrv = "$1.mrv";           # compute untainted mrv name
        =39=      open MRV, "+>>$mrv" or die "Cannot open $mrv: $!";
        =40=      flock MRV, 2;                 # wait for exclusive lock
        =41=      ## from here to the close MRV, we have critical region
        =42=      ## be sure to minimize this time
        =43=    
        =44=      ## get current content
        =45=      seek MRV, 0, 0;               # rewind to beginning
        =46=      my @content = <MRV>;          # get current content
        =47=    
        =48=      ## massage content
        =49=      $rhost .= "\n";
        =50=      @content = grep $rhost ne $_, @content; # eliminate dups
        =51=      push @content, $rhost;        # add new one to end
        =52=      splice(@content,0,@content - $MAX)
        =53=        if @content > $MAX;         # save only last $MAX entries
        =54=    
        =55=      ## write new list, and release file
        =56=      seek MRV, 0, 0;               # rewind again
        =57=      truncate $mrv, 0;             # empty the file
        =58=      print MRV @content;           # print the new content
        =59=    
        =60=      ## release file
        =61=      close MRV;
        =62=    
        =63=      ## now prepare the output
        =64=      ## b; a and b; or a1, a2, a3, ... a9, and b
        =65=      chomp @content;
        =66=      my $last = pop @content;
        =67=      print join ", ", @content;
        =68=    print "," if @content > 1;   # " and " on the next line adds the space
        =69=      print " and " if @content;
        =70=      print $last;
        =71=    
        =72=    };
        =73=    
        =74=    ## if an error, say so:
        =75=    if ($@) { chomp $@; print "[error: $@]"; }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.