Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 6 (September 1996)

Well, I knew it. In my last column, I wrote about fellow columnist Lincoln Stein's new CGI.pm, version 2.20a, which added a nifty feature of writing HTML within a CGI script with minimal effort. I also said at the time that the interface was brand new and might well change before you read it.

Well, sure enough, it did. The keyword ``standard'' became ``:standard'', less than 48 hours after I sent the column to bed. If you had trouble making it work, go back and fix it now, especially if you are using the current rev (2.21) or later.

This time, I'm looking at a different problem, one that doesn't involve HTTP at all (unlike nearly all of my previous columns). Nevertheless, it's a common problem facing website administrators: how do you keep visitors coming back once they've been there?

One of the ways to keep them coming back is changing the content of the site. But then a new problem arises: how do you let visitors know that things *do* change, and where those changes are?

Well, the simplest, mechanical way is to create a ``what's new'' list by examining the timestamps on your HTML files. For this month's column, I wrote a simple script to do just that. (This month's idea comes from fellow Perl hacker Joseph Hall <joseph@5sigma.com>, by the way.)

The script is presented here in Listing 1 [below].

Line 3 enables the compile-time restrictions, as in nearly all the scripts I write. This helps me catch typo-ed variable names and poetry-mode barewords.
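
If you haven't seen the strict pragma in action, here's a tiny made-up example of the kind of error it catches at compile time:

        use strict;
        my $count = 3;
        print $connt, "\n";     # typo! compilation dies: Global symbol "$connt" requires explicit package name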

Line 4 pulls in the standard File::Find module. This module is provided with the Perl distribution. If this line fails when you try to run this script, your installer made a mistake.
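
If you haven't played with File::Find before, the basic interface is just a callback plus one or more starting directories. A minimal sketch (the starting directory here is made up):

        use File::Find;
        # print the full path of everything under /tmp/demo
        find(sub { print "$File::Find::name\n" }, "/tmp/demo");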

Line 5 adds a directory in my home directory to the standard search list for modules. I recently re-organized everything I'm getting from the CPAN (http://www.perl.com/CPAN/) into a nice hierarchy, so all my scripts that use CPAN modules now look roughly the same. I even created a methodology for installation that works really well. (Hmm. Maybe that'll be described in a future column. :-)
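
The one thing to watch is ordering: the ``use lib'' has to come before the ``use'' lines for any modules living in that private directory, because it adjusts the search path at compile time:

        use lib "/home/merlyn/CPAN/lib";        # search my private library first...
        use HTML::Entities;                     # ...so the copy installed there gets found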

Line 6 pulls in the HTML::Entities module from the now-famous LWP library. If you don't have it, the library is in the CPAN; just browse http://www.perl.com/CPAN/authors/id/GAAS for the latest version. I'm using the recently-released version 5. This particular module provides the ``encode_entities'' function, which I need to properly escape text into HTML.
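
As a quick illustration (not part of the script), encode_entities turns the characters that are special to HTML into their entity forms:

        use HTML::Entities;
        print encode_entities("AT&T says 1 < 2"), "\n";
        # prints: AT&amp;T says 1 &lt; 2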

I tried to make this script somewhat adaptable for your use, so lines 8 through 34 form a ``configuration'' section. You should be able to merely change the values of these variables to get the script to work for you.

Lines 9 and 10 define a full path to the top-level directory or directories that will be scanned for new files. This is similar to the first argument (or arguments) to the UNIX find command.

In my configuration, I'm examining my top-level HTML directory on Teleport.

Lines 11 through 14 define a subroutine called PRUNE. This subroutine will be called repeatedly while the file-tree is being walked. PRUNE will be given two parameters: the basename of the file or directory being examined, and the full path of that same file or directory. If you don't want this program to wander into private areas, just return a true value (such as 1), and the find routine will wander away from that directory. This is similar to the ``-prune'' switch of the UNIX find command.

In my particular configuration, I'm avoiding any directory that contains the string ``private'' in its basename.
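
For example, if you also wanted to steer clear of a (hypothetical) directory named ``drafts'' anywhere in the tree, a PRUNE along these lines would do it:

        sub PRUNE {                     # don't search these dirs
          ## $_[0] is basename, $_[1] is full path
          $_[0] =~ /private/ or $_[0] eq "drafts";
        }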

Similarly, lines 15 to 18 define a subroutine called IGNORE, which selects which files are not of interest, or are private. IGNORE is called for each file, with parameters like those of PRUNE. If the routine returns true, then that particular file will not be considered for the final output.

In my configuration, I don't want dot-files showing up (like .htaccess), nor GNU Emacs backup files (ending in ~) or standard-named GIFs or JPEGs.
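
To make that regular expression concrete, here's a throwaway test of the same pattern against a few made-up names:

        for ("index.html", ".htaccess", "old.html~", "logo.gif", "photo.jpeg") {
          print "$_: ", (/^\.|~$|\.(gif|jpe?g)$/ ? "ignored" : "kept"), "\n";
        }
        ## only index.html is kept; the rest are ignored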

Lines 19 and 20 define how many of the newest URLs to retain. For my example, I've just got the Dave-Letterman-style ``top ten''.

Lines 21 to 25 define a TRANSFORM subroutine. This subroutine is a little tricky, and is needed to translate the UNIX file paths into valid URLs. I'm showing the transformation for Teleport's web server, in which the file /home/merlyn/public_html/fred.html is visible via the URL http://www.teleport.com/~merlyn/fred.html. The filename will be given to TRANSFORM as the first parameter. The return value must reflect its proper URL form.

Note that I'm returning just something like /~merlyn/fred.html here, because I'm expecting to use the output of this script as an include file already within the right server and protocol.
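
If your server instead maps a DocumentRoot directly onto the top of the URL space (the /var/www/htdocs below is hypothetical), the same idea looks like this:

        sub TRANSFORM ($;) {            # turn path into URL
          local($_) = @_;
          s!^/var/www/htdocs!!;         # hypothetical DocumentRoot
          $_;
        }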

Lines 26 and 27 define a string that will precede the output list. I'm creating an unnumbered list here, so I need <UL>.

Lines 28 and 29 similarly define a string that will precede each output list element. I need <LI> for this.

Lines 30 through 33 similarly define the ending item and list strings.

If you are adapting this to your website, you should not need to make any changes below line 34.

Line 36 declares a hash (associative array) called %when. The keys to this array will be full UNIX pathnames, and the corresponding values will be their ``mtime'' values in UNIX internal time. This reflects the most recent modification time as a nice, easily sortable number.

Lines 38 to 44 invoke the ``find'' routine, imported when we said ``use File::Find''. The first parameter has to be a subroutine reference. This subroutine will be called for each directory and file as the find routine wanders the hierarchy. Here, I'm using an anonymous subroutine directly as the first parameter, described in a moment.

The remaining arguments to the find routine declare the top-level directories that will be used as the initial starting points. That's gotta be @TOP for us, passed in line 44.

The anonymous subroutine body is defined in lines 39 through 43. Lines 39 and 40 handle the pruning of unwanted directories. We pass the basename of the current candidate ($_) and the full pathname ($File::Find::name) as two parameters to PRUNE (defined above). If the subroutine returns true, then not only is this guy history, but so are all of his kids and grandkids. We indicate that by setting the variable $File::Find::prune to 1, and returning. (This interface is described in reasonable detail in the documentation for File::Find.)

Line 41 ensures that we consider only plain files for the rest of the subroutine; if it's not a file, we return quickly.

Line 42 sees if the file is a potential candidate, by calling the IGNORE routine (defined above). Again, the subroutine gets the basename and full path name as its two arguments. If the return value is true, we bail out of this subroutine, thus ignoring the file.

If we made it through all of those hoops, then it's a file worth checking the timestamp for. Wow. Line 43 records the timestamp (the tenth element of the list returned by stat, at index 9) into the %when hash, using a key of the full pathname.
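
If you haven't poked at stat much, it returns a thirteen-element list, and element nine (counting from zero) is the modification time in seconds since the epoch:

        my $mtime = (stat "somefile.html")[9];  # hypothetical filename; mtime as epoch seconds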

When the invocation of find has completed, we will have gathered the timestamps of all ``interesting'' files. The next step is to sort these files by their timestamps, and save only the most recently changed.

Line 46 sorts the keys of the %when hash using a sort block, which compares the corresponding values numerically. Note that $b and $a are reversed from their traditional order, which will cause this resulting list to have those elements with the largest %when values first, indicating that those are the most recently updated.
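
Here's the same trick on a toy hash, just to show the effect of swapping $a and $b:

        my %age = (larry => 42, moe => 37, curly => 51);
        my @oldest_first = sort { $age{$b} <=> $age{$a} } keys %age;
        ## @oldest_first is now ("curly", "larry", "moe")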

Line 47 tosses everything from the @names array after the $HOWMANY element. If there are not enough elements, all of them will be retained.
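
A splice with just an array and an offset discards everything from that offset onward. A quick made-up example:

        my @words = qw(fee fie foe fum);
        splice(@words, 3);              # @words is now qw(fee fie foe)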

Line 49 declares $name as a lexical variable, so that the foreach loop starting in line 51 (which merely localizes $name rather than declaring it) will pass muster under strict. (Sometimes strict seems to be a bit too strict. :-)

Line 50 displays the list prefix string on standard output.

Lines 51 through 58 are executed for each element of the @names array, putting each name into $name.

Line 52 creates a URL string from the full UNIX path. First, the path is passed to the TRANSFORM routine (defined above). Next the result of that routine is passed to the encode_entities routine, imported from HTML::Entities above. This routine escapes > as &gt; and so on, so that they read properly in HTML. Without this, some of the characters would break the HTML output.

Lines 53 through 57 display a particular item from the list. First, the prefix is printed, followed by the beginning of an HTML ``A'' tag, with an HREF attribute of the URL. Next this is followed by the URL again (to make it visible), and then the timestamp. Note that the timestamp is converted from UNIX internal time into human-readable time using the ``localtime'' operator. Lastly, this is followed by the item suffix.
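
In a scalar context, localtime hands back a human-readable string rather than the usual list of time parts. For instance:

        print scalar localtime(time), "\n";
        ## prints something like: Thu Aug  1 14:23:07 1996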

Line 59 finishes up the output list.

I run this script nightly from cron, sending its output to a file ``whatsnew.txt'' in my top-level directory. I then include the output directly into my homepage using a Server-Side Include directive:

        <!--#include file="whatsnew.txt" -->
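
The cron side, for the record, is just a one-line crontab entry along these lines (the script location here is hypothetical; the output path matches my setup):

        # nightly at 2:07am, rebuild the what's-new include file
        7 2 * * * /home/merlyn/bin/whatsnew > /home/merlyn/public_html/whatsnew.txt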

To see the results, look at my homepage, near the top:

        http://www.teleport.com/~merlyn/

If you want to have the script generate an entire HTML page (say, perhaps you can't use Server-Side Includes), then you can adjust $LIST_PRE and $LIST_POST to be the complete HTML header and footer.
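
For example (an untested sketch), replacing the two list strings in the configuration section with something like this would produce a standalone page:

        my ($LIST_PRE) =                # prefix list
          "<HTML><HEAD><TITLE>What's New</TITLE></HEAD><BODY>\n<H1>What's New</H1>\n<UL>\n";
        my ($LIST_POST) =               # suffix list
          "</UL>\n</BODY></HTML>\n";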

I hope you've enjoyed finding out ``what's new''. I'm toying around with web-crawling scripts for the next column, so unless I get distracted by something niftier, that's what it'll be about.

Listing 1

        =1=     #!/usr/bin/perl
        =2=     
        =3=     use strict;
        =4=     use File::Find;
        =5=     use lib "/home/merlyn/CPAN/lib";
        =6=     use HTML::Entities;
        =7=     
        =8=     ## begin config
        =9=     my (@TOP) =                     # top level directories
        =10=      qw(/home/merlyn/public_html);
        =11=    sub PRUNE {                     # don't search these dirs
        =12=      ## $_[0] is basename, $_[1] is full path
        =13=      $_[0] =~ /private/;
        =14=    }
        =15=    sub IGNORE {                    # don't notice these files
        =16=      ## $_[0] is basename, $_[1] is full path
        =17=      $_[0] =~ /^\.|~$|\.(gif|jpe?g)$/;
        =18=    }
        =19=    my ($HOWMANY) =                 # keep this many new files
        =20=      10;
        =21=    sub TRANSFORM ($;) {            # turn path into URL
        =22=      local($_) = @_;
        =23=      s!/home/merlyn/public_html/!/~merlyn/!;
        =24=      $_;
        =25=    }
        =26=    my ($LIST_PRE) =                # prefix list
        =27=      "<UL>\n";
        =28=    my ($ITEM_PRE) =                # prefix item
        =29=      "<LI>";
        =30=    my ($ITEM_POST) =               # suffix item
        =31=      "\n";
        =32=    my ($LIST_POST) =               # suffix list
        =33=      "</UL>\n";
        =34=    ## end config
        =35=    
        =36=    my (%when);                     # record of stamps
        =37=    
        =38=    find (sub {
        =39=            return $File::Find::prune = 1
        =40=              if PRUNE $_, $File::Find::name;
        =41=            return unless -f;       # only files
        =42=            return if IGNORE $_, $File::Find::name;
        =43=            $when{$File::Find::name} = (stat _)[9];
        =44=          }, @TOP);
        =45=    
        =46=    my @names = sort { $when{$b} <=> $when{$a} } keys %when;
        =47=    splice(@names, $HOWMANY);       # discard older stuff
        =48=    
        =49=    my $name;                       # shuddup $name
        =50=    print $LIST_PRE;
        =51=    for $name (@names) {
        =52=      my $url = encode_entities TRANSFORM $name;
        =53=      print
        =54=        $ITEM_PRE,
        =55=        "<A HREF=\"$url\">$url</a> on ",
        =56=        scalar localtime $when{$name},
        =57=        $ITEM_POST;
        =58=    }
        =59=    print $LIST_POST;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.