Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.



Web Techniques Column 2 (May 1996)

The Web provides a nice, comfortable interface to find out about stuff. Sometimes, as the publisher of information, you have more ``stuff'' than will fit easily on a single web page. If you are dealing with static HTML pages, you simply break the ``stuff'' into pieces, and create a master index with links to the sub-pieces.

However, sometimes, the ``stuff'' that won't fit easily on one page is the result of a query. This is a problem, particularly when the query is the result of a user-initiated form, like searching through a database. In this case, as the CGI program author, you have to decide how to present the information in a way that doesn't flood the user.

One common approach, taken by such search engines as the way-cool Altavista and Dejanews searchers, is to present a small subset of the data (like the first 30 items), and then have a button or link to get you ``more'' of the search hits.

But, how is this done, in a stateless protocol like HTTP? One answer is to consider the use of ``sessions''.

A session is a logical collection of HTTP transactions. For each session, a unique ``session ID'' is created. For example, a session might be a particular query, or a particular set of visits to a web site to make a purchase (the ``shopping cart'' problem).

But, the session ID must somehow be communicated to each independent CGI program invocation. Some have suggested using the IP address of the browser client as the session ID. However, this fails in the face of ISPs where hundreds of users share the same IP address, and large commercial pay-for-play ISPs, where each query can actually be proxied through different host addresses!

So, a session ID has to be somehow communicated to the specific browser client, and then passed back. One possible solution has been proffered by the highly-nonstandard Netscape client, in the form of ``cookies''. Because this solution is valid only with the nonstandard Netscape, let's look at a more portable solution.

Another way of transferring the session ID is to include it as a parameter that gets passed to and from the browser with each communication.

To illustrate this method, I've constructed a toy program to perform a simple regular-expression search on the contents of the /usr/dict/words file. Depending on the regular expression, we could get anywhere from zero hits to a hit on every single line of the file, so it's a good quick test. However, the toy program will display only 25 hits at a time, and provide ``previous'' and ``next'' links to page through the resulting data.

So, let's take a look at the whole program in Listing 1 [below].

Line 2 turns on ``use strict'', forcing me to declare my variables.

Line 4 turns off the buffering of STDOUT, handy when I'm debugging, but not really needed for this program.

Line 5 declares a file-scoped variable called $MAXHITS, which I'm using as a constant later in the program. I've given it the value of 25, indicating that I don't want any more than 25 hits on a particular return page.

Line 6 declares $TMP, the prefix string for session information files.

As a security note, my script doesn't validate the session ID, so it's important that $TMP end in the ``middle'' of a filename, rather than at a directory name. This way, there's no possible input that someone could give that could access arbitrary files.
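One way to tighten this further is to reject any session ID that doesn't look like one the script could have generated. The following is a minimal sketch, not part of Listing 1: the valid_session_id name is hypothetical, and it assumes the 12-character hex format that line 35 produces.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Listing 1): accept only IDs shaped like
# the 12 lowercase hex characters that line 35 generates.
sub valid_session_id {
    my ($id) = @_;
    return (defined($id) && $id =~ /\A[0-9a-f]{12}\z/) ? 1 : 0;
}

for my $candidate ("31a5f0c21b3e", "../../etc/passwd", "31a5f0c21b3e|") {
    printf "%-20s %s\n", $candidate,
        valid_session_id($candidate) ? "ok" : "rejected";
}
```

A check like this could run before &load_session is called, with a rejected ID treated the same as no session at all.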

Line 8 extends the @INC searchpath so that line 9 can find the LWP library in my home directory. (My Internet Service Provider doesn't have LWP installed in the standard Perl search-path.) Obviously, this path won't work for you, so you should adjust as necessary.

And line 9 pulls in the ``encode'' routine from LWP so that I can translate a string into its HTML representation (& becomes &amp;, and so on).

Line 11 pulls in the CGI library, handling the basic CGI interface.

Lines 13 through 15 define three global variables, described later.

Line 17 uses the routines of the CGI library to interpret the inputs to the CGI program, as well as setting up a number of variables automatically. The variable $query is a ``CGI object'', which can be used to access these variables and initiate further actions.

Line 19 uses the CGI object to display a valid HTTP header at the beginning of the output stream. It's important to make sure that the output begins with a valid header, or the response will be rejected by the HTTP server daemon with an ``invalid header'' error.

Lines 20-23 begin the HTML with an appropriate title heading and author link.

Line 24 displays the first-level heading that appears in the browser window.

Lines 25 to 42 determine which of three possible ``modes'' the script has been invoked in, corresponding to the three states that the session might be in:

(1) gathering the initial search string

(2) processing the search string to generate the list of words, and start a browsing session

(3) continuing to display more words for a given session.

The difference between mode 3 and mode 2 is the presence of a ``session id'' parameter. The difference between mode 2 and mode 1 is the presence of a search string. We test them in reverse order just because it happened to be more convenient that way, but I'll describe them in forward order.
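The three-way decision can be sketched on its own, separate from the CGI machinery. This is only an illustration: mode_for is a hypothetical name, and it takes a plain hash of parameters rather than a CGI object.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the mode from a hash of CGI parameters,
# testing in the same reverse order as lines 25 to 42.
sub mode_for {
    my (%param) = @_;
    return 3 if $param{session};   # continuing a session
    return 2 if $param{search};    # new search string
    return 1;                      # nothing yet: show the form
}

print mode_for(), "\n";                                        # 1
print mode_for(search => 'ing$'), "\n";                        # 2
print mode_for(session => '31a5f0c21b3e', start => 25), "\n";  # 3
```

Like the listing, this tests for truth rather than definedness, so a search string of ``0'' would be treated as no search at all.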

So, let's skip down to lines 39 to 41, corresponding to mode 1 above. In this case, we need to display the basic search form, and then exit. The search form is defined in the &search_form subroutine, so let's look at that, starting in line 98.

Lines 99 to 107 print the contents of a ``here document'', the body of which is in lines 100 to 106. The document is variable interpolated, meaning that $a is replaced with its current value. In fact, I'm making use of that to invoke certain CGI operations within the here document.

Line 101, in fact, is a variable interpolation, using a trick I stumbled across. @{FRED} in a double-quoted string interpolates FRED interpreted as an array reference. Here, the FRED is actually [BARNEY], which is the value of BARNEY (evaluated in a list context) turned into an anonymous array reference. So, @{[WILMA]} evaluates WILMA in a list context, and brings it into a string. Very handy. Here, the value I'm trying to get into the string is the output of $query->startform, creating the beginning of a form that invokes this very same script (named by the $query->script_name parameter).
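The trick is easier to see outside of a form. Here is a small standalone sketch; the greeting subroutine and variable names are made up for the example.

```perl
use strict;
use warnings;

sub greeting { return "hello", "world" }

# Any expression inside @{[ ... ]} is evaluated in list context, and the
# resulting elements are joined with a space (the default value of $").
my $sum   = "2 + 3 = @{[ 2 + 3 ]}";
my $words = "greeting: @{[ greeting() ]}";
my $upper = "shout: @{[ map { uc $_ } greeting() ]}";

print "$sum\n";     # 2 + 3 = 5
print "$words\n";   # greeting: hello world
print "$upper\n";   # shout: HELLO WORLD
```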

The form will be submitted using the ``POST'' method. By changing ``POST'' to ``GET'', a user could ``bookmark'' the next screen, and come back to the same search later, because the parameters of a ``GET'' request are encoded in the URL that fetches the response. This might be handy, and is worth thinking about when you are designing search engines.

Line 103 similarly uses the @{[WILMA]} trick to get a $query->textfield into the form -- here creating a field named ``search'' to hold the user's selected search string.

Line 104 generates a submit button, titled ``Search''.

Line 105 inserts a $query->endform, to delimit the form started with $query->startform.

Lines 100, 102, and 106 pretty much get output as-is, so they're basically straight HTML.

So, in mode 1, this form is tossed up on the screen. The user then fills in the field (named ``search''), and pushes the one button (labelled for the user's benefit as ``search''). This then reinvokes the same script, but now we have a string parameter named ``search''.

This parameter will be noticed on the second invocation of the script at line 29. The search parameter ends up in $search, and causes the lines from 30 to 37 to be executed.

Lines 32 to 34 create a @found array, containing all the words from /usr/dict/words that match the regular expression contained in $search.

Line 35 creates a ``session id''. A session ID should depend on both the time of day and the process-ID number of the Perl process, which makes a collision on a single web server very unlikely (two IDs can clash only if the same process ID comes around again within the same second). Also, since this session ID is being used as part of a filename, the string should be both short (to keep the filenames from exceeding the max-length parameter) and simple (not having odd characters, to keep the sysadm from yelling at you for odd filenames in /tmp). I've chosen to use a 12-character hex value. The first 8 characters are the 4-byte time-of-day value, and the last 4 characters are the 2-byte process ID number. If I wanted to get even tighter, I could have encoded it as base-64 instead of hex, or pulled even stranger things. But this works, and it's simple.
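To see how the 12 characters break down, here is the same pack/unpack pair wrapped in two hypothetical functions (make_session_id and split_session_id are names invented for this sketch). The explicit mask on the process ID is an added assumption, there to make it visible that only 16 bits of it fit.

```perl
use strict;
use warnings;

# Same packing as line 35, wrapped in a hypothetical function so the
# inputs can be supplied explicitly.
sub make_session_id {
    my ($time, $pid) = @_;
    # N = 32-bit big-endian, n = 16-bit big-endian, H* = hex digits.
    # The mask is an assumption: only the low 16 bits of the PID fit.
    return unpack("H*", pack("Nn", $time, $pid & 0xFFFF));
}

# Reverse the encoding to recover the two fields.
sub split_session_id {
    my ($id) = @_;
    return unpack("Nn", pack("H*", $id));
}

my $id = make_session_id(time, $$);
my ($when, $pid) = split_session_id($id);
print "id=$id time=$when pid=$pid\n";
```

For instance, a time of 0x12345678 and a process ID of 0x9abc come out as 123456789abc.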

If the query was something that I wanted to keep secret, this session ID would not be long enough. I'd want to throw in another 40-50 bits of random garbage so that it'd be very hard to somehow hijack someone else's session. Again, the method of determining the session or even its encoding is not really important, as long as it includes time, process-ID, and perhaps some random junk if you want hijacking to be hard.
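As a rough sketch of what ``another 40-50 bits of random garbage'' might look like, the following appends 48 bits from rand() to the ID from line 35. The function name is hypothetical, and rand() is assumed to be good enough only for a toy: it is not a cryptographically strong source.

```perl
use strict;
use warnings;

# Hypothetical variant: the 12 hex characters from line 35 plus 48 bits
# drawn from rand(). Assumption: rand() is acceptable for a toy program;
# it is not a cryptographically strong source.
sub make_guarded_session_id {
    my $base = unpack("H*", pack("Nn", time, $$ & 0xFFFF));
    my $junk = join "", map { sprintf "%04x", int rand 0x10000 } 1 .. 3;
    return $base . $junk;
}

print make_guarded_session_id(), "\n";
```

The result is 24 hex characters, still short and plain enough to use as a filename suffix.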

Once I have a session ID, I need to save the results of the current session into the session file, so that later invocations can access the same data. Line 36 invokes &save_session to do this. &save_session is defined later in lines 53 to 58, so let's see what it does.

Lines 54 to 55 create the ``session file'', created by concatenating the $TMP prefix and $session.

Line 56 dumps the current session. I've chosen to store the session as a series of newline-terminated lines. The first line is the search string, and the remaining lines are the search results. Of course, this will break if any element in this data contains newlines, so I had to think about what would be in $search and @found before I came up with this layout. For this particular query, this works fine.

Now, back up to the invoker of &save_session, in line 36.

Line 37 invokes the &display routine to show the first page of up to $MAXHITS results. The parameter is 0, causing the &display routine to start at the first chunk of data. Let's look at &display, defined later in lines 60 to 96.

Line 61 shifts the @_ array, which holds the parameters to the routine. The first parameter is the starting place (0 when invoked from line 37), saved into the local variable $start.

Line 63 prints a message to the user to label the resulting search hits. The search string is HTML-encoded so that stuff like < and > and & don't mess up the browser.

Lines 64 to 69 determine valid lower and upper bounds within @found, based on the values of $start and $MAXHITS. Note that I do some sanity checking -- a user may have constructed an artificial URL like blahblah?start=10000000, and it's best to trap that as out-of-range, rather than produce some bizarre result. Remember not to trust the data the user throws at you -- especially data that you believe you are generating, but that the user can just as easily generate.
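The clamping logic can be pulled out and exercised by itself. This is a sketch under a few assumptions that go beyond Listing 1: the page_bounds name is hypothetical, non-numeric input is treated as 0, and an empty result set returns an empty list.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring lines 64 to 69: clamp a requested start
# offset to a valid (low, high) index pair. Returns an empty list when
# there are no hits (a case Listing 1 does not treat specially).
sub page_bounds {
    my ($start, $count, $maxhits) = @_;
    return if $count <= 0;
    my $last = $count - 1;
    # Treat missing or non-numeric input as 0 (an added assumption).
    my $low = (defined($start) && $start =~ /\A\d+\z/) ? $start : 0;
    $low = $last if $low > $last;
    my $high = $low + $maxhits - 1;
    $high = $last if $high > $last;
    return ($low, $high);
}

print join("..", page_bounds(0,        60, 25)), "\n";   # 0..24
print join("..", page_bounds(50,       60, 25)), "\n";   # 50..59
print join("..", page_bounds(10000000, 60, 25)), "\n";   # 59..59
print join("..", page_bounds("junk",   60, 25)), "\n";   # 0..24
```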

Lines 70 to 77 print the applicable portion of the query results. Note the use of ``map'' and ``encode'' in line 75 to show the results in a way that the browser will display the original strings, rather than trying to interpret & and < and >.
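If the LWP library isn't available, the effect of the encoding step can be approximated by hand. The escape_html routine below is a hypothetical stand-in, not the library's routine, and it handles only the few characters that matter to an HTML parser.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the ``encode'' routine of line 9: escape
# only the characters that are significant to an HTML parser.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or later escapes get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my @found = ("AT&T", "<blink>", "plain");
print map { escape_html($_) . "\n" } @found;
```

Run as is, this prints AT&amp;T, &lt;blink&gt;, and plain, one per line.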

Lines 78 to 86 generate a selectable link to the ``next'' hits, but only if there are more hits after the current chunk. This link invokes the same URL as the current script, but passes it the current session ID and a ``start'' parameter that is $MAXHITS higher than the current start value. These two parameters will enable the next invoked script to pick up where this one leaves off (described shortly).

Lines 87 to 95 define a ``previous'' link. On the first invocation, this link won't be shown, but on second and subsequent chunks, the previous link reinvokes the same script, passing it the current session ID, but a ``start'' parameter that is $MAXHITS lower, causing the program to start with the previous chunk.
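The arithmetic behind the two links can be sketched as a function that returns the URLs instead of printing them. The page_links name and the /cgi-bin/more script path are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ``next'' and ``previous'' URLs for a
# page, or undef where lines 78 and 87 would print no link.
sub page_links {
    my ($script, $session, $low, $high, $last, $maxhits) = @_;
    my $base = "$script?session=$session&start=";
    my $next = $high < $last ? $base . ($low + $maxhits) : undef;
    my $prev = $low > 0      ? $base . ($low - $maxhits) : undef;
    return ($next, $prev);
}

# /cgi-bin/more is a made-up script path for illustration.
my ($next, $prev) = page_links("/cgi-bin/more", "31a5f0c21b3e", 25, 49, 59, 25);
print "next: $next\n";   # next: /cgi-bin/more?session=31a5f0c21b3e&start=50
print "prev: $prev\n";   # prev: /cgi-bin/more?session=31a5f0c21b3e&start=0
```

On the first page (a low of 0) the second value comes back undefined, which corresponds to the ``previous'' link not being shown.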

So, those ``next'' and ``previous'' links are the ones that put the user into ``mode 3''. When either of those links is followed, line 25 notices the presence of a ``session'' parameter (stuffed into $session), and executes the statements in lines 26 through 28.

Line 27 loads up a previous session file by invoking &load_session, defined in lines 46 through 51. Let's see what that does.

Lines 47 and 48 open the session file. Note that this begins with ``<'' to ensure that even a user-constructed $session ending in ``|'' won't trigger a program-as-filehandle open. Always think security! And never trust your input data!

Line 49 reads the lines of the session file into $search and @found, and chops each one to remove the newline separator. Note that this file format has to agree with whatever &save_session has written. However, I could change this format by altering these two subroutines (in a coordinated way), and the rest of the program wouldn't care.
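The save and load halves can be tried together as a round trip. This sketch is a variation, not the code from Listing 1: the subroutine names are hypothetical, the data is passed as arguments rather than through globals, and it assumes a Perl recent enough (5.6 or later) to have lexical filehandles and three-argument open, which keeps the open mode separate from the filename.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical rewrite of &save_session: first line is the search
# string, remaining lines are the hits.
sub save_session_file {
    my ($file, $search, @found) = @_;
    open my $out, ">", $file or die "Cannot create $file: $!";
    print {$out} map "$_\n", $search, @found;
    close $out or die "Cannot close $file: $!";
}

# Hypothetical rewrite of &load_session: returns the search string
# followed by the hits, with the newlines removed.
sub load_session_file {
    my ($file) = @_;
    open my $in, "<", $file or die "missing session file $file: $!";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

my $dir  = tempdir(CLEANUP => 1);          # scratch directory for the demo
my $file = "$dir/more.31a5f0c21b3e";
save_session_file($file, 'ing$', qw(bring king sing));
my ($search, @found) = load_session_file($file);
print "search=$search found=@found\n";   # search=ing$ found=bring king sing
```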

Now that the $search and @found data have been restored from the session file, we go back up to line 28, which calls &display. Note, however, that we are passing &display a different starting position (not 0, as before). This starting position comes directly from the ``start'' parameter, causing the &display routine to probably display something besides the first $MAXHITS entries.

The only odd thing about this program is that it litters the /tmp directory with a new file for each search session. On my ISP's web server, /tmp is scrubbed daily for files that haven't been accessed in a day, so this isn't a problem. However, if you got really worried, you could launch an ``at'' job from &save_session that causes the session file to be rm'ed in say, 2 hours, or perhaps run a daemon that scrubs /tmp more often. It's up to you.
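A scrubbing pass of that kind might look something like the following. It is only a sketch: purge_stale_sessions is a hypothetical name, it goes by modification time rather than access time, and the demo runs in a scratch directory instead of the real /tmp.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Hypothetical cleanup routine: delete session files under the given
# prefix whose last-modified time is older than $max_age seconds.
# Returns the number of files removed.
sub purge_stale_sessions {
    my ($prefix, $max_age) = @_;
    my $cutoff  = time - $max_age;
    my $removed = 0;
    for my $file (bsd_glob("$prefix*")) {
        my $mtime = (stat $file)[9];
        next unless defined $mtime && $mtime < $cutoff;
        $removed++ if unlink $file;
    }
    return $removed;
}

# Demo in a scratch directory rather than the real /tmp.
my $dir = tempdir(CLEANUP => 1);
for my $name ("more.old", "more.new") {
    open my $fh, ">", "$dir/$name" or die "Cannot create $dir/$name: $!";
    close $fh;
}
my $three_hours_ago = time - 3 * 60 * 60;
utime $three_hours_ago, $three_hours_ago, "$dir/more.old";

my $removed = purge_stale_sessions("$dir/more.", 2 * 60 * 60);
print "removed $removed file(s)\n";   # removed 1 file(s)
```

Pointed at the prefix from line 6, the same routine could be run from cron or from the CGI script itself every so often.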

And there you have it. A little toy program that scans /usr/dict/words, and displays the results in 25-item chunks. What ``more'' could you ask for?

Listing 1

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     
        =4=     $|=1;
        =5=     my $MAXHITS = 25;               # constant: number of hits per page
        =6=     my $TMP = "/tmp/more.";         # constant: location of session files
        =7=     
        =8=     use lib "/home/merlyn/Lwp";
        =9=     use HTML::Entities qw(encode);
        =10=    
        =11=    use CGI;
        =12=    
        =13=    my $session;                    # global: session-ID
        =14=    my $search;                     # global: search string
        =15=    my @found;                      # global: array of valid hits
        =16=    
        =17=    my $query = new CGI;
        =18=    
        =19=    print $query->header;
        =20=    print $query->start_html(
        =21=                             'So you want more???',
        =22=                             'merlyn@stonehenge.com'
        =23=                            );
        =24=    print "<h1>Query the dictionary</h1>\n";
        =25=    if ($session = $query->param('session')) {
        =26=      ## we are in the midst of a session
        =27=      &load_session();              # sets $search, @found
        =28=      &display($query->param('start'));
        =29=    } elsif ($search = $query->param('search')) {
        =30=      ## we are beginning the query
        =31=      ## perform the query, and set up for session if necessary
        =32=      open WORDS,"/usr/dict/words" or die "Cannot open /usr/dict/words: $!";
        =33=      chomp(@found = grep /$search/o, <WORDS>);
        =34=      close WORDS;
        =35=      $session = unpack("H*", pack("Nn", time, $$)); # 12 hex chars
        =36=      &save_session();
        =37=      &display(0);
        =38=    } else {
        =39=      ## we are being invoked initially
        =40=      ## print the basic search form
        =41=      &search_form();
        =42=    }
        =43=    print $query->end_html;
        =44=    exit 0;
        =45=    
        =46=    sub load_session {
        =47=      open TMP, "<$TMP$session"
        =48=        or die "missing session file $TMP$session: $!";
        =49=      chop(($search, @found) = <TMP>);
        =50=      close TMP;
        =51=    }
        =52=    
        =53=    sub save_session {
        =54=      open TMP,">$TMP$session"
        =55=        or die "Cannot create $TMP$session: $!";
        =56=      print TMP map "$_\n", $search, @found;
        =57=      close TMP;
        =58=    }
        =59=    
        =60=    sub display {
        =61=      my $start = shift;            # where to start (undef/0 if beginning)
        =62=    
        =63=      print "You are searching for: ", encode($search), "\n";
        =64=      my $low = $start;
        =65=      ## sanity checking... won't happen unless user fakes us out
        =66=      $low = 0 if ($low || 0) <= 0;
        =67=      $low = $#found if $low > $#found;
        =68=      my $high = $low + $MAXHITS - 1;
        =69=      $high = $#found if $high > $#found;
        =70=      print
        =71=        "<br>Hits ", $low + 1,
        =72=        "..", $high + 1,
        =73=        " (of ".@found.") hits:\n",
        =74=        "<pre>\n",
        =75=        (map { encode($_)."\n" } @found[$low..$high]),
        =76=        "</pre>\n",
        =77=        "<hr>\n";
        =78=      if ($high < $#found) {
        =79=        print
        =80=          "<br><A HREF=\"",
        =81=          $query->script_name,
        =82=          "?session=$session&start=",
        =83=          $low + $MAXHITS,
        =84=          "\">",
        =85=          "See next $MAXHITS hits...</A>";
        =86=      }
        =87=      if ($low > 0) {
        =88=        print
        =89=          "<br><A HREF=\"",
        =90=          $query->script_name,
        =91=          "?session=$session&start=",
        =92=          $low - $MAXHITS,
        =93=          "\">",
        =94=          "See previous $MAXHITS hits...</A>";
        =95=      }
        =96=    }
        =97=    
        =98=    sub search_form {
        =99=      print <<_FORM_;
        =100=   <HR>
        =101=   @{[$query->startform('POST',$query->script_name)]}
        =102=   <p>Search for:
        =103=   @{[$query->textfield('search')]}
        =104=   <p>@{[$query->submit('Search')]}
        =105=   @{[$query->endform]}
        =106=   <HR>
        =107=   _FORM_
        =108=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
