Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 3 (June 1996)

In last month's column, I talked about how to return just part of a query's results, and how to maintain state so that a browser could keep asking for more and more of the results to be revealed.

One of the problems with that approach is that sometimes, the user wants it all at once! This is especially true if the result is being used as part of a larger document. You don't want 275 hits doled out 25 at a time -- you want all 275 hits.

So, how do we go the other way? In other words, can we write a client program that acts as a browser, but fetches all the results of a query that's being handed to us in bits and pieces, building up a large comprehensive document? (Well, if we couldn't, this'd be a pretty short column, so keep reading.)

As an exercise, I took the wonderful Altavista search engine (http://www.altavista.digital.com/) out for a spin. When I do a web query, I get the answers back 10 at a time. So, my goal was to roll the information up into a single text response.

The easiest way to talk to Altavista is with the wonderful LWP library, found in the Comprehensive Perl Archive Network (CPAN). If you don't have LWP, you can fetch it from the nearest CPAN archive via the nifty keen CPAN Multiplex Dispatcher at the URL http://www.perl.com/CPAN/modules/bymodule/LWP/libwww-perl-5b10.tar.gz.

I must apologize... in the last two columns, I referred to LWP as being authored by fellow columnist Lincoln Stein. Lincoln did indeed create the handy CGI modules, but LWP was authored by Martijn Koster and Gisle Aas. Sorry guys.

By having a program act as a web client, I can perform multiple ``GETs'' to ask for all the data, and then glue it together. Although I used Altavista here, you could easily adapt the technique to any search engine that has one of those durned ``more'' buttons.

The program is given in Listing 1 [below]. I call it ``alta-search'', but you could call it ``fred'' for all I care.

Like all good programs, it starts with ``use strict'' (in line 2). This turns on required variable declaration (preventing stray variable names), disables soft references (preventing text-becomes-symbol-name mistakes), and disables ``poetry mode'' (preventing unadorned barewords that get quietly quoted in this version of Perl, but turn into subroutine invocations in the next).
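
To make those three restrictions concrete, here's a tiny standalone sketch of my own (not part of the listing) showing the kind of mistake each one catches:

        use strict;
        my $count = 10;       # declared with my(), so strict is happy
        print "$count\n";     # fine: prints 10
        # print $cuont;       # a typo like this becomes a compile-time
        #                     # error instead of a silent empty string
        # print ${"count"};   # a soft (symbolic) reference dies at runtime
        # $x = bareword;      # and an unquoted bareword won't even compile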

Line 3 turns on unbuffered STDOUT. This was handy during testing to see what had been seen so far, especially for long queries. Later on, I could probably rip this out, as it will really only be used to generate entire text files.

Line 4 defines a version number for this program. I use this later to construct an ``agent'' string, identifying the program to the Altavista people.

Line 6 points the remaining ``use'' directives at my personal copy of the LWP, because teleport.com isn't keeping up with the latest versions.

Line 7 pulls in the LWP::UserAgent module, defining the top-level interface for Web transactions. It probably pulls in a number of other modules as well, but that really doesn't matter to me at this level.

Line 8 pulls in the URI::Escape.pm module (found in the LWP package). I use this to turn the query string into part of a URL.

Line 10 takes the first command-line argument (the result of the shift), and creates something that would be suitable for part of a URL for the query. For example, spaces become %20. The result is stored in $query, used later.
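
For instance (my own example, not from the listing):

        use URI::Escape;
        my $query = uri_escape("camel & llama");
        print "$query\n";     # prints "camel%20%26%20llama"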

Line 12 defines a constant $QBASE, which gives the base URL for the query string. I append stuff to this string to get the actual query.

Lines 14 through 16 set up a ``user agent''. This is the object that represents ``the web'' to me. By proper method calls against this user agent, I'll be performing various HTTP transactions.

Line 14 creates the user agent.

Line 15 sets the user-agent's name to ``alta-search/'' followed by the version number (defined above). The Altavista people can then look through their web server's agent log and figure out how many times they've been hit by this script. (If we wanted to be really tricky, we could confuse them by setting it to ``Mozilla/3.0 (Sony Playstation)''.)

Line 16 causes the user agent to examine the standard Unix environment variables that define ``proxy servers'', often used in a corporate or educational environment where access to the full Internet is limited. If this isn't needed in your area, it'll be a no-op.
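
For the curious, here's roughly what that looks like in practice (a sketch; the proxy host name is made up):

        use LWP::UserAgent;
        my $ua = new LWP::UserAgent;
        # env_proxy looks for environment variables such as http_proxy
        # and no_proxy, normally set in the shell; faked here:
        $ENV{'http_proxy'} = 'http://proxy.example.com:8080/';
        $ua->env_proxy;       # subsequent requests go through the proxy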

Lines 18 to 28 form the heart of the algorithm.

Line 18 sets a new variable $id to 0. This variable will be used and incremented repeatedly to get each chunk of output from Altavista.

Line 19 labels the output with an HTML <pre> tag, which causes embedded whitespace and newlines to retain their shape. The original query returned stuff that wanted to be inside <pre>, so I kept it that way.

Lines 20 and 26 form a ``naked block'', which will serve as a loop (thanks to the ``redo'' in line 25, which I'll describe in a minute).
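
If the naked-block idiom is new to you, here it is in miniature (a standalone sketch, not from the listing):

        my $n = 0;
        {                          # a bare block runs just once...
          print "pass $n\n";
          $n++, redo if $n < 2;    # ...unless redo jumps back to its top
        }                          # prints "pass 0", "pass 1", "pass 2"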

Line 21 calls the &fetch subroutine (defined later) to grab a page. The first (and only) argument is the URL to fetch.

This URL was determined by staring at the initial query form presented by Altavista. Because it is a ``GET''-type query, every useful field ends up as part of the URL. It took me about five tries to get this right, and of course, it'll break when the Altavista people change the fields, but it worked as I was writing this. The various fields are delimited by ``&'', and within each field, there's ``name=value''. (I'll show a fully assembled example after walking through the fields.)

From what I gather, ``what=web'' selects a web search -- I could make this ``what=usenet'' to get a Usenet search instead.

And ``fmt=c'' selects the compact format... there are also ``count'' and ``detailed'' formats. The nice thing about the compact format is that it parses well for further processing.

Then there's ``pg=q'', which I never quite figured out, but it's needed.

The next parameter, ``q'', specifies the query string. Normally, this is the value of the ``simple query'' box, but since we're faking the form, we get it from $query, which has already been URI-escaped so that we can just drop it into the string.

Finally, there's ``stq''. This is the index number of the first item to be returned. To get the first 10 hits, I set $id to 0, and to get the next 10 hits, I set $id to 10, and so on (20, 30, 40, etc.). Luckily, if the number is more than the number of hits, I get back a good page but an empty list, so I'll be using that to abort this loop. I found this parameter by asking for something with a lot of hits (it was ``click here'' if you must know), and examining the resulting HTML at the bottom of the page to see how to go to the second, third, and fourth (etc.) page of the output.

Again, this is normally an internal-use field, and Altavista could easily change this and break my program. But I think I got the usage right (it worked for my dozen test cases).
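
Putting the pieces together, fetching the third page of hits for a hypothetical ``llama'' query (so $id is 20) means requesting a URL like this (example values mine):

        http://www.altavista.digital.com/cgi-bin/query?what=web&fmt=c&pg=q&q=llama&stq=20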

Lines 22 and 23 look for the text of the hits within the returned HTML page. After staring at the page for a while, I realized that <pre> and </pre> bracket the data, but unfortunately, they weren't the only <pre> and </pre> on the page, so I had to look for a <pre>-</pre> that contained just lines that began with ``<a href=''. Bingo. Again, if Altavista changes their format next week, I'll have to ``maintain'' this code.
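
For the curious, here's that same substitution from lines 22 and 23 spread out with the /x modifier -- the behavior is equivalent, it's just easier to stare at:

        s{
          ^ [\s\S]* <pre> \n          # everything up through a <pre> line
          (                           # capture just the hits:
            (?: <a[ ]href= .* \n )*   #   a run of "<a href=" lines
          )
          </pre> [\s\S]* $            # everything from </pre> onward
        }{$1}x
          or die "unknown format for $_";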

Provided the format is good, line 24 prints the resulting lines with just the hits.

Line 25 increments the starting ID number, and starts the loop over, as long as there is some non-blank character in the string. If the data is completely blank, I've hit the last entry, and should stop.

Line 27 prints the closing </pre>, balancing the initial <pre> in line 19. If I were using the output as an entire web page, I'd probably need to put some stuff before and after these tags to make it compliant.
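
For example, a standalone version might wrap the loop like this (a sketch, with a made-up title):

        print "<html><head><title>alta-search results</title></head><body>\n";
        print "<pre>\n";
        # ... the query loop of lines 20 through 26 goes here ...
        print "</pre>\n";
        print "</body></html>\n";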

Lines 29 to 36 create a ``fetch'' routine that I've re-used in at least one other little webtool. So even though it's used only once here, I still left it as a subroutine just to separate the ``fetch URL'' code from the loop for clarity.

Line 30 takes the first (and only) argument, and names it $url.

Line 31 creates an HTTP::Request object from the ``GET'' method and the selected URL (in $url). This doesn't actually perform the request, but it caches all the information in one convenient place.

Line 32 actually performs the request, using the global user agent created above. The result of the request goes into a new object stored in $response.

If the request failed, line 34 detects this using the ``is_success'' call against the response, and dies with an appropriate message (line 33).

Otherwise, line 35 returns the text content of the response (the HTML page) as the subroutine's return value.

So there you have it. With a few modifications, we could even make it fetch Usenet searches instead of Web searches, or select just the URLs and fetch those, and so on. Or handle advanced queries, including date ranges. Most of that would make a good topic for a future column, so I'll leave it for later.
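
The Usenet variation, at least, looks like a one-word change to the URL in line 21 (an untested sketch, using the ``what=usenet'' value mentioned earlier):

        $_ = &fetch("$QBASE?what=usenet&fmt=c&pg=q&q=$query&stq=$id");

Let's go surfing!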

Listing 1

        =1=     #!/usr/bin/perl
        =2=     use strict;
        =3=     $| = 1;
        =4=     my $VERSION = "1.0";
        =5=     
        =6=     use lib "/home/merlyn/Lwp";
        =7=     use LWP::UserAgent;
        =8=     use URI::Escape;
        =9=     
        =10=    my $query = uri_escape(shift);
        =11=    
        =12=    my $QBASE = "http://www.altavista.digital.com/cgi-bin/query";
        =13=    
        =14=    my $ua = new LWP::UserAgent;
        =15=    $ua->agent("alta-search/$VERSION");
        =16=    $ua->env_proxy;
        =17=    
        =18=    my $id = 0;
        =19=    print "<pre>\n";
        =20=    {
        =21=      $_ = &fetch("$QBASE?what=web&fmt=c&pg=q&q=$query&stq=$id");
        =22=      die "unknown format for $_"
        =23=        unless s#^[\s\S]*<pre>\n((<a href=.*\n)*)</pre>[\s\S]*$#$1#;
        =24=      print;
        =25=      $id += 10, redo if /\S/;
        =26=    }
        =27=    print "</pre>\n";
        =28=    
        =29=    sub fetch {
        =30=      my $url = shift;
        =31=      my $request = new HTTP::Request('GET', $url);
        =32=      my $response = $ua->request($request);
        =33=      die "$url failed: ",$response->error_as_HTML
        =34=        unless $response->is_success;
        =35=      $response->content;
        =36=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.