Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Web Techniques Column 63 (Jul 2001)

[suggested title: Calculating download time]

It's a simple problem, really. You click on a link to a site. It starts loading. But you're on a slow modem dialup. So it keeps loading, and loading, and loading, the little browser icon in the upper right continuing its animation as if that will distract you enough to not notice that the page still hasn't finished. But still it keeps loading. Until finally, just when you're about to hit ``stop'', it's done.

Why are these pages so big? Why are most pages so unfriendly to slow links? I suspect it's because most pages are being designed on intranets these days, and nobody ever bothers to go home to test them out from their ISP connection. And that's unfortunate.

As a simple test, we could at least write a little program to download the entire page with all of its links, and see how many bytes would be needed to satisfy the browser. Of course, we'd have to go through all of the HTML, looking for embedded images, sounds, objects (like flash) and frame links. Hmm, sounds like a lot of work. Unless you have the right tools, like Perl's wonderful LWP library. So I wrote such a program, and I'm presenting it in [listing one, below].

Line 1 turns on warnings: a good thing while developing. Line 2 enables the common compiler restrictions, forcing us to declare our variables, avoid the use of barewords, and stay away from symbolic references. Line 3 enables autoflushing on STDOUT, so each line of output will be immediately visible as it is printed.

Lines 5 through 8 pull in 4 modules from the LWP library. If you don't have LWP installed, you can use the CPAN module to do this without a lot of hassle as follows:

  $ perl -MCPAN -eshell
  cpan> install Bundle::LWP

You may have to answer some questions if this is your first time installing with CPAN.pm.

The LWP::UserAgent module provides the web client object. HTTP::Cookies handles cookie-based interactions. HTTP::Request::Common creates common web requests. And HTML::LinkExtor parses HTML to find external URL links.

If you didn't know that these were all part of the LWP CPAN bundle, you could have just asked for them explicitly, as in:

  $ perl -MCPAN -eshell
  cpan> install LWP::UserAgent
  cpan> install HTTP::Cookies
  cpan> install HTTP::Request::Common
  cpan> install HTML::LinkExtor

Lines 10 through 28 define the %LINKS global hash, selecting the types of tag attributes that we believe will be loaded by the browser directly. I got this list easily; I grabbed the HTML::Tagset module and copied the value of linkElements out of there into here. Then I decided which of the items was a browser load. Please don't consider this list authoritative; it's just a best guess (and I'd appreciate feedback if I erred).

Note that one candidate is the href attribute of the link tag, which often names a CSS file that the browser loads. However, it just as often names other URLs that the browser doesn't load, and the two cases can be told apart only by looking at other attributes in the tag, which HTML::LinkExtor unfortunately doesn't provide. So I've left that entry commented out (line 22). Maybe a future version of HTML::LinkExtor will address this need.
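One way around this limitation is to look at the rel attribute directly with a lower-level parser. Here's a minimal sketch, assuming HTML::TokeParser (which ships in the same distribution as HTML::LinkExtor) and the URI module. The sample HTML, the base URL, and the stylesheet_links routine are all made up for illustration, and none of this is part of the listing:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Hypothetical sample page and base URL, for illustration only.
my $base = "http://www.example.com/docs/index.html";
my $html = <<'END';
<html><head>
<link rel="stylesheet" type="text/css" href="style/main.css">
<link rel="next" href="page2.html">
</head><body></body></html>
END

# Return absolute URLs of <link> tags whose rel includes "stylesheet".
sub stylesheet_links {
  my ($html, $base) = @_;
  my @found;
  my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
  while (my $tag = $p->get_tag("link")) {
    my $attr = $tag->[1];            # hashref of attribute => value
    next unless defined $attr->{href} and length $attr->{href};
    my @rel = split ' ', lc($attr->{rel} || "");
    next unless grep { $_ eq "stylesheet" } @rel;
    push @found, URI->new_abs($attr->{href}, $base)->as_string;
  }
  return @found;
}

my @css = stylesheet_links($html, $base);
print "$_\n" for @css;
```

With the sample page above, only the first link tag qualifies; the rel="next" link is skipped because a browser wouldn't fetch it just to display the page.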

Also, I'm not doing anything to look within the JavaScript of a page to get dynamic rollovers. That's a hard problem. I'd probably need a full JavaScript interpreter to do that. Ugh.

Lines 30 through 33 create the ``virtual browser'' as the global $ua variable, by first creating a generic object (line 30), giving it the right proxy information from the environment (line 31), setting a distinct user agent (line 32), and then setting up an in-memory ``cookie jar'' (line 33). The cookie jar allows us to visit pages that depend on having visited the referring URL, since any cookies will be properly handled by the virtual browser.

Line 35 calls the main routine report for each URL mentioned on the command line. Line 37 ends the program when we're done (not strictly necessary, since it's just subroutines from here down, but I like to keep my programs maintainable).

Now for the good stuff. The report routine beginning in line 39 is given a single URL, extracted in line 40. This routine pulls down the page, and examines it for all URLs that would also have been loaded in a typical frame- and image-grokking browser. To do this, we'll maintain and populate two top-level data structures: the @todo array in line 42, and the %done hash in line 43.

The @todo array is a list of items that must still be processed, in the form of references to two-element arrays. The first element is the source URL (used to properly set the referer header), and the second element is the URL to fetch. We'll initially load the URL of interest into @todo, with an empty referer string. The %done hash serves double duty: its keys track which URLs we've already done, and its values record the number of bytes for each URL for later analysis.
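To see how these two structures work together without touching the network, here's a small sketch. The %fake_web table and the crawl routine are invented stand-ins for the real fetching code: each URL maps to a byte count followed by the URLs it embeds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the network: URL => [bytes, embedded URLs...]
my %fake_web = (
  "http://www.example.com/" =>
    [5000, "http://www.example.com/logo.gif", "http://www.example.com/frame.html"],
  "http://www.example.com/frame.html" =>
    [800, "http://www.example.com/logo.gif"],   # same image again
  "http://www.example.com/logo.gif" => [1200],
);

sub crawl {
  my $start = shift;
  my @todo = ["", $start];   # [referer, URL] pairs still to be fetched
  my %done;                  # URL => bytes, for URLs already fetched
  my @log;                   # order in which URLs were "fetched"
  while (@todo) {
    my ($refer, $url) = @{shift @todo};
    next if exists $done{$url};              # never fetch a URL twice
    my ($bytes, @embedded) = @{ $fake_web{$url} || [0] };
    $done{$url} = $bytes;
    push @log, "$url (referer '$refer')";
    push @todo, map { [$url, $_] } @embedded;
  }
  return (\%done, \@log);
}

my ($done, $log) = crawl("http://www.example.com/");
print "$_\n" for @$log;
```

The logo is queued twice (once from the top page, once from the frame) but fetched only once, because its key is already in %done the second time around.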

Line 45 begins the ``while there's still more to-do'' loop. Line 46 pulls off the first arrayref from the list, and explodes it into the referer string and the URL to be fetched. Line 47 skips over the ones we've already done.

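Lines 49 and 50 fetch the page, by creating a GET request with the proper referer header, and fetching it with the virtual browser. Note that simple_request does not follow redirects on its own, so a redirect comes back to us as an ordinary response that we can measure. The result is always an HTTP::Response object.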
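If you'd like to check what such a request looks like before sending it anywhere, you can build one and dump it. This is only a sketch with made-up example.com URLs; it uses the flat header => value calling convention that HTTP::Request::Common documents for GET, and never touches the network:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Hypothetical URLs, for illustration only.
my $refer = "http://www.example.com/";
my $url   = "http://www.example.com/gifs/logo.gif";

# Headers follow the URL as a flat list of key/value pairs.
my $request = GET $url, referer => $refer;

# Show the request line and headers that would be sent.
print $request->as_string;
```

Handing this $request to the user agent's simple_request method is what actually performs the fetch.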

If the fetch is successful, line 52 detects that. We'll take the content into $content in line 53, and put its length into the value for %done as well, keyed by the URL we just fetched.

If the content is HTML, the browser would display it and then crawl through it looking for images, sounds, subframes, and other things, so we must do likewise. Line 55 checks for HTML by looking at the MIME type of the content, and skips the link extraction for anything else.

Line 57 pulls out the ``base'' URL from the HTTP::Response object. Normally, this is the same as the original URL. However, if the HTML header contains a ``base'' URL, then a browser would have calculated all relative URLs from that base URL, not the fetching URL, so we must do likewise. Luckily, the base method just does the right thing and gives us what we need, regardless of whether it was specified or not.
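To see why the base URL matters, here's a tiny sketch using the URI module to resolve the same relative link two ways. The URLs are invented for illustration, and $declared plays the role of a ``base'' URL given in the HTML header:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical URLs, for illustration only.
my $fetched  = "http://www.example.com/articles/index.html";
my $declared = "http://img.example.com/static/";
my $relative = "gifs/logo.gif";

# The same relative link, resolved against each candidate base.
my $from_fetched  = URI->new_abs($relative, $fetched)->as_string;
my $from_declared = URI->new_abs($relative, $declared)->as_string;

print "relative to the fetched URL:  $from_fetched\n";
print "relative to the declared base: $from_declared\n";
```

The first comes out as http://www.example.com/articles/gifs/logo.gif and the second as http://img.example.com/static/gifs/logo.gif, so guessing the wrong base would have us measuring the wrong file (or a 404).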

Line 58 sets up the HTML::LinkExtor object, giving it no callback function, but specifying the base URL for relative links. Without a callback function, the data is merely gathered into an internal object variable, which we'll access with a links method later. Lines 59 and 60 hand the HTML content to the object, triggering a complete parse pass.

Lines 61 through 71 pull out the link information, item by item. Each link (in $link) is an arrayref. The arrayref is dereferenced in line 62 to reveal an initial tagname (ending up in $tag) followed by key/value pairs of attributes, which are rolled up into the %attr hash.

Line 63 detects the tagname being an item of interest. The HTML::LinkExtor module finds all possible outbound URLs, but we're interested only in the ones that would be loaded by the browser right away. If the tag is in %LINKS, then we check each of the attributes in the list to see if it's something we saw in the HTML content (line 65) and it's a non-empty value (line 66).
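Here's the same extract-and-filter step in isolation, as a sketch you can run without a network connection. The HTML snippet and base URL are invented, and %LINKS is cut down to three entries to keep things short:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# A cut-down version of the table of "browser loads this" attributes.
my %LINKS = (
  body  => ['background'],
  frame => ['src'],
  img   => ['src', 'lowsrc'],
);

# Hypothetical page content and base URL, for illustration only.
my $base = "http://www.example.com/";
my $html = <<'END';
<body background="bg.gif">
<a href="next.html"><img src="gifs/logo.gif" alt="logo"></a>
</body>
END

my $p = HTML::LinkExtor->new(undef, $base);   # no callback: collect links
$p->parse($html);
$p->eof;

my @wanted;
for my $link ($p->links) {
  my ($tag, %attr) = @$link;          # tag name, then attribute => URL pairs
  next unless $LINKS{$tag};           # skips the <a href=...> link
  for my $name (@{ $LINKS{$tag} }) {
    next unless exists $attr{$name};
    next unless length(my $value = $attr{$name});
    push @wanted, "$tag $name $value";
  }
}
print "$_\n" for @wanted;
```

The body background and the image are kept, already made absolute against the base, while the ordinary hyperlink is ignored because a browser wouldn't fetch it until you clicked.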

If we have a URL of interest, we push that onto the ``to-do'' list, as a two-element item, with the $base URL for a referer. I wasn't sure whether this should be $base or $url, and settled on $base for no solid reason. You could probably write me to convince me I'm wrong on this and I wouldn't take it personally.

Well, that handles the typical HTML page. But we also have some other kinds of responses from a fetch. The other typical response is a ``redirect'', where the remote server indicated that a different URL should be fetched. That's handled in lines 73 to 76. First, we count the length of the content (because the browser would still be fetching all of the response content), then fetch the location header (the new URL), and then queue up a fetch of this new URL. Again, I wasn't sure what referer string should be given to this fetch, so I settled on $url. Again, I could be easily argued out of it (and I bet it's inconsistent amongst browser implementations).
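You can try the redirect branch without a live server by building a response by hand. In this sketch the URLs and the response body are made up, and @todo and %done are just local stand-ins for the real ones in the report routine:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical redirect response, for illustration only.
my $url = "http://www.example.com/old";
my $response = HTTP::Response->new(
  302, "Found",
  [ 'Location' => "http://www.example.com/new", 'Content-Type' => "text/html" ],
  "<html><body>Moved</body></html>",
);

my @todo;
my %done;

if ($response->is_redirect) {
  # The body of the redirect still crossed the wire, so count it.
  $done{$url} = length $response->content;
  # Queue the new location, if the server gave us one.
  if (my $location = $response->header('location')) {
    push @todo, [$url, $location];
  }
}

print "$url cost $done{$url} bytes\n";
print "next up: $todo[0][1] (referer $todo[0][0])\n" if @todo;
```

One thing this sketch doesn't address: if a server hands back a relative Location header, it would need to be resolved against $url before being fetched.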

And if it's not a good fetch or a redirect, it's probably an error, detected in line 77. Line 78 merely dumps this information to the invoker, on the theory that there's little point in trying to recover from such an error, and simple information is all that's required.

When the entire to-do list has been completed, we drop out of the big loop, down to the reporting phase, beginning in line 83. Line 84 creates a local temporary variable, initializing it to 0. This will be the total number of bytes fetched by the browser for this URL. Line 86 shows the URL name for labelling purposes.

Lines 87 to 90 go through all the entries in the %done hash. Elements of the hash are sorted by descending numeric order on the values, so that we get a quick idea of the piggiest part of the page. As each URL and bytesize is pulled up, the total is updated in line 88, and formatted nicely in line 89 using printf.
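The sort-by-value idiom is worth seeing on its own. Here's a sketch using four of the byte counts from my own site's report; the @lines array is added only so that the formatted output can be inspected as well as printed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# URL => bytes, using a few figures from the second sample report.
my %done = (
  "http://www.stonehenge.com/icons/right.gif"          => 172,
  "http://www.stonehenge.com/merlyn/WebTechniques/"    => 13916,
  "http://images.paypal.com/images/x-click-but7.gif"   => 861,
  "http://www.oreilly.com/catalog/covers/lperl2.s.gif" => 5684,
);

my $total = 0;
my @lines;
# Sort the keys by their values, largest first.
for my $url (sort { $done{$b} <=> $done{$a} } keys %done) {
  $total += $done{$url};
  push @lines, sprintf "  %10d  %s", $done{$url}, $url;
}
push @lines, sprintf "  %10d TOTAL", $total;

print "$_\n" for @lines;
```

Swapping $a and $b in the sort block would put the smallest items first instead.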

Lines 91 and 92 do what we came here for. The total downloaded bytes are shown in line 91, and line 92 computes the download time at a conservative 2000 bytes per second on a 28.8 dialup line. (I'm painfully aware of how slow this is as I spend much of my time bouncing around from one hotel to another, very much missing my cable modem at home.)
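The arithmetic is simple enough to pull out into a routine of its own if you want to try other line speeds. In this sketch, seconds_at is a hypothetical helper (not part of the listing), and the 2000 bytes per second default is the same conservative estimate used in line 92:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Estimated download time in whole seconds, given a byte count and an
# assumed sustained rate in bytes per second (default 2000).
sub seconds_at {
  my ($bytes, $rate) = @_;
  $rate ||= 2000;
  return sprintf "%.0f", $bytes / $rate;
}

# The two totals from the sample reports.
my $webtechniques = seconds_at(121133);
my $stonehenge    = seconds_at(29735);

print "www.webtechniques.com: $webtechniques seconds\n";
print "www.stonehenge.com:    $stonehenge seconds\n";
```

Passing a second argument lets you plug in whatever throughput you actually measure on your own connection.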

Let's look at a couple of typical reports, for both www.webtechniques.com and for my own archive of columns (and listings) at www.stonehenge.com/merlyn/WebTechniques/:

    $ dltime http://www.webtechniques.com/ http://www.stonehenge.com/merlyn/WebTechniques/
    http://www.webtechniques.com/ =>
           29945  http://www.webtechniques.com/gifs/covers/0105cov_straight.jpg
           18548  http://www.webtechniques.com/
           10525  http://img.cmpnet.com/ads/graphics/cs/cg/heyyou.gif
            9604  http://img.cmpnet.com/ads/graphics/cs/ar/webreview_125.gif
            9342  http://img.cmpnet.com/ads/graphics/cs/ar/wiwm_120.gif
            7691  http://www.webtechniques.com/gifs/subscribeto.jpg
            7336  http://img.cmpnet.com/ads/graphics/cs/cg/latest.gif
            5950  http://img.cmpnet.com/ads/graphics/cs/ar/develop_120x240.gif
            3577  http://www.webtechniques.com/gifs/wtlogo_right.gif
            2096  http://www.webtechniques.com/gifs/logo_footer_r2_c3.gif
            2056  http://img.cmpnet.com/ads/graphics/cs/ar/tech_reviews_120.gif
            1934  http://www.webtechniques.com/gifs/logo_footer_r2_c4.gif
            1707  http://www.webtechniques.com/gifs/logo_footer_r2_c1.gif
            1463  http://www.webtechniques.com/gifs/wtlogo_left.gif
            1365  http://www.webtechniques.com/gifs/logo_footer_r2_c2.gif
            1094  http://www.webtechniques.com/gifs/logo_footer_r1_c1.gif
             860  http://www.webtechniques.com/gifs/triangle.gif
             419  http://www.webtechniques.com/gifs/toctab.gif
             382  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=middletile&webreview_pos=wthome
             369  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=verticalbanner&webreview_pos=wthome
             363  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=topleftbutton
             346  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=bottombutton
             337  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=middletile&webreview_pos=wthome
             335  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=tile&webreview_pos=wthome
             333  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=top&webreview_pos=wthome
             328  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=verticalbanner&webreview_pos=wthome
             328  http://newads.cmpnet.com/js.ng/Params.richmedia=yes&site=webtechniques&pagepos=bottom
             318  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=topleftbutton
             301  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=bottombutton
             290  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=tile&webreview_pos=wthome
             288  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=top&webreview_pos=wthome
             283  http://newads.cmpnet.com/html.ng/site=webtechniques&pagepos=bottom
             261  http://www.webtechniques.com/gifs/bottom341head.gif
             252  http://www.webtechniques.com/gifs/top341head.gif
             147  http://www.webtechniques.com/gifs/bottom127head.gif
             142  http://www.webtechniques.com/gifs/top127head.gif
              69  http://www.webtechniques.com/gifs/mid4head.gif
              54  http://www.webtechniques.com/gifs/bottom4head.gif
              53  http://www.webtechniques.com/gifs/top4head.gif
              42  http://www.webtechniques.com/gifs/pixel.gif
          121133 TOTAL
              61 seconds at 28.8

    http://www.stonehenge.com/merlyn/WebTechniques/ =>
           13916  http://www.stonehenge.com/merlyn/WebTechniques/
            9102  http://s1-images.amazon.com/images/A/TP10000000000000008.0206.04._ZAGreetings,7,2,15,108,verdenab,8,204,102,0_ZADonate%20to%20my%20Legal%20Defense%20Fund%20today%21,7,14,50,108,times,11,0,0,0_.jpg
            5684  http://www.oreilly.com/catalog/covers/lperl2.s.gif
             861  http://images.paypal.com/images/x-click-but7.gif
             172  http://www.stonehenge.com/icons/right.gif
               0  http://s1.amazon.com/exec/varzea/tipbox/A3QRJ0PB8JM4E4/T18JT4YV6ZQDB2
               0  http://s1.amazon.com/exec/varzea/tipbox/A3QRJ0PB8JM4E4/T18JT4YV6ZQDB2/058-6613347-7642417
           29735 TOTAL
              15 seconds at 28.8

Ahh... look at that. That's 61 seconds of download time for the WebTechniques site, but only 15 seconds for my site. How nice. The big bottleneck on the WebTechniques homepage seems to be the cover JPG. Maybe they can reduce the JPEG Q factor a bit to save some time. Or maybe this is an acceptable tradeoff. But at least now we know how long it'll take me to visit the WebTechniques homepage from a hotel room. Time enough to hum the ``Final Jeopardy'' theme twice while I'm waiting. So until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|=1;
        =4=     
        =5=     use LWP::UserAgent;
        =6=     use HTTP::Cookies;
        =7=     use HTTP::Request::Common;
        =8=     use HTML::LinkExtor;
        =9=     
        =10=    my %LINKS =                     # subset of %HTML::Tagset::linkElements
        =11=    (
        =12=     'applet'  => ['archive', 'codebase', 'code'],
        =13=     'bgsound' => ['src'],
        =14=     'body'    => ['background'],
        =15=     'embed'   => ['src'],
        =16=     'frame'   => ['src'],
        =17=     'iframe'  => ['src'],
        =18=     'ilayer'  => ['background'],
        =19=     'img'     => ['src', 'lowsrc'],
        =20=     'input'   => ['src'],
        =21=     'layer'   => ['background', 'src'],
        =22=     ## 'link'    => ['href'], ## durn, some of these are stylesheets
        =23=     'script'  => ['src'],
        =24=     'table'   => ['background'],
        =25=     'td'      => ['background'],
        =26=     'th'      => ['background'],
        =27=     'tr'      => ['background'],
        =28=    );
        =29=    
        =30=    my $ua = LWP::UserAgent->new;
        =31=    $ua->env_proxy;
        =32=    $ua->agent("dltime/1.00 ".$ua->agent); # identify ourselves
        =33=    $ua->cookie_jar(HTTP::Cookies->new); # capture cookies if needed
        =34=    
        =35=    report($_) for @ARGV;
        =36=    
        =37=    exit 0;
        =38=    
        =39=    sub report {
        =40=      my $start = shift;
        =41=    
        =42=      my @todo = ["", $start];
        =43=      my %done;
        =44=    
        =45=      while (@todo) {
        =46=        my ($refer, $url) = @{shift @todo};
        =47=        next if exists $done{$url};
        =48=    
        =49=        my $request = GET $url, referer => $refer;
        =50=        my $response = $ua->simple_request($request);
        =51=    
        =52=        if ($response->is_success) {
        =53=          $done{$url} = length (my $content = $response->content);
        =54=    
        =55=          next if $response->content_type ne "text/html";
        =56=    
        =57=          my $base = $response->base; # relative URLs measured relative to here
        =58=          my $p = HTML::LinkExtor->new(undef, $base) or die;
        =59=          $p->parse($content);
        =60=          $p->eof;
        =61=          for my $link ($p->links) {
        =62=            my ($tag, %attr) = @$link;
        =63=            if ($LINKS{$tag}) {
        =64=              for (@{$LINKS{$tag}}) {
        =65=                next unless exists $attr{$_};
        =66=                next unless length (my $a = $attr{$_});
        =67=                ## print "$base $tag $_ => $a\n"; ## debug
        =68=                push @todo, [$base, $a];
        =69=              }
        =70=            }
        =71=          }
        =72=          
        =73=        } elsif ($response->is_redirect) {
        =74=          $done{$url} = length $response->content; # this counts
        =75=          my $location = $response->header('location') or next;
        =76=          push @todo, [$url, $location]; # but get this too
        =77=        } elsif ($response->is_error) {
        =78=          print "$url ERROR: ", $response->status_line, "\n";
        =79=        }
        =80=    
        =81=      }                             # end of outer loop
        =82=    
        =83=      {
        =84=        my $total = 0;
        =85=    
        =86=        print "$start =>\n";
        =87=        for my $url (sort { $done{$b} <=> $done{$a} } keys %done) {
        =88=          $total += $done{$url};
        =89=          printf "  %10d  %s\n", $done{$url}, $url;
        =90=        }
        =91=        printf "  %10d TOTAL\n", $total;
        =92=        printf "  %10.0f seconds at 28.8\n\n", $total/2000;
        =93=      }
        =94=    
        =95=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.
