Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, before their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 21 (Feb 2001)
[suggested title: 'Getting some directory assistance']
Most Perl scripts aren't doing anything glamorous. They're the workhorses of your system, moving things around while you aren't necessarily looking, and handling those mundane repetitive tasks.
Those tasks often operate on a series of filenames, perhaps not known in advance, but obtained by looking at the contents of a directory. Perl has a few different primary means of getting lists of names, so let's take a look at them.
The simplest to use and understand is globbing. Globbing is what
the shell does when you use echo *.c
to get a list of all the C
source files in a directory. The term globbing comes from the use of
the old /etc/glob
program in early versions of Unix, with a name
derived from something like ``global expansion''.
Now, most programs running from the shell don't have to know how to do globbing for themselves. For example, the rm command in:
$ rm *.c
never sees the *.c. Instead, the shell expands (globs) the filename pattern, comes up with a list of names, and then hands those names to rm as its arguments. This is why the rm command cannot help you when you've accidentally typed a space between the asterisk and the period: it never sees the asterisk, but rather a list of explicit names, just as if you'd laboriously typed all of them directly.
Similarly, if you invoke your Perl program with a glob pattern on the command line:
$ my_perl_prog *.c
then your Perl program already has the expanded values, and nothing further needs to be done to process the elements of @ARGV.
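For instance, a script that reports the size of whatever the shell handed it needs nothing glob-related at all (a minimal sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The shell has already expanded any *.c on the command line,
# so @ARGV holds ordinary filenames.
for my $file (@ARGV) {
  next unless -f $file;    # skip anything that isn't a plain file
  printf "%-30s %8d bytes\n", $file, -s $file;
}
```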
But sometimes, you don't have the luxury in your Perl program of having the files already passed on the command line. What to do then? Use the glob operator from within Perl!
my @c_source = glob "*.c";
Here, @c_source will be loaded up with all the names in the current directory that don't begin with a dot but do end in .c, just as if I had handed that to the shell for expansion. To get all the C source files and object files, I can use either of:
my @c_source_and_object = glob "*.c *.o";
or
my @c_source_and_object = glob "*.[co]";
Notice that multiple patterns can be specified in one glob by separating them with whitespace, similar to the shell, or we can use a character-class-like entry.
Another way to write the glob
operator is to put angle brackets
around the glob pattern:
my @c_source_and_object = <*.c *.o>;
The value between the angle brackets is interpreted as if it were a double-quoted string, so Perl variables become their current Perl values before the glob is evaluated. This lets us vary the patterns at runtime:
  for my $suffix (qw(.c .o .out)) {
    $files_with{$suffix} = [<*$suffix>];
  }
Here, I'm creating a hash of arrayrefs, so $files_with{".o"} will be an arrayref of all matching files.
Either syntax is fine: the glob
named operator is a fairly recent
invention (and takes five more characters of typing), so legacy
programs tend to use the angle bracket version as well.
One word of caution about the angle bracket syntax: if the only thing
inside the angle brackets is just a simple scalar variable, then angle
brackets take on their more familiar meaning of ``read a line from a
filehandle''. But here, the filehandle is an ``indirect'' filehandle,
meaning that the variable contains the name of or a reference to a
filehandle. If you're not sure whether you'll be getting a glob or
not, then always use the glob
named operator.
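Here's a sketch of the trap and the unambiguous alternative (assuming the pattern arrives in a variable at runtime):

```perl
use strict;
use warnings;

my $pattern = "*.c";

# my @lines = <$pattern>;   # not a glob! a lone simple scalar in angle
#                           # brackets means "read from an indirect
#                           # filehandle named in $pattern"

my @files = glob $pattern;  # always a glob, no matter what's inside
```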
Globbing can perform anything the shell normally does. For example, get all the files in the current directory or any first-level nested subdirectory that end in .c:
my @many_c_files = <*.c */*.c>;
So here, we're potentially reading many subdirectories. Directories two levels down are still ignored, however. The normal Perl globbing syntax doesn't have an entry for ``recursively descend'', despite the extended globbing forms of many modern shells, which can indeed handle that.
Also, just as in the shell, files that begin with a dot will not have their dot matched by a wildcard character. Instead, the dot must be matched explicitly, giving us the easy equivalent of the ``hidden file''. To get all the files, we need two separate glob patterns, perhaps both invoked in the same expression:
my @everything = <.* *>;
The resulting list includes all files, with or without dots. The separate lists are sorted individually, but not merged. If you want the entire list sorted together, you've got to manage that on your own:
my @sorted_everything = sort <* .*>;
Of course, the output of glob can easily be used as the input to other operations. Here's the equivalent of rm -i *:
  for my $filename (<*>) {
    print "remove $filename? ";
    next unless <STDIN> =~ /^y/i;
    unlink $filename or warn "Cannot unlink $filename: $!";
  }
As simple as globbing is to use and understand, it doesn't come without its drawbacks. Prior to the 5.6 release of Perl (and dating all the way back to Perl 1.0 in 1987), globbing was implemented by literally forking off a C-shell behind the scenes (or a Bourne-style shell if C-shell was not available) and asking that shell to expand the globs. This had several consequences.
For one thing, the globbing syntax was actually slightly dependent on the particular shell being used behind the scenes. As long as you stayed with the simple star, question mark, square brackets stuff, you'd be fine, but if perchance you took advantage of curly-brace alternations, and then moved to a box without that, your program would blow up.
Second, the syntax was sensitive to shell special characters. For example, one of my ``Just another Perl hacker'' signatures read something like this:
print <;echo Just another Perl hacker,>;
which works because the child shell's glob operation was terminated by the semicolon, and then we began a new operation, which would show up as a single filename to the shell-to-Perl interface, which then became the return value from the globbing operation, and dumped out to STDOUT via print. Scary, when you then consider the full security implications of passing an arbitrary string as part of a glob pattern.
Third, because the shell was a separate process, each glob incurred the expense of a fork/exec operation. Fine if you do it once or twice in a program, but prohibitively expensive to get, say, every file of every directory below a given large directory.
And finally, and perhaps most significantly, the classic C-shell had a
fixed-size buffer for globbing expansion (roughly 10K if my memory
serves me right). If you've ever gone into a ``fat'' directory (with
lots of long names) and typed rm *
only to be greeted with
``NCARGMAX exceeded'' or some equally obscure error message, you've seen
this in action. So, the C-shell can expand only so many names, but
since Perl is counting on the C-shell for a complete expansion, Perl
also loses.
And this led most people who wanted to write robust, efficient, and secure directory lookups to avoid glob entirely, and jump directly to a lower-level mechanism for directory access: the directory handle.
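(As of Perl 5.6, globbing is instead handled internally by the bundled File::Glob module, with no shell involved; it can even be called explicitly. A quick sketch:)

```perl
# Perl 5.6's glob is implemented by the core File::Glob module,
# so there's no child shell and no fixed-size buffer:
use File::Glob ':glob';

my @c_source = bsd_glob "*.c";
```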
A directory handle is like a filehandle: you open it (with opendir), read from it (with readdir), and perhaps close it when you are done (with closedir). I say perhaps because directory handles, like filehandles, close automatically at the end of the program, or whenever the handle is successfully reopened.
In a scalar context, readdir
returns one item at a time. In a list
context, readdir
returns all items, again, just like a filehandle.
But what items?
Well, we'll get back the contents of the directory as a list of names. This list of names is not sorted in any particular order (for speed), and consists of the basenames only (everything after the final slash of a pathname) of the entries within that directory. These entries include everything: plain files, directories, and even Unix-domain sockets. The list also includes files that begin with a dot, and especially the mandatory entries of ``.'' and ``..''.
So, to dump everything in the current directory, we could use this:
  opendir HERE, "." or die "Cannot opendir .: $!";
  foreach my $name (readdir HERE) {
    print "one name in the current directory is $name\n";
  }
  closedir HERE;
The closedir isn't necessary here, but does free up a few resources that would otherwise be tied up until the program's end. This listing will have the same order and contents as an ls -f command, or a find . -print if there were no subdirectories. To get just the same thing as ls with no options, we'll need to toss the entries that begin with a dot, and sort the rest alphabetically:
  opendir HERE, "." or die "opendir: $!";
  foreach my $name (sort grep !/^\./, readdir HERE) {
    print "$name\n";
  }
  closedir HERE;
Because the names are simply the names within the directory, and not full pathnames, they aren't directly usable or testable. For example, consider this incorrect code to pick out all the directories of a given directory:
  opendir THERE, "/usr" or die "opendir: $!";
  foreach my $name (readdir THERE) {
    next unless -d $name;    # THIS IS WRONG
    print "one directory in /usr is $name\n";
  }
This is wrong because one of the names returned by readdir will be, say, lib, which we are then testing for directory-ness as if it were in the current directory! One solution is to patch up the name to include the full path before we use it with file tests or further access. Here's a refined solution that skips over dot-files as well, making all directories immediately under /usr mode 755 (read/write/execute for the owner, and read/execute for group and others):
  opendir THERE, "/usr" or die "opendir: $!";
  foreach my $name (readdir THERE) {
    next if $name =~ /^\./;          # skip over dot files
    my $fullname = "/usr/$name";     # get full name
    next unless -d $fullname;
    chmod 0755, $fullname or warn "Cannot chmod $fullname: $!";
  }
  closedir THERE;
What about subdirectories? What if we wanted to examine every
directory recursively below /usr
looking for world writable
entries? Well, we could certainly use find
for that, but in Perl,
it's not much harder to write this:
  use File::Find;
  find sub {
    return unless -d;               # is it a directory?
    return unless (stat)[2] & 2;    # and world writable?
    print "$File::Find::name is world writable!\n";
  }, "/usr";
The initial use imports the find subroutine. This subroutine expects a ``coderef'' as its first argument, which we're providing by using an anonymous subroutine. The remaining arguments to find are a list of top-level starting points, below which find will locate all names recursively. For each found entry, find calls the subroutine, passing the basename of the entry in $_ and the full name in $File::Find::name. In addition, the working directory has been changed to that of the entry (for speed on further file tests).
So in this example, I tested $_ to see if it was a directory, and if so, then further tested its ``stat 2'' element (the tricky one with the type encoded along with the permission values) to see if the second bit from the right was set. That's the world-writable bit. If both of those tests were successful, we sail on to print out the full name. (Printing $_ there would not be very helpful, since that's just the basename.)
Note that this subroutine, in its simplicity, will actually print each name twice: once while we are looking at the directory ``from above'', and once when the name is passed as ``dot'' in the ``current directory''. To reject that, you could add:
return if $_ eq "." or $_ eq "..";
near the beginning of the subroutine. Now we'll get each name just once, although we'd never find /usr itself reported as a world-writable directory. For that, it'd take a little more sophisticated juggling.
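Putting those pieces together, a sketch of the full subroutine:

```perl
use strict;
use warnings;
use File::Find;

find sub {
  return if $_ eq "." or $_ eq "..";  # skip self and parent entries
  return unless -d;                   # directories only
  return unless (stat)[2] & 2;        # world-writable bit set?
  print "$File::Find::name is world writable!\n";
}, "/usr";
```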
The File::Find module is included with Perl (from all the way back to Perl 5.000), so there's no excuse not to use it whenever you think of anything to do with recursing down directories. There's a version that does ``depth first'' recursion (giving you the names before the containing directory) and a mechanism for pruning the tree if you head into areas of non-interest. The version included with Perl 5.6 also has the ability to follow symlinks and provide sorted names, so check the documentation to stay up to date.
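For example, setting $File::Find::prune inside the callback keeps find from descending any further, and finddepth provides the depth-first order (a sketch; the CVS directory name here is just an arbitrary example):

```perl
use strict;
use warnings;
use File::Find;

# Skip any directory named "CVS" and everything beneath it.
find sub {
  if (-d and $_ eq "CVS") {
    $File::Find::prune = 1;   # don't descend into this one
    return;
  }
  print "$File::Find::name\n";
}, ".";

# finddepth reports a directory's contents before the directory
# itself, which is handy when removing whole trees.
finddepth sub {
  print "depth first: $File::Find::name\n";
}, ".";
```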
I hope this directory assistance has got your number now. Until next time, enjoy!