Copyright Notice
This text is copyright by InfoStrada Communications, Inc., and is used with their permission. Further distribution or use is not permitted. This text has appeared in an edited form in Linux Magazine. However, the version you are reading here is as the author originally submitted the article for publication, before their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Linux Magazine Column 21 (Feb 2001)
[suggested title: 'Getting some directory assistance']
Most Perl scripts aren't doing anything glamorous. They're the workhorses of your system, moving things around while you aren't necessarily looking, and handling those mundane repetitive tasks.
Those tasks often operate on a series of filenames, perhaps not known in advance, but obtained by looking at the contents of a directory. Perl has a few different primary means of getting lists of names, so let's take a look at them.
The simplest to use and understand is globbing. Globbing is what
the shell does when you use echo *.c
to get a list of all the C
source files in a directory. The term globbing comes from the use of
the old /etc/glob
program in early versions of Unix, with a name
derived from something like ``global expansion''.
Now, most programs running from the shell don't have to know how to do globbing for themselves. For example, the rm command in:
$ rm *.c
never sees the *.c. Instead, the shell expands (globs) the filename pattern, comes up with a list of names, and then hands those names to rm as its arguments. This is why the rm command cannot help you when you've accidentally typed a space between the asterisk and the period: it never sees the asterisk, but rather a list of explicit names, just as if you'd laboriously typed all of them directly.
Similarly, if you invoke your Perl program with a glob pattern on the command line:
$ my_perl_prog *.c
then your Perl program already has the expanded values, and nothing further needs to be done to process the elements of @ARGV.
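For instance, a script that reports the size of whatever the shell handed it needs nothing glob-related at all (a minimal sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The shell has already expanded any *.c on the command line,
# so @ARGV holds ordinary filenames.
for my $file (@ARGV) {
  next unless -f $file;    # skip anything that isn't a plain file
  printf "%-30s %8d bytes\n", $file, -s $file;
}
```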
But sometimes, you don't have the luxury in your Perl program of having the files already passed on the command line. What to do then? Use the glob operator from within Perl!
my @c_source = glob "*.c";
Here, @c_source will be loaded up with all the names in the current directory that don't begin with a dot but do end in .c, just as if I had handed that to the shell for expansion. To get all the C source files and object files, I can use either of:
my @c_source_and_object = glob "*.c *.o";
or
my @c_source_and_object = glob "*.[co]";
Notice that multiple patterns can be specified in one glob by separating them with whitespace, similar to the shell, or we can use a character-class-like entry.
Another way to write the glob
operator is to put angle brackets
around the glob pattern:
my @c_source_and_object = <*.c *.o>;
The value between the angle brackets is interpreted as if it were a double-quoted string, so Perl variables become their current Perl values before the glob is evaluated. This lets us vary the patterns at runtime:
  for my $suffix (qw(.c .o .out)) {
    $files_with{$suffix} = [<*$suffix>];
  }
Here, I'm creating a hash of arrayrefs, so $files_with{".o"} will be an arrayref of all matching files.
Either syntax is fine: the glob
named operator is a fairly recent
invention (and takes five more characters of typing), so legacy
programs tend to use the angle bracket version as well.
One word of caution about the angle bracket syntax: if the only thing
inside the angle brackets is just a simple scalar variable, then angle
brackets take on their more familiar meaning of ``read a line from a
filehandle''. But here, the filehandle is an ``indirect'' filehandle,
meaning that the variable contains the name of or a reference to a
filehandle. If you're not sure whether you'll be getting a glob or
not, then always use the glob
named operator.
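Here's a sketch of the trap and the unambiguous alternative (assuming the pattern arrives in a variable at runtime):

```perl
use strict;
use warnings;

my $pattern = "*.c";

# my @lines = <$pattern>;   # not a glob! a lone simple scalar in angle
#                           # brackets means "read from an indirect
#                           # filehandle named in $pattern"

my @files = glob $pattern;  # always a glob, no matter what's inside
```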
Globbing can perform anything the shell normally does. For example, get all the files in the current directory or any first-level nested subdirectory that end in .c:
my @many_c_files = <*.c */*.c>;
So here, we're potentially reading many subdirectories. Directories two levels down are still ignored, however. The normal Perl globbing syntax doesn't have an entry for ``recursively descend'', despite the extended globbing forms of many modern shells, which can indeed handle that.
Also, just as in the shell, files that begin with a dot will not have their dot matched by a wildcard character. Instead, the dot must be matched explicitly, giving us the easy equivalent of the ``hidden file''. To get all the files, we need two separate glob patterns, perhaps both invoked in the same expression:
my @everything = <.* *>;
The resulting list includes all files, with or without dots. The separate lists are sorted individually, but not merged. If you want the entire list sorted together, you've got to manage that on your own:
my @sorted_everything = sort <* .*>;
Of course, the output of glob can easily be used as the input to other operations. Here's the equivalent of rm -i *:
  for my $filename (<*>) {
    print "remove $filename? ";
    next unless <STDIN> =~ /^y/i;
    unlink $filename or warn "Cannot unlink $filename: $!";
  }
As simple as globbing is to use and understand, it doesn't come without its drawbacks. Prior to the 5.6 release of Perl (and dating all the way back to Perl 1.0 in 1987), globbing was implemented by literally forking off a C-shell behind the scenes (or a Bourne-style shell if C-shell was not available) and asking that shell to expand the globs. This had several consequences.
For one thing, the globbing syntax was actually slightly dependent on the particular shell being used behind the scenes. As long as you stayed with the simple star, question mark, square brackets stuff, you'd be fine, but if perchance you took advantage of curly-brace alternations, and then moved to a box without that, your program would blow up.
Second, the syntax was sensitive to shell special characters. For example, one of my ``Just another Perl hacker'' signatures read something like this:
print <;echo Just another Perl hacker,>;
which works because the child shell's glob operation was terminated by the semicolon, and then we began a new operation, which would show up as a single filename to the shell-to-Perl interface, which then became the return value from the globbing operation, and dumped out to STDOUT via print. Scary, when you then consider the full security implications of passing an arbitrary string as part of a glob pattern.
Third, because the shell was a separate process, each glob incurred the expense of a fork/exec operation. Fine if you do it once or twice in a program, but prohibitively expensive to get, say, every file of every directory below a given large directory.
And finally, and perhaps most significantly, the classic C-shell had a
fixed-size buffer for globbing expansion (roughly 10K if my memory
serves me right). If you've ever gone into a ``fat'' directory (with
lots of long names) and typed rm *
only to be greeted with
``NCARGMAX exceeded'' or some equally obscure error message, you've seen
this in action. So, the C-shell can expand only so many names, but
since Perl is counting on the C-shell for a complete expansion, Perl
also loses.
And this led most people who wanted to write robust, efficient, and secure directory lookups to avoid glob entirely, and jump directly to a lower-level mechanism for directory access: the directory handle.
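(As of Perl 5.6, globbing is instead handled internally by the bundled File::Glob module, with no shell involved; it can even be called explicitly. A quick sketch:)

```perl
# Perl 5.6's glob is implemented by the core File::Glob module,
# so there's no child shell and no fixed-size buffer:
use File::Glob ':glob';

my @c_source = bsd_glob "*.c";
```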
A directory handle is like a filehandle: you open it (with opendir), read from it (with readdir), and perhaps close it when you are done (with closedir). I say perhaps because directory handles, like filehandles, close automatically at the end of the program, or whenever the handle is successfully reopened.
In a scalar context, readdir
returns one item at a time. In a list
context, readdir
returns all items, again, just like a filehandle.
But what items?
Well, we'll get back the contents of the directory as a list of names. This list of names is not sorted in any particular order (for speed), and consists of the basenames only (everything after the final slash of a pathname) of the entries within that directory. These entries include everything: plain files, directories, and even Unix-domain sockets. The list also includes files that begin with a dot, and especially the mandatory entries of ``.'' and ``..''.
So, to dump everything in the current directory, we could use this:
  opendir HERE, "." or die "Cannot opendir .: $!";
  foreach my $name (readdir HERE) {
    print "one name in the current directory is $name\n";
  }
  closedir HERE;
The closedir isn't necessary here, but does free up a few resources that would otherwise be tied up until the program's end. This listing will have the same order and contents as an ls -f command, or a find . -print if there were no subdirectories. To get just the same thing as ls with no options, we'll need to toss the entries that begin with a dot, and sort the rest alphabetically:
  opendir HERE, "." or die "opendir: $!";
  foreach my $name (sort grep !/^\./, readdir HERE) {
    print "$name\n";
  }
  closedir HERE;
Because the names are simply the names within the directory, and not full pathnames, they aren't directly usable or testable. For example, consider this incorrect code to pick out all the directories of a given directory:
  opendir THERE, "/usr" or die "opendir: $!";
  foreach my $name (readdir THERE) {
    next unless -d $name;    # THIS IS WRONG
    print "one directory in /usr is $name\n";
  }
This is wrong because one of the names returned by readdir will be, say, lib, which we are then testing for directory-ness as if it were in the current directory! One solution is to patch up the name to include the full path before we use it with file tests or further access. Here's a refined solution that skips over dot-files as well, making all directories immediately under /usr mode 755 (read/write/execute for the owner, and read/execute for group and others):
  opendir THERE, "/usr" or die "opendir: $!";
  foreach my $name (readdir THERE) {
    next if $name =~ /^\./;          # skip over dot files
    my $fullname = "/usr/$name";     # get full name
    next unless -d $fullname;
    chmod 0755, $fullname or warn "Cannot chmod $fullname: $!";
  }
  closedir THERE;
What about subdirectories? What if we wanted to examine every
directory recursively below /usr
looking for world writable
entries? Well, we could certainly use find
for that, but in Perl,
it's not much harder to write this:
  use File::Find;
  find sub {
    return unless -d;               # is it a directory?
    return unless (stat)[2] & 2;    # and world writable?
    print "$File::Find::name is world writable!\n";
  }, "/usr";
The initial use imports the find subroutine. This subroutine expects a ``coderef'' as its first argument, which we're providing by using an anonymous subroutine. The remaining arguments to find are a list of top-level starting points, below which find will locate all names recursively. For each found entry, find calls the subroutine, passing the basename of the entry in $_ and the full name in $File::Find::name. In addition, the working directory has been changed to that of the entry (for speed on further file tests).
So in this example, I tested $_ to see if it was a directory, and if so, then further tested its ``stat 2'' element (the tricky one with the type encoded along with the permission values) to see if the second bit from the right was set. That's the world-writable bit. If both of those tests were successful, we sail on to print out the full name. (Printing $_ there would not be very helpful, since that's just the basename.)
Note that this subroutine, in its simplicity, will actually print each name twice: once while we are looking at the directory ``from above'', and once when the name is passed as ``dot'' in the ``current directory''. To reject that, you could add:
return if $_ eq "." or $_ eq "..";
near the beginning of the subroutine. Now we'll get each name just once, although we'd never find /usr itself reported as a world-writable directory. For that, it'd take a little more sophisticated juggling.
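Putting those pieces together, a sketch of the full subroutine:

```perl
use strict;
use warnings;
use File::Find;

find sub {
  return if $_ eq "." or $_ eq "..";  # skip self and parent entries
  return unless -d;                   # directories only
  return unless (stat)[2] & 2;        # world-writable bit set?
  print "$File::Find::name is world writable!\n";
}, "/usr";
```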
The File::Find module is included with Perl (from all the way back to Perl 5.000), so there's no excuse not to use it whenever you think of anything to do with recursing down directories. There's a version that does ``depth first'' recursion (giving you the names before the containing directory) and a mechanism for pruning the tree if you head into areas of non-interest. The version included with Perl 5.6 also has the ability to follow symlinks and provide sorted names, so check the documentation to stay up to date.
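For example, setting $File::Find::prune inside the callback keeps find from descending any further, and finddepth provides the depth-first order (a sketch; the CVS directory name here is just an arbitrary example):

```perl
use strict;
use warnings;
use File::Find;

# Skip any directory named "CVS" and everything beneath it.
find sub {
  if (-d and $_ eq "CVS") {
    $File::Find::prune = 1;   # don't descend into this one
    return;
  }
  print "$File::Find::name\n";
}, ".";

# finddepth reports a directory's contents before the directory
# itself, which is handy when removing whole trees.
finddepth sub {
  print "depth first: $File::Find::name\n";
}, ".";
```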
I hope this directory assistance has got your number now. Until next time, enjoy!