Lesson 16

Over the top

So now you know how to subvert perl's hashes to do whatever you want (check out Tie::Hash::Cannabinol to see how to get perl stoned). How about subverting its operators? When does 2+2=5? When programming with objects, obviously. And how do we undermine the whole basis of arithmetic? With the overload pragma. If you cast your mind back to when I first started nagging about use strict, you might remember that I mentioned that strict was a pragmatic module. Pragmatic modules (pragmata) are a little more special than your average run-of-the-mill modules in that they manipulate the way in which perl parses and executes your code, rather than what functionality is available. overload is another pragma, which allows you to change the way in which perl does maths and quoting. Since I have issues with Imperial units, what happens if you want 1m + 1" = 1.025m? We can use overload to change how addition works, and purge inches forever. Since overloading is something we can only do to objects, this is obviously part of an object oriented approach. Hence we'll need to write a perl class which can take numbers with their units and store them as objects:

package PurgeImperial;
use strict;
sub new
{
    my ( $class, $value ) = @_;
    my ( $number, $unit ) = split /\s+/, $value;
    return unless $unit   =~ /^(in|m)$/;
    if ( $unit eq "in" ){ $number *= 0.0254; $unit = "m" }
    return bless { number => $number, unit => $unit }, $class;
}

To use the module, we simply create objects like my $imperial = PurgeImperial->new( "6 in" ); with a number and its unit separated by spaces. Internally, the object silently converts inches to metres and stores the new value, with its unit "m". Simple. Now for the overloading. To overload an operator, like addition, we need to use the overload pragma, and tell it the names of methods to replace operators with. Between use strict and sub new, we need to wodge this in:

use overload ( '+' => "add" );

use overload takes a hash as an argument. Each key/value pair is an operator to overload, and a method that will be used to implement it. We then need to write the add method, which is capable of adding two of our objects together in the right way, and produce a suitable new object:

sub add
{
    my ( $object1, $object2, $was_reversed ) = @_;
    my $class = ref $object1;
    $object2 = $class->new( $object2 )
        unless UNIVERSAL::isa( $object2, $class );
    ( $object1, $object2 ) = ( $object2, $object1 )
        if $was_reversed; # see explanation below
    my $sum = $object1->{number} +  $object2->{number};
    return $class->new( "$sum m" );
}

The add method is passed three things when it is called by overload: the first two are the objects involved in the addition, the third is a value (true or false) that tells you whether the objects have been reversed. The reason for this is that the first argument to an overloaded method is always an object, hence if you ask perl to do this:

3 + $overloaded_object;

where we are trying to add an object to something that isn't an object, perl will actually ask overload to do this:

add( $overloaded_object, 3, "REVERSED!!!" );

or similar. For addition, this (probably) doesn't make any difference, but for subtraction, you must be careful:

$overloaded_object - 3;

is called as:

subtract( $overloaded_object, 3 );

as you would expect. However,

3 - $overloaded_object;

is called as:

subtract( $overloaded_object, 3, "REVERSED!!!!" );

and unless you take note of the trueness of the reversal, you'll end up with the same result from both, which is unlikely to be what you're after! Anyway, the method is otherwise simple enough to understand: we first find out what sort of object(s) we are trying to add (remember that if someone inherits your class, this may not necessarily be PurgeImperial: never hardwire a class name in), by asking for ref $object1. Then we ensure that the second object passed to add is capable of being added to our object: we do this by calling the isa() method, which all objects inherit from the UNIVERSAL class (the root class of all perl objects). This method returns TRUE if an object is a member of our class (or of its possible subclasses). If the second object is a PurgeImperial, then we carry on, otherwise we attempt to convert it by calling new(). Then we (pointlessly) unreverse the objects and add their innards together appropriately. Finally, we create a new object of the appropriate sort $class->new( "$sum m" ), thereby forcing the answer to be in metres. Hahaha. So let's see what the answer is:

#!/usr/bin/perl
use strict;
use PurgeImperial;
my $x = PurgeImperial->new( "1 in" );
my $y = PurgeImperial->new( "1 m" );
my $z = $a + $b;
print "$z\n";
PurgeImperialOperation `""': no method found, 
  argument in overloaded package PurgeImperial

Ah. What have we done wrong? Well, objects aren't inherently printable things: the nearest we can get to printing an object is:

use Data::Dumper;
print Dumper \$z;
$VAR1 = \bless( {'unit' => 'm', 'number' => '1.0254'}, 'PurgeImperial');

Or is it? In fact, we can overload the quote operator too, so that when we try to print "$z\n" our object returns something printable and useful. First we need to change the overloading to reflect this:

use overload ( '+' => "add", q("") => "quote" );

The "" is the bizarre name of the quote operator, which itself needs quoting or strict will whinge, so we use the q() operator. Ugh. Just do it, OK. We then need to create a quote method:

sub quote
{
    my ( $object ) = @_;
    return join " ", $object->{number}, $object->{unit};
}

If we now run the print "$z\n" code, we get:

1.0254 m

In fact, we can even do this:

print PurgeImperial->new( "12 in" );
0.3048 m

Hahaha. I have made perl imperial-phobic: it can't even say "inch" anymore. Using the overload pragma can be extremely useful, as it adds to the opaqueness and simplicity of your class API: if $x and $y are overloaded objects, it's far more intuitive to type $x + $y than $c = $a->add( $b ) or similar. The overloading on quoting is also useful for similar reasons: it's much easier to understand 1.0254 m than the output of Data::Dumper (particularly if the object's innards are gnarly). Besides, spilling an object's guts is bad OO form. Unless it's absolutely necessary ☺

It would be fairly easy to implement some sort of unit/number association with techniques similar to the above. I gave it a go for a laugh a while back, when my perl skills weren't very good, but I eventually got bored and confused when I tried to work out what the units of $answer in something like:

my $answer = ( Units->new( "23 mWb m-2" ) ** 2 ) / 
    ( Units->new( "35 cm ns-1" ) );

should be, and worrying about whether radians were really dimensionless, whether you could take the logarithm of a unit-bearing number, and what 1 Bq * 1 Hz was? 1 BqHz? 1 s−2? 1 Bq−2? Ugh. I'll leave it to someone more competent.

The comparison operators can also be overloaded (so you can make 6 in < 3 m despite the fact that 6 > 3). Unfortunately there are rather a lot of these operators (< > <= >= == != lt gt le ge eq ne), but fortunately, for most simple cases, you only need to overload <=> or cmp (the comparison operators) and perl will work out the rest from that (since knowing where something ought to come in an ordered list means also knowing the answer to whether two objects are equal, etc.):

use overload ( "<=>" => "compare" );
sub compare
{
    my ( $object1, $object2, $was_reversed ) = @_;
    ( $object1, $object2 ) = ( $object2, $object1 ) if $was_reversed;
        # see explanation below
    return $object1->{number} <=> $object2->{number};
        # or explicitly work out the 0, +1 and -1 for complex cases
}

Or similar. Likewise, overload can bootstrap $a++ as $a = $a + 1, so if your class's add method can deal with addition of a pure number to the object, then you don't have to overload ++ too.

A final neat trick you can do with overloading is to overload perl's interpretation of string and number constants, so that when we come across the string '4 in', it will automatically be converted to an object of the PurgeImperial class, without even having to bother with all that tiresome new() stuff. If we add this to our class:

my %handlers =
(
    q => sub
    {
        print "doing $_[0]";
        return __PACKAGE__->new( $_[0] ) || $_[1];
    }
);
sub import
{
    overload::constant( %handlers ) if $_[ 1 ] eq ':constants';
}
sub unimport
{
    overload::remove_constant( q => undef );
}

the when perl finds strings or numbers in a script, it calls the overload module to deal with them. To use this functionality, we just need to initialise the the overload::constants function with a hash. This hash (called %handlers here) is keyed on the name of the constant found (integer or float for numbers, q for strings), and the values are coderefs that will be called when perl comes across a constant of the appropriate type.

The coderef will be called with three arguments: the literal data that the parser found in your code (4 in), the way it would normally interpret this (as the string "4 in"), and the context in which it was found (q for single quoted strings, qq for double quoted, etc.). All we need to do is return our own interpretation of the constant $_[0], to replace perl's default one, $_[1]: we do this simply by calling __PACKAGE__->new( $_[0]), which calls the constructor in the current package (here PurgeImperial, but it could be a subclass: __PACKAGE__ contains whatever the current package name is). If we can't coerce $_[0] into an object of our type (if the constructor returns undef), then we simply return the default interpretation, $_[1]. Voilá.

To actually get this requires one final hoop to be jumped: we could just initialise the overload::constant function when our module is loaded. However, this is a little rude, as this involves some quite unexpected data manipulation, and we want to make sure that our user really wants this functionality. To make it optional, we write a custom import subroutine. Remember that when someone uses a module:

use modulename argument1, argument2, ...;

perl calls the import() subroutine in the module with the arguments passed to use...

module::import( modulename, argument1, argument2, ...)

Usually (if we inherit from Exporter.pm) this aliases functions into your namespace so you can call find() rather than File::Find::find(), but we can write our own custom import() very easily. We do this here: only if PurgeImperial is called like this:

use PurgeImperial ':constants';

will our custom import function prime overload::constant with our funny handlers. We also define an unimport() subroutine, so if someone calls

no PurgeImperial;

somewhere in their script, it will politely remove this functionality.

Anyway, for more details of overloading, perldoc overload: you'll find you can overload pretty much any of perl's built in operators, from sin to << to "". I think the only exceptions are the logical operators && and ||. Go and have a play, and feel free to mess with the PurgeImperial class (shoddy though it is).

Dispatching with class

Somewhat akin to overloading is the fabulous bag of tricks called Class::Multimethods. The reason that people go funny about objects and classes is that they allow you to have many subroutines called, for example, feed. You can have:

package Cat;
sub new  { return bless { stomach => 'nothing' }, $_[0] }
sub feed {  $_[0]->{ stomach } = $_[1] }
sub last_meal { $_[0]->{ stomach } }
package Dog;
sub new  { return bless { stomach => 'nothing' }, $_[0] }
sub feed {  $_[0]->{ stomach } = $_[1] }
sub last_meal { $_[0]->{ stomach } }
package main;
my $kitty = new Cat;
feed $kitty 'mice';
print "Kitty last ate ", $kitty->last_meal, "\n";
my $pooch = new Dog;
feed $pooch 'mechanically recovered meat';
print "Pooch last ate ", $pooch->last_meal, "\n";
Kitty last ate mice
Pooch last ate mechanically recovered meat

Now, you would probably not write those feed statements like that: you'd probably say $kitty->feed( 'mice' );. However, this 'indirect object' syntax is perfectly valid, and serves to show you that what looks like a simple subroutine call to feed is dispatched polymorphically, i.e. different versions of feed are called depending on the class to which the first argument ($kitty or $pooch) belongs.

Polymorphic dispatch is spendidly handy, as you have no doubt seen. However, what happens if we decide that what we really want to do is feed the $kitty a Mouse object, rather than a string 'mouse'? We want the code to be bright enough to call a different subroutine/method based not just on the class of the first argument to feed but also on the classes of any other objects in the argument list.

$kitty->feed( 'mouse' );
# should call a different method to
$kitty->feed( Mouse->new );

This is called 'multimethod dispatch', and can be achieved through the use of the wonderful Class::Multimethods. To use Class::Multimethods, you need to define many different versions of the same method, each with a different signature (prototype, argument list, choose whichever syntax you prefer) of classes. It is often useful to define these multimethods in a superclass, from which your other classes inherit:

package Animal;
use Class::Multimethods;
# our multimethods consist of a name, a signature of classes, and a coderef
multimethod feed =>
    ( 'Cat', 'Mouse' ) =>
    sub {  $_[0]->{ stomach } = $_[1] };
# note that these are subroutine calls to the 'multimethod' 
# function exported from Class::Multimethods, so they must be 
# terminated by a semicolon
multimethod feed =>
    ( 'Dog', 'Mouse' ) =>
    sub {  warn "Unlikely"; return };
multimethod feed =>
    ( 'Cat', 'Cat' ) =>
    sub {  warn "Cannibalism!"; return };
# this one allows a cat to eat some string
multimethod feed =>
    ( 'Cat', '$' ) =>
    sub {  $_[0]->{ stomach } = $_[1] };
multimethod feed =>
    ( 'Dog', 'Cat' ) =>
    sub {  $_[0]->{ stomach } = $_[1]; $_[0]->{ felicide }++ };
package Cat;
our @ISA = qw ( Animal );
  # inherit multimethod definitions from superclass 'Animal'
use overload q("") => sub { 'a smug Cat object' };
sub new  { return bless { stomach => 'nothing' }, $_[0] }
sub last_meal { $_[0]->{ stomach } }
package Dog;
our @ISA = qw ( Animal );
use overload q("") => sub { 'a slobbery Dog object' };
sub new  { return bless { stomach => 'nothing' }, $_[0] }
sub last_meal { $_[0]->{ stomach } }
package Mouse;
our @ISA = qw ( Animal );
use overload q("") => sub { 'a crunchy Mouse object' };
sub new  { return bless { stomach => 'nothing' }, $_[0] }
sub last_meal { $_[0]->{ stomach } }
package main;
my $kitty = new Cat;
$kitty->feed('mice');
print "Kitty last ate ", $kitty->last_meal, "\n";
my $micky = new Mouse;
$kitty->feed( $micky );
print "Kitty last ate ", $kitty->last_meal, "\n";
my $pooch = new Dog;
$pooch->feed( $kitty );
print "Pooch, $pooch, last ate ", $pooch->last_meal, ", 
       which last ate ", $pooch->last_meal->last_meal, "\n";
Kitty last ate mice
Kitty last ate a crunchy Mouse object
Pooch, a slobbery Dog object, last ate a smug Cat object, 
  which last ate a crunchy Mouse object

We define lots of different variation on a theme, a multimethod for each eventuality. In theory we should create twelve (dog eats dog/cat/mouse/string, cat eats dog/cat/mouse/string, mouse eats dog/cat/mouse/string), but I am not that patient.

The cat is still capable of eating a mouse string, which we ensure by defining a multimethod that has $ as its second argument, but our code also ensures that it can eat Mouse objects and warn if it is being forced into cannibalism. In addition to the pseudoclass $, there is an equivalent pseudoclass # that you can use when (unquoted) numbers are dispatched to methods calls:

$kitty->feed( 3.142 );
# pie makes kitties puke
package Animal;
multimethod feed =>
    ( 'Cat', '#' ) =>
    sub {  warn "Kitties don't like pie"; $_[0]->{ stomach } = 'nothing'; };

Note that the code also uses overloading to ensure that the objects print out something nice when put in stringifying context (as provided by double-quotes and/or print).

Using tie, overload and Class::Multimethods allows you to pretty much do what you like to ensure that objects DWYM rather than what perl thinks they ought to do. A word of warning though: these techniques are very useful, but they do tend to slow perl down somewhat, so use them judiciously.

Sort it out

For want of something more relevant to round this installment off with, I have to resort to the random bag of tricks cop-out instead. First, constants, wantarray and prototypes, second, sorting and the hallowed Schwartzian Transform.

Prototypes and wantarray are a way of simulating perl functions, specifically, allowing you to write subroutines that behave differently in list and scalar context, and force list or scalar context on their arguments. Now, prototypes are rather broken in Perl, so if you can think up a better way of doing things without them, then do that instead. Prototypes look something like this:

sub force_scalar ($) { my $arg = shift; print "argument is $arg\n"; }
my @wood = qw( phellum phellogen cortex phloem cambium xylem pith );
force_scalar @wood;
argument is 7

The ($) between the sub's name and block tell perl that the subroutine expects a single scalar argument, and if it receives something else (like an array), it should force it into scalar context. More complex prototypes are possible, but herein lies a whole world of hurt: prototypes fail to DWIM most of the time, so be careful when you use them! Note also that if you define a subroutine before you call it, you don't need to use parentheses, hence force_scalar @wood, not force_scalar( @wood ). However, it doesn't hurt, and may save you from some unpleasantry. There are two main uses of prototypes that are quite useful: the first is to create constant subroutines:

sub constant { return 666; } # no prototype
print constant + 111;
666
sub constant () { return 666; } # void prototype
print constant + 111;
777

In the first case, there is no prototype. The unfortunate consequence of this is that print constant + 111 is parsed as:

print( constant (+111) );

The constant subroutine does not need an argument, but in the absence of a void prototype, it doesn't know this, so it grabs the signed integer +111 anyway, and then throws it away, hence the strange output. In the second case, we explicitly tell perl that constant takes no arguments with a void prototype () and then it parses as we mean:

print( constant() + 111 );

The second use of prototypes that isn't harmful is to pass things by reference without having to backwhack them. If you want to write a function that does to hashes what push does for arrays, you can use this:

sub hpush(\%@)
{
    my $href = shift; # note the $href, not a %hash
    while ( my ($k, $v) = splice( @_, 0, 2 ) )
        # use splice to 'pop2', as it were
    {
        $href->{$k} = $v;
    }
}
hpush
(
    %male,
    spermatium      => "Teliomycetidae",
    "pollen nuclei" => "Spermatophyta",
    sperm           => "Metazoa",
);

The hpush subroutine takes a prototype of (\% @). The first item (\%) tells perl to expect a hash as the first argument (a real, literal hash, with a % and everything, not just a list with fat commas =>), and to pass that hash by reference to the subroutine. The second item (@) tells it to expect a list as the second, third, fourth, etc, arguments. This allows you to call the subroutine as hpush( %hash, LIST ), rather than as hpush( \%hash, LIST ), so mimicking push( @array, LIST ), rather than the hypothetical and ugly push( \@array, LIST ).

If you remember that prototypes are for making functions look and behave like Perl built-ins and you'll be fine. Expect anything more from them, and it'll all end in tears.

Prototypes' mirror image is wantarray, which allows you to respond appropriately and polymorphically to being called in list or scalar context:

my $scalar = latinize( "maidenhair tree" );
print "Scalar: $scalar\n";
my @list = latinize( "monkey puzzle tree" );
print "List:\n";
print "\t$_\n" for @list;
sub latinize
{
    my $name = shift;
    my %dictionary =
    (
        "maidenhair tree"    => [ "Ginkgo", "biloba", "L." ],
        "monkey puzzle tree" => [ "Araucaria", "araucara", "Mol.(Koch)" ],
    );
    return wantarray ?
        @{ $dictionary{ $name } } :
            join " ", @{ $dictionary{$name} }[ 0, 1 ];
}
Scalar: Ginkgo biloba
List:
Araucaria
araucara
Mol.(Koch)

wantarray is TRUE if the sub is called in list context, and FALSE if called in scalar context. It's often used in this form at the end of a subroutine:

return wantarray ? LIST_VERSION : SCALAR_VERSION;

In the code above, the subroutine returns a Latin binomial (Genus species) in scalar context, and a list of Genus, species, author in list context. It is quite possible to go overboard on wantarray, and have subroutines that always respond in different ways to different contexts. Beware such overhelpfulness: it's not always as helpful as you might think, if for no other reason than making the docs twice as long as they might need to be ☺

Enough on user-friendliness, how about something on efficiency? A common problem when performing sorts is that the internal implementation of sort has to access the data on which the sort is performed several times for each item in the unsorted list. If accessing this data is time consuming (if you have to use a system call for example), the sort can take rather a long time. The canonical example is trying to sort a list of filenames by their age:

#!/usr/bin/perl
use strict;
my @filenames = glob "data/bacteria/sequences/*.seq";
# glob calls a C-shell style globbing function
sub brute_force { -M $a <=> -M $b }
@sorted = sort brute_force @filenames;

There are two famous ways to get around this expensive look-up, the Orcish manoeuvre and the Schwartzian transform. Both squirrel away the key you are sorting on so that you only need to look it up once. The Orcish manoeuvre is the simplest to understand:

#!/usr/bin/perl
use strict;
my @filenames = glob "data/bacteria/sequences/*.seq";
{
    my %cached; # lexically scoped in a bare block to keep things tidy
    sub orcish
    {
        ( $cached{$a} ||= -M $a )
                  <=>
        ( $cached{$b} ||= -M $b )
    }
}
@sorted = sort orcish @filenames;

In the Orcish manoeuvre, we check to see if we have already looked up the value of -M $a or -M $b by requesting the value of $cached{$a} or $cached{$b}. If we haven't previously cached them, we store the age in the cache, so that next time we don't have to do the expensive system call to get the age. If we have previously cached them, we just get the value from the cache anyway. The name is derived from the ||=, which could be read Or-Cache if you like.

You can compare the Orcish and brute force techniques using the Benchmark module:

#!/usr/bin/perl
use strict;
use Benchmark;
my @filenames = glob "data/bacteria/sequences/*.seq";
sub brute_force { -M $a <=> -M $b }
{
    my %cached; # lexically scoped in a bare block to keep things tidy
    sub orcish
    {
        ( $cached{$a} ||= -M $a )
               <=>
        ( $cached{$b} ||= -M $b )
    }
}
timethese
(
    10000, # 10 000 iterations of each
    {
        "Brute force" => 
            sub { my @sorted = sort brute_force @filenames },
        "Orcish" => 
            sub { my @sorted = sort orcish @filenames },
    }
);
Benchmark: timing 10000 iterations of Brute force, Orcish...
Brute force: 133 wallclock secs 
  (30.02 usr + 102.50 sys = 132.52 CPU) @ 75.46/s (n=10000)
Orcish: 1 wallclock secs 
  ( 1.43 usr + 0.00 sys = 1.43 CPU) @ 6983.24/s (n=10000)

Benchmark's function timethese() takes a number of iterations (10 000 here) and a hashref of name => coderef pairs as arguments, and times how long the coderefs take to run. The Orcish manoeuvre is much faster.

The Schwartzian transform (named after Randal L. Schwartz) is even faster, but rather more complex at first glance:

#!/usr/bin/perl
use strict;
my @filenames = glob "data/bacteria/sequences/*.seq";
my @sorted =
map
{
    $_->[0]
    # 4. Construct the list of names by extracting 
    # them from the sorted arrayrefs
}
(
    sort
    {
        $a->[1] <=> $b->[1]
        # 3. Sort this list of arrayrefs based on their file ages
    }
    (
        map
        {
            [$_, -M]
            # 2. Convert them into list of [ filename, age ] arrayrefs
        }
        (
            @filenames
            # 1. Take a list of filenames
        )
    )
);
print "$_\n" for @sorted;

Take it backwards and it all makes sense: first we take @filenames, and create a list of [ filename, age ] arrayrefs using map. Then we sort based on the age ( $a->[1] <=> $b->[0] ), to create an ordered list of these [ filename, age ] arrayrefs. Finally, we grab the filenames $_->[0] and create a final list of sorted filenames using map again. This map-sort-map is very elegant, and very fast (the array lookups are quicker than the hash lookups in the Orcish manoeuvre), and definitely a technique worth keeping in your code toolbox.

Next…