Lesson 6

Line noise

Earlier in the lessons, we covered how to write comparisons:

if ( $string eq "Something we're interested in" )
{
    print "Ha, ha!";
}
else
{
    print "Boring";
}

What happens if there's more than one thing you're interested in, though? Writing a gigantic if elsif else statement will make your head spin, and you'll never be sure you've got every possible version of the thing you'd like to match. Take, for example, matching something as simple as a letter, number or underscore character:

if    ( $test eq "a" ) { print "OK" }
elsif ( $test eq "b" ) { print "OK" }
...
elsif ( $test eq "9" ) { print "OK" }
...
elsif ( $test eq "_" ) { print "OK" }
else                   { print "Not a letter, number or underscore!" }

This is a waste of time that will be over 63 eye-bending lines long, and still won't match the correct spelling of 'naïve', let alone хуёво in the original Cyrillic. So, from time immemorial, there have been things called 'regular expressions' or 'regexes', which are a way of explaining to a programming language the things you want to match in a neat and tidy fashion. Unfortunately, regexes are rather complicated, vary from language to language, and are really a language all of their own (a form of logic programming). Despite looking like executable line-noise, they are incredibly useful and powerful. So let's get down to them.

In Perl, regexes are written in quotes, of a sort. Here is such a regex:

/\w/

The / / are the 'quotes' for the regex: the regex itself is just the \w bit. This regex does exactly what those 63 lines of code would do badly: it matches a single letter, number or underscore. As you know, the \ is an escaping character: anything after it has some special meaning to perl. Unsurprisingly, the w stands for 'word', and \w will match a single occurrence of any 'word' character, which perl happily defines as a letter, underscore or number (i.e. things valid in the names of perl variables and subroutines). The posh name for this is a character class, which we'll cover later. For the moment, suffice it to say \w is the same as [A-Za-z0-9_] (only it can also cope with non-ASCII letters in modern, Unicode-aware, versions of perl). So the program we really want to write is:

#!/usr/bin/perl
use strict;
use warnings;
chomp( my $test = <STDIN> );
if   ( $test =~ /\w/ ) { print "OK" }
else { print "Not a letter, number or underscore!" }

The =~ is the 'binding operator'. It makes perl do the regex on the right to the variable on the left. So:

$test =~ /\w/;

and

$_ =~ /\w/;

will test $test and $_ for their wordiness respectively. In fact, (as usual), there's a shorthand for the second one: $_ is the default variable, and if perl finds a naked regex, it'll assume you mean $_ =~ naked_regex:

$_ =~ /\w/;

and

/\w/;

are exactly the same thing. If a regex matches, it returns TRUE, so:

print "Match" if /\w/;

will print "Match" only if $_ contains a word character.

Another useful way to write this is with a logical operator:

/\w/ && print "Match";

Which does the same thing: the && is a short-circuit operator, so if the first thing is FALSE (i.e. $_ is not wordy), it doesn't bother evaluating the second (i.e. print "Match"). If you want the match to fail (return FALSE) if it matches a word character, you can use !~ :

$test !~ /\w/;

or simply negate a naked regex with the not ! operator:

! /\w/;

To make our original program even tinier, we can use this default shorthand, and a new operator, the ? : operator:

#!/usr/bin/perl
use strict;
use warnings;
chomp( $_ = <STDIN>);
print /\w/ ? "OK" : "Not a letter, number or underscore!";

The ? : operator is like a tiny 'if else' statement:

print
(
    if $_ matches /\w/ ?
    then return "OK" :
    else return "Not a letter, number or underscore!"
);

A ? B : C will test A to see if it is TRUE. If it is TRUE, it returns B, if it is false, it returns C. print then gets handed whatever this statement returns, i.e. "OK", or "Not a letter…".
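To convince yourself that ? : really is just a compact if/else, here is a minimal sketch that works out the same message both ways and compares the results. The variable names ($test, $long_way, $short_way) and the sample input are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $test = "d_99";    # hypothetical input; try ";;;" as well

# The long way: a full if/else
my $long_way;
if   ( $test =~ /\w/ ) { $long_way = "OK" }
else                   { $long_way = "Not a letter, number or underscore!" }

# The short way: A ? B : C hands back B or C directly
my $short_way = $test =~ /\w/ ? "OK" : "Not a letter, number or underscore!";

print "Same answer both ways: $short_way\n" if $long_way eq $short_way;
```

Because ? : returns a value, it can sit anywhere a value is expected (the right-hand side of an assignment, an argument to print), which an if/else statement cannot.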

Now, what if we want to match more than one word character?

/\w+/;

will do just that: a + means 'one or more of the preceding character'. So this pattern will match a, bbbbbb, d_99 and so on. However, it will also match 999;;;plop, because 999 matches /\w+/ (perl never bothers going as far as the 'plop', as it's already satisfied the match with the 999 - in fact, a single 9 would have been enough). If we want to make sure that we match a thing made entirely out of word characters, we can use:

/^\w+$/;

The ^ means 'beginning' and $ means 'end', (beginning and end of the string you =~ bind to the regex). So this regex will only match strings composed purely of word characters.
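The difference between the anchored and unanchored patterns is easiest to see side by side. This is a small sketch, with sample strings and variable names invented for the purpose, that runs both regexes over the same inputs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for illustration
my @samples = ( "d_99", "999;;;plop", ";;;" );

my %result;    # remembers the verdicts for each sample
foreach my $sample ( @samples )
{
    # Unanchored: is there a word character anywhere in the string?
    my $anywhere = $sample =~ /\w+/   ? "yes" : "no";
    # Anchored: is the string word characters from start to end?
    my $entirely = $sample =~ /^\w+$/ ? "yes" : "no";
    $result{ $sample } = "$anywhere/$entirely";
    print "'$sample' contains word characters: $anywhere, is nothing but: $entirely\n";
}
```

Only d_99 passes both tests; 999;;;plop passes the first but fails the second, and ;;; fails both.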

Another useful escape sequence is \s, which matches a space character (including both literal spaces, and \n newlines, \r carriage returns, \t tabs and a few other obscure things). To match a space only, you can just use:

/ /;

and to match a newline:

/\n/;

\d will similarly match a single digit [0-9].

An extremely important thing you can do with a regex is to capture what perl actually matched. To do this, you use ( ) parentheses within the regex:

/^(\w+)$/;

If the regex matches $_, which it will if $_ is composed entirely of 'word' characters, then the thing that \w+ matched will now be squirrelled away by perl for your perusal. How do we get at these stored goodies? Well, there are two ways. The first is to use the pattern match variables, $1, $2, $3, $4 … Whatever was captured by the first set of parentheses will appear in $1, the second set in $2, and so on. So:

/(\w(\s+)(\w+))/;

If this actually matches $_, then the entire match \w\s+\w+ will be found in $1, the space characters \s+ will be found in $2, and the last word characters \w+ will be found in $3. Numbered variables are all that older perls give you; since Perl 5.10 you can also name a capture with (?<name>...) and read it back from the %+ hash, but there's still no simple way to build nested structures from regex captures (Perl 6, now known as Raku, has this ability). The second way is to assign the results of the regex to a list outside the regex:

my ( $wholething, $space, $word ) = $test =~ /(\w+(\s+)(\w+))/;

Here, if the regex matches, the values of $1, $2 and $3 will be dumped into $wholething, $space and $word respectively. You may have just noticed that a regex is a context sensitive thingy: in list context it returns the match variables, in scalar context, it returns TRUE or FALSE.
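Here is a short sketch of that context sensitivity, using an invented test string. The same sort of regex is run in scalar context (to get a true or false answer) and in list context (to get the captures):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $test = "hello   world";    # hypothetical input: two words, three spaces

# Scalar context: did it match at all?
my $matched = $test =~ /(\w+)(\s+)(\w+)/;

# List context: hand over the captures
my ( $first, $space, $second ) = $test =~ /(\w+)(\s+)(\w+)/;

print "Matched\n" if $matched;
print "first '$first', second '$second', separated by ",
      length( $space ), " spaces\n";

# A failed match in list context returns an empty list,
# so the variable on the left is simply left undefined
my ( $nothing ) = "nospaceshere" =~ /(\w+)\s+\w+/;
print "No match, no capture\n" if not defined $nothing;
```

The parentheses around the variables on the left are what create the list context: my $first = ... without them would just store the true/false result.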

By the way, if the regex:

/(\w(\s+)(\w+))/;

makes your eyes hurt, you can use the /x extended modifier, thus:

/
    (         # capture all of this into $1
        \w    # a word character
        (\s+) # some spaces, capture into $2
        (\w+) # some more word characters, capture into $3
    )
/x;

perl ignores any whitespace (and anything from a # to the end of the line) in a /x modified regex. Another very useful modifier is /i, which makes a regex case insensitive:

/^hello, world$/i;

will match "Hello, World", "hello, world" and indeed "HEllO, WoRLd". Note that in regex, unescaped letters and numbers mean just what you type: it's only escaped alphanumeric characters (\w word character, \d digit) and punctuation (+ one or more, ^ start of string) that mean something special.

Regexes are 'greedy' and 'lazy' by nature. If you have this situation:

#!/usr/bin/perl
use strict;
use warnings;
$_ = "hello everybody";
/(\w+)/;
print $1;
hello

$1 will end up with "hello" in it. This shows that regexes are lazy (they match at the first place in the string they can, so "hello", not "everybody"), and that they are greedy (the regex has matched the maximum possible number of letters, "hello", not just "h" or "hell"). The quantifier + always tries to greedily slurp up as many characters as it can and still match the whole sequence. The same applies to *, which is zero or more of the preceding character:

/^\w*$/;

will match any alpha_num3ric string, and also the empty string "". Another quantifier is the ?, which indicates you want to match zero or one of the preceding character:

/Steven?/;

Will match Steve or Steven.

The second most pointless regex in the world is this:

/.*/;

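The . is a special metacharacter that means 'any character except \n', so .* means 'zero or more non-newline characters'. Since zero characters will do, this regex matches any string at all, even one made up entirely of newlines (it just matches nothing at the very start). The most pointless regex of all is: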

/.*/s;

The /s modifier makes . match \n too (it treats a multiline string with embedded \n as a single line). So this regex matches zero or more of anything: it will always match, and always swallow the whole string, regardless of what $_ is!
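To see what the /s modifier changes in practice, this sketch captures whatever .* manages to match, with and without /s, from a made-up two-line string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "one\ntwo";    # hypothetical two-line string

# Without /s, the dot refuses to cross the newline
my ( $without_s ) = $text =~ /(.*)/;

# With /s, the dot matches the newline too, so .* takes everything
my ( $with_s )    = $text =~ /(.*)/s;

print "Without /s: ", length( $without_s ), " characters\n";
print "With /s:    ", length( $with_s ),    " characters\n";
```

Without /s the capture stops at "one"; with /s it runs on through the newline to the end of the string.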

You can specify exactly how many of a character you want using {n,m} braces:

/\w{3}/;   # matches exactly 3 alpha_num3rics
/\w{3,8}/; # matches 3 to 8 alpha_num3rics
/\w{3,}/;  # matches 3 or more alpha_num3rics
/\w{1,}/;  # pedant's version of /\w+/;
/\w{0,}/;  # pedant's version of /\w*/;
/\w{0,1}/; # pedant's version of /\w?/;

Sometimes, greedy regexes are not what you are after. You can stop regexes being greedy using the ? modifier on any of the quantifying metacharacters, i.e. * ? {n,m} and + . So:

#!/usr/bin/perl
use strict;
use warnings;
$_ = "hello everybody";
/(\w+?)/;
print $1;
h

This code returns the smallest possible match, rather than the greediest.
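The difference matters most when something has to follow the quantified part. In this sketch (the colon-separated string is invented for illustration), both patterns must find a colon after the capture, but they disagree about which colon to stop at:

```perl
#!/usr/bin/perl
use strict;
use warnings;

$_ = "aaa:bbb:ccc";    # hypothetical colon-separated string

# Greedy: grab as much as possible while still leaving a colon to match
my ( $greedy ) = /(.+):/;

# Non-greedy: grab as little as possible before the first colon
my ( $frugal ) = /(.+?):/;

print "Greedy:     $greedy\n";
print "Non-greedy: $frugal\n";
```

The greedy version captures aaa:bbb (everything up to the last colon), while the non-greedy version stops at aaa.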

Now, as I said earlier, \w is (as far as ASCII is concerned) equivalent to the 'character class':

[A-Za-z0-9_]

which is fairly self explanatory: brackets are used to surround a list of characters that comprise the class. Here are some useful(?) classes:

[aeiouAEIOU] # English vowels
[10]         # binary digits
[OIWAHMVX]   # bilaterally symmetrical capital letters

Any quantifier appearing after a character class applies to the whole character class: one or more of any of the characters in the brackets:

/[A-Z]+/

Matches one or more capital letters. You can define your own character classes using this notation, but please have a care for those who live outside the comfy world of 7 bits:

$_ = "El ni\x{F1}o"; # "El niño"
/(\x{00F1})/ and print "Yep, matched an n-tilde: $1";

The \x{00F1} (which can be abbreviated to \xF1 if this isn't ambiguous) is the Unicode code point of the character. You can also use named characters with the 'charnames' pragma...

use charnames ':full';
$_ = "El ni\x{F1}o";
/(\N{LATIN SMALL LETTER N WITH TILDE})/ and print "Yep, matched an n-tilde: $1";

For these codes and names, you might want to download Unibook. To save yourself even more time, you can use utf8:

use utf8;
my $word = "λόγος";
print "It's all Greek to me\n" if $word =~ /^\w+$/;

With this pragma, perl reads the source file itself as UTF-8, so the string holds real Greek letters rather than a jumble of bytes, and \w will match them. The same goes for Arabic, hiragana, hangul, and maybe one day even Egyptian hieroglyphs and tengwar. If this pragma is loaded, it will also allow you to create subroutines with non-ASCII names:

use utf8;
λόγος();
sub λόγος
{
        print "You'll be lucky if 'λόγος' prints correctly in your terminal!\n";
}

Most of the punctuation metacharacters (the characters like + and . and * that mean something special in a regex) lose their meta-nature inside a character class. Usually, you have to escape these metacharacters in a regex:

/\*/;
/ \+ \? /x;

The first will match a literal * character, the second a literal string of +?. But inside a character class, you don't need to bother:

/[*+.]+/;

will match one or more asterisks, periods or plusses: there's no need to escape them, because only a few characters mean something special inside a character class. The characters that do mean something special inside a character class include -, which makes a natural range, as you saw in the definition of \w (hence [A-Z], [a-f], [1-6], [0-9A-Fa-f], etc.), and ^, which means 'anything except…' iff it's the first item in the brackets. So:

/[^U]/;          # anything but the capital letter U
/[^A-Z0-9]/;     # anything but capital letters and numbers
/[A-Z^]/;        # capital letter or caret
/[^A-Z^]/;       # anything but a capital letter or caret
/[^A-Za-z0-9_]/; # anything but a word character.

Now, that last one could be written more easily as /[^\w]/ or even better as /\W/, the \W being Perl's shorthand for 'anything but an alpha_numeric'. Likewise \S is anything but whitespace, and \D is anything but a digit. If you do want to include a special character like - or ^ in a character class, you'll need to escape it:

/[\\\/\-\]]/; # no padding here: /x does not ignore spaces inside a character class

This will match a single backslash \ (which you always need to escape in Perl, whether in plain code, regex or in a character class). It will also match a forward slash /, a ] close bracket (this needs escaping, else perl will think it's the end of the character class prematurely) or a hyphen -. You may be wondering about why you have to escape the /. This is for similar reasons to escaping quotes in strings. If you don't escape the regex delimiter /, perl will think the regex finishes in the wrong place. Fortunately for matching path names under Unix, like qq() and q(), you can specify your own regex quotes with m() (for match):

m(\w+?);
m{[\\/\-\]]};

See that with the second, you no longer need to escape the /. This is very useful in situations where otherwise you'd be writing:

/C:\/perl\/bin\/perl\.exe/;

which is called leaning toothpick syndrome:

m{C:/perl/bin/perl\.exe};

is rather better. As with quoting strings, avoid clever and cute delimiters: stick to slashes, parentheses or braces unless you want the maintainer of your code to come calling with a machete.
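Negated character classes and braces as delimiters combine nicely when picking Unix path names apart. The following is only a sketch: the path is a made-up example, and real code would more likely reach for a module such as File::Basename.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";    # hypothetical path

# [^/]+ means 'one or more characters that are not a slash';
# anchoring it with $ picks out the last component of the path
my ( $file ) = $path =~ m{([^/]+)$};

# Everything before the final slash: greedy .* backs off to the last /
my ( $dir )  = $path =~ m{^(.*)/[^/]+$};

print "File: $file\nDirectory: $dir\n";
```

With / / as the delimiters, every one of those slashes would have needed a backslash in front of it.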

What else can you do with regexes? Well, you can specify alternatives:

/foo|bar/;

which will match either foo or bar, using the | or pipe-character. One problem with this is that sometimes you'll need to group things using parentheses:

/([Cc]ornelia|my snake) eats (\w+)/;

but now the interesting thing you're trying to capture (what [Cc]ornelia eats) is in $2, not $1, which may be OK, but if you'd rather not have spurious pattern match variables to ignore, you can use the grouping-but-not-capturing (?: ) regex extension:

my ( $food ) = /(?:[Cc]ornelia|my snake) eats (\w+)/;

The (?: ) allows grouping, but doesn't squirrel away a value into $1 or its friends, so it doesn't interfere with assigning captures to lists. There are dozens of other regex extensions looking like (?...) in Perl regexes, which you can explore yourself (they also make Perl's regular expressions highly irregular to computer scientists).
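A quick sketch of the difference, using an invented sentence: with ordinary parentheses the alternation takes up the first capture slot, while with (?: ) the food is the only thing captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sentence = "my snake eats mice";    # hypothetical input

# Capturing group: the alternation lands in the first slot, the food in the second
my ( $who, $food_second ) = $sentence =~ /([Cc]ornelia|my snake) eats (\w+)/;

# Non-capturing group: the food is the one and only capture
my ( $food ) = $sentence =~ /(?:[Cc]ornelia|my snake) eats (\w+)/;

print "Capturing:     who '$who', food '$food_second'\n";
print "Non-capturing: food '$food'\n";
```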

Perl has three special regex punctuation variables: $` , $& and $' . These are the pre, actual, and post match variables:

#!/usr/bin/perl
use strict;
use warnings;
my $string =  "Cornelia eats mice that I've thawed on the radiator";
$string    =~ /mice|mouse/;
print "PRE $`\nMATCH $&\nPOST $'\n";
PRE Cornelia eats
MATCH mice
POST that I've thawed on the radiator

Using these three variables can slow down your Perl program (particularly under older versions of perl), and they are almost unreadable, but use them if you must.

One last thing to do is to use what you've already matched, i.e. backreference within a regex. Say you want to find the first bold or italic word in an HTML document:

#!/usr/bin/perl
use strict;
use warnings;
my $html_input_file = shift @ARGV;
local $/ = undef;
    # this sets the local 'input separator' to nothing, so that reading
    # from the filehandle will slurp in an entire file, rather than a line at a time
open my $HTML, "<", $html_input_file
    or die "Bugger: can't open $html_input_file for reading: $!";
$_ = <$HTML>;
m{
    <(i|b)>
        # an <i> or <b> tag, captured into $1
    (.*?)
        # minimum number of any characters captured into $2
    </\1>
        # an </i> or </b>, depending on the opening tag
}sxi;
        # . matches \n, extended, case insensitive
print "$2\n";

The \1 allows the pattern to match the same something that would end up in $1, here 'b' or 'i'. This isn't written $1 like you'd expect (there is a good but technical reason). This regex (or some variation on it) looks like it will parse HTML. However, it is actually impossible to parse nested languages like HTML or XML without a more complex sort of grammar than can be provided by regexes. Getting around this problem can wait until a (much) later lesson on parsing.
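Backreferences are not only for tags. As a smaller sketch of the same idea (the sentence is invented for illustration), this looks for a word that is immediately repeated:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Paris in the the spring";    # hypothetical input with a doubled word

# (\w+) captures some word characters; \1 then insists on seeing
# exactly the same characters again after a single space
my ( $repeated ) = $text =~ /(\w+) \1/;

print "Doubled word: $repeated\n" if defined $repeated;
```

This is deliberately crude: it would also be fooled by something like "the then", since nothing in the pattern says the second copy has to be a whole word.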

Regexes can be used both directly, and stored for later use using the qr() operator. This q(uote) r(egex) operator is a simple way of keeping regexes and passing them around like strings:

#!/usr/bin/perl
use strict;
use warnings;
my $regex = qr/(?:milli|centi)pedes?/i;
my $text  = "Millipedes are cute. No really.";
print "Found something interesting\n" if $text =~ /$regex/;

You can use $regex wherever you'd usually use a regex (in a match, or a substitution), and you can pass it to subroutines, or use it as part of a larger regex. Note that any modifiers, like /i, are internally incorporated into the string and honoured. You can even print out the $regex as a string. How useful.
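As a sketch of using a stored regex as part of a larger one (the sentence and the variable names are made up), here a qr// object is dropped into a bigger pattern that also captures a number:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $creature = qr/(?:milli|centi)pedes?/i;
my $text     = "I counted 12 Centipedes under the log.";    # hypothetical input

# The outer regex has no /i of its own, but $creature carries its /i with it
my ( $count, $what ) = $text =~ /(\d+) \s+ ($creature)/x;

print "Found $count of them: $what\n" if defined $count;
```

The capital C in Centipedes still matches, because the /i travels with the stored regex rather than with the pattern it is embedded in.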

Summary

Substituting, splitting, grepping and mapping

Matching patterns is very useful, but often we want to do something more than just match things. What if you want to replace every occurrence of a certain thing with something else? This is the domain of the s/// and tr/// operators. s/// is the substitution operator, and tr/// is the transliteration operator. tr/// is useful for simple things:

#!/usr/bin/perl
use strict;
use warnings;
my $string =  "all lowercase with 5ome num8er5";
$string    =~ tr/a-z/A-Z/;
print $string;
ALL LOWERCASE WITH 5OME NUM8ER5

You just make a list on one side of the tr///, and a list on the other side (hyphens can be used to create natural ranges), and perl will map one lot to the other. The substitution operator is even more powerful and useful:

#!/usr/bin/perl
use strict;
use warnings;
$_ = "old M\$ dross";
s/old/new/i; # substitute the first occurrence of old with new, case insensitively
s/M\$/Microsoft/i;
s/dross/loveliness/i;
print; # did you forget print defaults to $_ ?
new Microsoft loveliness

In the second one, note you have to escape the $. This is because both pattern matching and substitution can interpolate variables:

#!/usr/bin/perl
use strict;
use warnings;
my $name   = "Cornelia";
my $string = "Cornelia is a corn snake.";
print "Matched $name\n" if $string =~ /$name/;
$string =~ s{$name}{My snake};
print $string;
Matched Cornelia
My snake is a corn snake.

Note that like m//, s/// and tr/// can use the usual 'any quotes you fancy', although avoid ? and ' , as they have a special significance. So:

s|A|B|;  # three the same
s(A){B}; # two pairs
s{A}|B|; # one pair, two the same

all work, although I'd only recommend the middle one. The s/// can take all the modifiers (/s, /x, /i) that m// can take, but it has another two of its own, /g and /e. /e is like a little eval (we will discuss eval later) that evaluates the substitution's right hand side, and /g means 'globally', i.e. do it to every match you find:

#!/usr/bin/perl
use strict;
use warnings;
my $string =  "2 3 4 5 6";
$string    =~ s/ (\d+) / 2 * $1 /xge; # double every number you match
print $string;
4 6 8 10 12

Clever eh? If you hadn't noticed, when you use a substitution with capture parentheses, the captures are in $1, etc., as usual, and you can use these on the right hand side of the s///. Of course, you can also use /g and /e separately. In fact, you can use /g on m// as well:

$_ = "2 3 4 5 6";
while ( /(\d+)/g ) { print "$1 times 2 is ", $1 * 2, "\n"; }
2 times 2 is 4
3 times 2 is 6
4 times 2 is 8
5 times 2 is 10
6 times 2 is 12

Here, the /g means 'keep matching till you run out of string'.
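If all you want is the matches themselves, a while loop isn't strictly necessary: in list context, a match with /g returns everything the parentheses captured, for every match in the string. A minimal sketch, with @numbers as an illustrative name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

$_ = "2 3 4 5 6";

# List context plus /g: collect every capture in one go
my @numbers = /(\d+)/g;

print "Found ", scalar( @numbers ), " numbers: @numbers\n";
```

The while loop is still the better choice when you want to do something at each match as you go, rather than gathering them all up first.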

There are several operators that use pattern matching of one sort or another. The first is split. The first argument is the regex you want to split the string on, and the second is the string to split (an optional third argument limits the number of pieces returned). You can capture the split bits in an array:

#!/usr/bin/perl
use strict;
use warnings;
my $string   = "A : colon:delimited: file: with: some : random :spaces";
my ( @bits ) = split /\s*:\s*/, $string;
    # splits on colons surrounded by optional spaces
print "$_\n" foreach @bits;
A
colon
delimited
file
with
some
random
spaces

The opposite of split is join, which has a similar syntax, only it expects not a regex as its first argument, but a string. So:

#!/usr/bin/perl
use strict;
use warnings;
my $joined = join "|", qw/one two three four five six/;
print $joined;
one|two|three|four|five|six

How about this:

#!/usr/bin/perl
use strict;
use warnings;
print join "|", reverse split /\s*:\s*/, 
    "A: colon: delimited  : file: with  :    spaces";
spaces|with|file|delimited|colon|A

Running list operators into each other like this a) is clever, but b) easily becomes unreadable. Caveat scriptor.

Another useful tool for regex is grep. This operator takes a regex as its first argument too, and a list of things to 'grep' as the rest. What is grepping? Well, grepping means 'returning the things that match from a list':

#!/usr/bin/perl
use strict;
use warnings;
my ( @names )     = qw/ Cornelia Atropos Lachetis Amber /;
my ( @match )     = grep   /^A/, @names;
my ( @not_match ) = grep ! /^A/, @names;
print "Start with A @match\nDon't @not_match\n";
Start with A Atropos Amber
Don't Cornelia Lachetis

See that you can make an anti-grep using the ! 'not' before a regex. The way grep actually works is by running through the list you give it, setting $_ to each item in turn. It then uses the regex to pattern match on $_, as usual. Only things that match are returned. grep is useful for finding lines in a file that match a certain pattern. It's another of those Perl operators that returns different values in scalar and list context. In list context (previous example) it returns the list of matches, but in scalar context:

my $number = grep /^A/, @names;

it returns the number of matches. grep can be heavily abused, syntactically speaking:

grep /regex/, LIST;
grep { /regex/ } ( LIST );

Both work the same, although I always use the latter, as it makes the condition more obvious, even though it's line noise for its own sake. This may vaguely remind you of sort.
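One thing the block form makes plain is that the condition doesn't have to be a regex at all: any expression that is true or false for $_ will do. A small sketch, reusing the snake names from above (the length test is just an invented example of a non-regex condition):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = qw/ Cornelia Atropos Lachetis Amber /;

# List context: the names that pass the test
my @long_names = grep { length( $_ ) > 6 } @names;

# Scalar context: just how many passed
my $a_count = grep { /^A/ } @names;

print "Long names: @long_names\n";
print "$a_count names start with A\n";
```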

One final operator before we leave regexes. map has nothing to do with regexes, but it has a similar syntax to grep (and to sort for that matter). I love map. There's nothing like it for bringing out the mathematician in you. map needs a block of code that does something to $_, followed by a list, just like grep. map then runs through the list, aliasing $_ to each value in turn, so you can torture it with the block of code:

@mapped = map { DO_SOMETHING_TO $_ } ( LIST );

So:

#!/usr/bin/perl
use strict;
use warnings;
my @doubled = map { 2 * $_ } ( qw/ 2 4 6 8 10 / );
print "@doubled";
4 8 12 16 20

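This behaves as if the block were a tiny subroutine with an explicit return:

#!/usr/bin/perl
use strict;
use warnings;
sub double_it { return 2 * $_ }    # $_ is still the current list item here
my @doubled = map { double_it() } ( qw/ 2 4 6 8 10 / );
print "@doubled";

in case you were wondering: blocks return the last thing they evaluated, so no return statement is needed. In fact, you shouldn't put a return directly inside a map block: return belongs to subroutines, so it would try to leave the enclosing subroutine (or die with "Can't return outside a subroutine" if there isn't one).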

Dull? Yes. But how about:

#!/usr/bin/perl
use strict;
use warnings;
my @selective_doubles =
    map { /[24680]$/ ? ( 2 * $_ ) : $_ } ( qw/ 1 2 3 4 5 6 7 8 / );
print "@selective_doubles";
1 4 3 8 5 12 7 16

which returns a list of numbers that have been doubled iff (if and only if) they are even.

One word of warning for both grep and map. $_ is not a copy of the data in the list you feed to these functions, it's an alias to the actual values of the list. That means that if you modify $_ itself, rather than just returning it, you will alter the items in the list fed to grep or map, not just the items in the returned list. This may be what you want, but probably isn't:

#!/usr/bin/perl
use strict;
use warnings;
my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns = map { s/A//gi; } ( @original );
print "afterward: @original\nreturned: @returns\n";
original: Abacus chocolate sprite
afterward: bcus chocolte sprite
returned: 2 1

You may be wondering what the hell has happened. Well, firstly, the actual members of @original have been altered, because s/// messes with $_ directly. Hence all the A characters have been stripped. The s/// operator returns the number of substitutions it made, hence @returns contains 2 (Abacus), 1 (chocolate) and an empty string (sprite contains no /A/i, so s/// returns false rather than a count). If you remember that a map is basically a foreach loop:

my @mapped = map { DO_SOMETHING_TO $_ } ( LIST );

and

my @mapped;
foreach ( LIST )
{
    my $return_value = DO_SOMETHING_TO $_;
    push @mapped, $return_value;
}

are the same thing, you'll be fine. As long as you remember that altering the value of $_ in a foreach loop indirectly alters the original value in the LIST, that is! Go on, try writing the s/// map as a foreach loop, and you'll see what I mean.

#!/usr/bin/perl
use strict;
use warnings;
my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns;
foreach ( @original )
{
    my $return_value = s/A//gi;
    push @returns, $return_value;
}
print "afterward: @original\nreturned: @returns\n";

Told you so. What you probably need in this case is a temporary variable:

#!/usr/bin/perl
use strict;
use warnings;
my @original = qw/Abacus chocolate sprite/;
print "original: @original\n";
my @returns = map { my $tmp = $_; $tmp =~ s/A//gi; $tmp; } ( @original );
print "afterward: @original\nreturned: @returns\n";
original: Abacus chocolate sprite
afterward: Abacus chocolate sprite
returned: bcus chocolte sprite
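If your perl is version 5.14 or later (an assumption: older perls will reject this), the s/// operator has an /r modifier that does the temporary-variable dance for you: it leaves $_ alone and returns the modified copy instead of a count. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @original = qw/Abacus chocolate sprite/;

# /r means 'return the result': $_ (and so @original) is left untouched
my @returns = map { s/A//gir } ( @original );

print "afterward: @original\nreturned: @returns\n";
```

The explicit temporary variable is still worth knowing, as it works on every version of perl and for operations that have no equivalent of /r.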

Summary

The s/// operator acts like the m// operator, but selectively substitutes text. The tr/// operator is quicker and easier for simple substitutions. The syntax of the new list operators is:

@splat = split /\s/, $splitee;
@junt  = join '+', @joinees;
@mup   = map { $_ * 2 } @mappees;
@grap  = grep { /\d+/ } @grepees;
@argh  = map { "IP: " . join '.', split /:/ }
           grep { /^\d{1,3}:\d{1,3}:\d{1,3}:\d{1,3}$/ }
             ( @ip );

Test yourself

See if you can write a script that does the following:

#!/usr/bin/perl
use strict;
use warnings;
local $/ = undef; # slurp mode
open my $FILE, "<", "lesson06.html" or die "Can't open file for reading: $!\n";
$_ = <$FILE>; # so we can default match on $_
my ( $keywords ) =
  /<meta \s+ name \s* = \s* "keywords" \s+ content \s* = \s* "([^"]+?)" /sx;
my @keywords = split /\s*,\s*/, $keywords;
print map { ucfirst "$_\n" } @keywords;
print "I counted regex ", scalar( grep {/regex/i} @keywords ), " times\n";
