PMotD: Text::Balanced

July 31, 2009

If you’ve read Jeffrey Friedl’s Mastering Regular Expressions (and if you haven’t, you really should: it is a bit light on plot and character development, but it is definitely required reading), you’ll recall that one of the things that is a lot harder than it looks to get right using regexps is extracting the content from within a delimited string, especially when there are distinct opening and closing delimiters. (The other is validating email addresses, but that’s a post for another day.)

Text::Balanced, originally by Damian Conway and now maintained by Adam Kennedy, makes short work of this extraction for:

  • strings delimited by the same single character
  • strings delimited by brackets of some sort (parens, square brackets, etc.)
  • strings delimited by XML or other kinds of tags
  • strings delimited by Perl quoting operators
  • code blocks

Each kind of delimited string has its own function, e.g. extract_bracketed(). For the most part, they all work like this:

my ($extracted,$remainder) = 
        extract_something($extract_from, $delimiter, $opt_prefix);

Here extract_something() stands in for any of the extraction functions. It takes the string to extract the data from, the delimiter that surrounds the data in question, and an optional prefix pattern. That last argument reveals a part of Text::Balanced that tends to confuse people, so we’ll look more at it in a second. In a list context, the extract_something() functions return the first extracted piece of data (delimiters included) followed by what was left over; in a scalar context they return just the extracted data (with, alas, another twist).
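To make the calling convention concrete, here is a small sketch of three of the extraction functions called in list context. The sample strings and variable names are made up for illustration; only the function names come from the module itself.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_delimited extract_tagged);

# Brackets: nesting is handled, and the delimiters stay on the extracted text.
my $expr = '(a (nested) group) and the rest';
my ($group, $after_group) = extract_bracketed($expr, '()');
print "bracketed: $group\n";
print "remainder: '$after_group'\n";

# The same single character on both sides.
my $line = '"hello world" trailing text';
my ($quoted, $after_quoted) = extract_delimited($line, '"');
print "delimited: $quoted\n";

# Tags: called with no tag arguments, extract_tagged() looks for an
# XML-style opening tag and its matching closing tag.
my $markup = '<b>bold</b> plain';
my ($tagged, $after_tagged) = extract_tagged($markup);
print "tagged: $tagged\n";
```

In list context the input variables are left untouched, so $expr, $line and $markup still hold their full original strings afterwards.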

There are two things about this module that tend to trip up newcomers:

  1. By default (i.e. without the optional third argument), the extraction functions all expect the data they are going to extract to be found either right at the beginning of the string or right at the position in the string where the last extraction left off. If you expect them to skip over non-delimited data that isn’t just whitespace, you will have to provide that third $opt_prefix argument, a pattern describing the text to skip.
  2. When called in a scalar (or void) context, the extraction functions eat the text: extracted strings are removed from the input text. Fans of functional or immutable data structure programming will not be pleased.
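Both gotchas are easy to see in a few lines. This is only a sketch: the sample strings, the variable names and the '[^(]*' prefix pattern are my own choices for illustration, not anything the module prescribes.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Gotcha 1: with no prefix argument only leading whitespace may be skipped,
# so this extraction fails because the string starts with "total".
my $text = 'total = (1 + (2 * 3)) units';
my ($no_prefix) = extract_bracketed($text, '()');
print defined $no_prefix ? "found $no_prefix\n" : "nothing found\n";

# A prefix pattern that skips everything up to the first "(" fixes that.
# The skipped prefix comes back as the third element of the list.
my $same_text = 'total = (1 + (2 * 3)) units';
my ($found, $rest, $skipped) = extract_bracketed($same_text, '()', '[^(]*');
print "found:   $found\n";
print "skipped: '$skipped'\n";

# Gotcha 2: in scalar context each extracted string is removed from $input,
# so the loop ends once no bracketed group is left at the front.
my $input = '(first) (second) tail';
my @groups;
while (defined(my $group = extract_bracketed($input, '()'))) {
    push @groups, $group;
}
print "groups: @groups\n";
print "input is now: '$input'\n";
```

If you want to keep the original string around, call the function in list context or work on a copy.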

If you do want to do “split() on steroids” kind of stuff, Text::Balanced also offers an extract_multiple() function that takes a list of extraction functions, each of which gets run over the string, returning what they collectively find. Text::Balanced can also make Friedl proud: its gen_delimited_pat() function generates an optimized regular expression for delimited strings that you can use with a minimum of head scratching.
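Here is a rough sketch of both ideas. I am assuming gen_delimited_pat() is the regex generator meant here, and the sample strings and variable names are invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed gen_delimited_pat);

my $source = 'call f(a, b) then g(c (d)) done';

# A single extractor in the list. The true fourth argument asks
# extract_multiple() to drop the text between matches rather than
# returning it as extra fields.
my @groups = extract_multiple(
    $source,
    [ sub { extract_bracketed($_[0], '()') } ],
    undef,
    1,
);
print "group: $_\n" for @groups;

# gen_delimited_pat() returns a pattern string for the given delimiter
# characters, ready to be interpolated into an ordinary match.
my $pattern = gen_delimited_pat(q{"});
my $log     = 'say "hello world" and "bye"';
my @quoted  = $log =~ /($pattern)/g;
print "quoted: $_\n" for @quoted;
```

Wrapping extract_bracketed() in an anonymous sub is just a way to pass it the '()' delimiters; the extractor receives the text being scanned as $_[0].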

If you need something to extract data from between matched delimiters of almost any sort, this module will be spot on for you and will do its job like a laser beam.
