Drawing Regex

Regular Expressions aren't everyone's best friend. Writing a regex can take up a good part of the day and reading a complicated regex is worse. Not to mention debugging. On top of that many regexes are coded by copy-pasting examples from the web and fiddling with them until they give the right result. Now if all coders wrote full and complete test cases for their code this wouldn't be a problem, unfortunately regexes are most used in the kind of hacky environments where this is almost never the case.

But what makes regexes so hard that experienced programmers need to resort to trial-and-error?

Consider the following prime example:

<?php
if (!ereg("^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*$"$local)) return false;
?>

Makes perfect sense right? It's an example from my email address class that tries to verify if a string is a dot-atom: a series of words separated by dots. The words can only contain certain characters and that gives rise to the horrible character class [A-Za-z0-9!#$%&'*+/=?^_`{|}~-]. The initial ^ character indicates the start of the string and the dollar sign at the end represents the string's end. This means the entire string must follow the given pattern to be a valid match.

Simplified, it looks like this::

<?php
START_OF_STRING 
[x]+ (\.[x]+)* END_OF_STRING
?>

Already it makes a bit more sense. Seen like this it's not terribly hard to understand. In fact, most confusion in reading regex stems from their syntax and the additional complications of writing them within a string of another programming language.

Here's a drawing of the same expression:

email regex

Starting from the white dot on the left the only valid next character is one from the character class [x]. From here we have three options: move to the black dot (the end of the string), add another [x] character, or insert a dot. Once we've added a dot we have no other choice but to add another word character and from here we get the same three choices we had before. In fact, the two states are so similar we can compress the diagram as follows:

compact email regex

This is simpler, but doesn't correspond quite as well to our original regex. While it's easy to translate the first diagram back into a regular expression the second one takes a little extra effort.

Another example

The following example shows how to create a regex that parses shell commands and arguments.

The easiest possible example is a one word command. We start at the start of the string, next up is any number of word characters (symbolised by 'w') followed by the string's end:

The regex corresponding to this diagram is \A[\w]+\Z, where \A and \Z denote the start and end of the string.

Now let's be a little more lenient and allow whitespace (written as 's') before and after the command:

The corresponding regex now becomes: \A[\s]*[\w]+[\s]*\Z. The next step is to allow parameters. With a little ingeniuity this can be written as:

But now we run into trouble trying to find the regex translation. To solve it we sacrifice the compactness of the diagram and create the following:

The little box here is a virtual node, a little point where logical flows come together that doesn't correspond to any part of the string. The intuition expressed by this diagram is that the first word is followed either by zero or more parameters, possibly followed by whitespace and ended with an 'end-of-string'. Each parameter must be prefixed by at least a single whitespace. The resulting regex is \A[\s]*[\w]+([\s]+[\w]+)*[\s]*\Z.

And the point is..?

Given a regex and the task to undertand what it says, these drawings are an easy path to success. Reading the regex from left to right you can draw a schematic of what's going on that doesn't depend on syntax or the quirks of whatever programming language the regex is in. Doing this it becomes possible to analyse large regexes, and gain an understanding of what they will allow to pass. Any string you can construct in such a diagram will be accepted by the regex so any invalid matches you find in the diagram will correspond to bugs in the regex. Similarly any string you are unable to produce from the diagram won't be matched by the regex (although it pays to double check in these situations).

When writing regular expressions you can use these diagrams to think out their structure, independent of regex syntax and programming language quirks. Instead of typing trial-and-error regexes directly into your browser you can use the following approach:

  1. Create a diagram. Start at the beginning and follow the simplest path to the pattern's end. Next add any subroutes you can think of, until all possible matches are captured in the diagram.
  2. Translate the drawing into a regex, using whatever regex syntax you feel comfortable with.
  3. Translate the regex into the format required by the environment you're programming in.

This approach avoids having to deal with programming quirks or complicated syntax at the design stage and adds some structure to the way you develop your expressions.

Conclusion

In conclusion making little diagrams is a great way to analyse and debug regular expressions, although creating new ones still requires a feel for what regexes can and cannot do. The main point to take away here is that graphics are easier to deal with than long strings of crazy syntax.

Nov 1st, 2009

Comments

No comments yet! Feel free to post some using the form below.

Post your comments here

If you wish to add code to your comment you can use code tags, like this: <code class="php">yourCodeHere</code>.
Quite a large number of languages are supported, although I can't guarantee it'll be pretty. Inside the code tags you can use any characters except for the string "</code>".