Java Language Pitfall - Efficiency concerns with regular expressions


Example

Regular expression matching is a powerful tool (in Java, and in other contexts) but it does have some drawbacks. One of these that regular expressions tends to be rather expensive.

Pattern and Matcher instances should be reused

Consider the following example:

/**
 * Test if all strings in a list consist of English letters and numbers.
 * @param strings the list to be checked
 * @return 'true' if an only if all strings satisfy the criteria
 * @throws NullPointerException if 'strings' is 'null' or a 'null' element.
 */
public boolean allAlphanumeric(List<String> strings) {
    for (String s : strings) {
        if (!s.matches("[A-Za-z0-9]*")) {
            return false;
        }  
    }
    return true;
}

This code is correct, but it is inefficient. The problem is in the matches(...) call. Under the hood, s.matches("[A-Za-z0-9]*") is equivalent to this:

Pattern.matches(s, "[A-Za-z0-9]*")

which is in turn equivalent to

Pattern.compile("[A-Za-z0-9]*").matcher(s).matches()

The Pattern.compile("[A-Za-z0-9]*") call parses the regular expression, analyze it, and construct a Pattern object that holds the data structure that will be used by the regex engine. This is a non-trivial computation. Then a Matcher object is created to wrap the s argument. Finally we call match() to do the actual pattern matching.

The problem is that this work is all repeated for each loop iteration. The solution is to restructure the code as follows:

private static Pattern ALPHA_NUMERIC = Pattern.compile("[A-Za-z0-9]*");

public boolean allAlphanumeric(List<String> strings) {
    Matcher matcher = ALPHA_NUMERIC.matcher("");
    for (String s : strings) {
        matcher.reset(s);
        if (!matcher.matches()) {
            return false;
        }  
    }
    return true;
}

Note that the javadoc for Pattern states:

Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.

Don't use match() when you should use find()

Suppose you want to test if a string s contains three or more digits in a row. You cn express this in various ways including:

  if (s.matches(".*[0-9]{3}.*")) {
      System.out.println("matches");
  }

or

  if (Pattern.compile("[0-9]{3}").matcher(s).find()) {
      System.out.println("matches");
  }

The first one is more concise, but it is also likely to be less efficient. On the face of it, the first version is going to try to match the entire string against the pattern. Furthermore, since ".*" is a "greedy" pattern, the pattern matcher is likely to advance "eagerly" try to the end of the string, and backtrack until it finds a match.

By contrast, the second version will search from left to right and will stop searching as soon as it finds the 3 digits in a row.

Use more efficient alternatives to regular expressions

Regular expressions are a powerful tool, but they should not be your only tool. A lot of tasks can be done more efficiently in other ways. For example:

 Pattern.compile("ABC").matcher(s).find()

does the same thing as:

 s.contains("ABC")

except that the latter is a lot more efficient. (Even if you can amortize the cost of compiling the regular expression.)

Often, the non-regex form is more complicated. For example, the test performed by the matches() call the earlier allAlplanumeric method can be rewritten as:

 public boolean matches(String s) {
     for (char c : s) {
         if ((c >= 'A' && c <= 'Z') ||
             (c >= 'a' && c <= 'z') ||
             (c >= '0' && c <= '9')) {
              return false;
         }
     }
     return true;
 }

Now that is more code than using a Matcher, but it is also going to be significantly faster.

Catastrophic Backtracking

(This is potentially a problem with all implementations of regular expressions, but we will mention it here because it is a pitfall for Pattern usage.)

Consider this (contrived) example:

Pattern pat = Pattern.compile("(A+)+B");
System.out.println(pat.matcher("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB").matches());
System.out.println(pat.matcher("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC").matches());

The first println call will quickly print true. The second one will print false. Eventually. Indeed, if you experiment with the code above, you will see that each time you add an A before the C, the time take will double.

This is behavior is an example of catastrophic backtracking. The pattern matching engine that implements the regex matching is fruitlessly trying all of the possible ways that the pattern might match.

Let us look at what (A+)+B actually means. Superficially, it seems to say "one or more A characters followed by a B value", but in reality it says one or more groups, each of which consists of one or more A characters. So, for example:

  • 'AB' matches one way only: '(A)B'
  • 'AAB' matches two ways: '(AA)B' or '(A)(A)B`
  • 'AAAB' matches four ways: '(AAA)B' or '(AA)(A)Bor '(A)(AA)B or '(A)(A)(A)B`
  • and so on

In other words, the number of possible matches is 2N where N is the number of A characters.

The above example is clearly contrived, but patterns that exhibit this kind of performance characteristics (i.e. O(2^N) or O(N^K) for a large K) arise frequently when ill-considered regular expressions are used. There are some standard remedies:

  • Avoid nesting repeating patterns within other repeating patterns.
  • Avoid using too many repeating patterns.
  • Use non-backtracking repetition as appropriate.
  • Don't use regexes for complicated parsing tasks. (Write a proper parser instead.)

Finally, beware of situations where a user or an API client can supply a regex string with pathological characteristics. That can lead to accidental or deliberate "denial of service".

References: