Regular expression matching is a powerful tool (in Java, and in other contexts) but it does have some drawbacks. One of these that regular expressions tends to be rather expensive.
Consider the following example:
/**
* Test if all strings in a list consist of English letters and numbers.
* @param strings the list to be checked
* @return 'true' if an only if all strings satisfy the criteria
* @throws NullPointerException if 'strings' is 'null' or a 'null' element.
*/
public boolean allAlphanumeric(List<String> strings) {
for (String s : strings) {
if (!s.matches("[A-Za-z0-9]*")) {
return false;
}
}
return true;
}
This code is correct, but it is inefficient. The problem is in the matches(...)
call. Under the hood, s.matches("[A-Za-z0-9]*")
is equivalent to this:
Pattern.matches(s, "[A-Za-z0-9]*")
which is in turn equivalent to
Pattern.compile("[A-Za-z0-9]*").matcher(s).matches()
The Pattern.compile("[A-Za-z0-9]*")
call parses the regular expression, analyze it, and construct a Pattern
object that holds the data structure that will be used by the regex engine. This is a non-trivial computation. Then a Matcher
object is created to wrap the s
argument. Finally we call match()
to do the actual pattern matching.
The problem is that this work is all repeated for each loop iteration. The solution is to restructure the code as follows:
private static Pattern ALPHA_NUMERIC = Pattern.compile("[A-Za-z0-9]*");
public boolean allAlphanumeric(List<String> strings) {
Matcher matcher = ALPHA_NUMERIC.matcher("");
for (String s : strings) {
matcher.reset(s);
if (!matcher.matches()) {
return false;
}
}
return true;
}
Note that the javadoc for Pattern
states:
Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the
Matcher
class are not safe for such use.
Suppose you want to test if a string s
contains three or more digits in a row. You cn express this in various ways including:
if (s.matches(".*[0-9]{3}.*")) {
System.out.println("matches");
}
or
if (Pattern.compile("[0-9]{3}").matcher(s).find()) {
System.out.println("matches");
}
The first one is more concise, but it is also likely to be less efficient. On the face of it, the first version is going to try to match the entire string against the pattern. Furthermore, since ".*" is a "greedy" pattern, the pattern matcher is likely to advance "eagerly" try to the end of the string, and backtrack until it finds a match.
By contrast, the second version will search from left to right and will stop searching as soon as it finds the 3 digits in a row.
Regular expressions are a powerful tool, but they should not be your only tool. A lot of tasks can be done more efficiently in other ways. For example:
Pattern.compile("ABC").matcher(s).find()
does the same thing as:
s.contains("ABC")
except that the latter is a lot more efficient. (Even if you can amortize the cost of compiling the regular expression.)
Often, the non-regex form is more complicated. For example, the test performed by the matches()
call the earlier allAlplanumeric
method can be rewritten as:
public boolean matches(String s) {
for (char c : s) {
if ((c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
(c >= '0' && c <= '9')) {
return false;
}
}
return true;
}
Now that is more code than using a Matcher
, but it is also going to be significantly faster.
(This is potentially a problem with all implementations of regular expressions, but we will mention it here because it is a pitfall for Pattern
usage.)
Consider this (contrived) example:
Pattern pat = Pattern.compile("(A+)+B");
System.out.println(pat.matcher("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB").matches());
System.out.println(pat.matcher("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC").matches());
The first println
call will quickly print true
. The second one will print false
. Eventually. Indeed, if you experiment with the code above, you will see that each time you add an A
before the C
, the time take will double.
This is behavior is an example of catastrophic backtracking. The pattern matching engine that implements the regex matching is fruitlessly trying all of the possible ways that the pattern might match.
Let us look at what (A+)+B
actually means. Superficially, it seems to say "one or more A
characters followed by a B
value", but in reality it says one or more groups, each of which consists of one or more A
characters. So, for example:
or '(A)(AA)B
or '(A)(A)(A)B`In other words, the number of possible matches is 2N where N is the number of A
characters.
The above example is clearly contrived, but patterns that exhibit this kind of performance characteristics (i.e. O(2^N)
or O(N^K)
for a large K
) arise frequently when ill-considered regular expressions are used. There are some standard remedies:
Finally, beware of situations where a user or an API client can supply a regex string with pathological characteristics. That can lead to accidental or deliberate "denial of service".
References: