Regular Expressions Why can backtracking be a trap?


Example

Backtracking can be caused by optional quantifiers or alternation constructs, because the regex engine will try to explore every path. If you run the regex a+b against aaaaaaaaaaaaaa there is no match and the engine will find it pretty fast.

But if you change the regex to (aa*)+b the number of combinations will grow pretty fast, and most (not optimized) engines will try to explore all the paths and will take an eternity to try to find a match or throw a timeout exception. This is called catastrophic backtracking.

Of course, (aa*)+b seems a newbie error but it's here to illustrate the point and sometimes you'll end up with the same issue but with more complicated patterns.

A more extreme case of catastrophic backtracking occurs with the regex (x+x+)+y (you've probably seen it before here and here), which needs exponential time to figure out that a string that contains xs and nothing else (e.g xxxxxxxxxxxxxxxxxxxx) don't match it.

How to avoid it?

Be as specific as possible, reduce as much as possible the possible paths. Note that some regex matchers are not vulnerable to backtracking, such as those included in awk or grep because they are based on Thompson NFA.