Regular Expressions: Ambiguous Backreferences


Problem: You need to match text of a certain format, for example:

4 g 0

That's a digit, a separator (one of -, /, or a space), a letter, the same separator, and a zero.

Naïve solution: Adapting the regex from the Basics example, you come up with this regex:

[0-9]([-/ ])[a-z]\10

But that probably won't work. Most regex flavors support more than nine capturing groups, and very few of them are smart enough to realize that, since there's only one capturing group, \10 must be a backreference to group 1 followed by a literal 0. Most flavors will treat it as a backreference to group 10. A few of those will throw an exception because there is no group 10; the rest will simply fail to match. Some flavors (PCRE, for example) instead fall back to reading \10 as an octal character escape, which also fails to match the intended text.
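To see the failure concretely, here is a minimal sketch using Python's re module, picked only for illustration since the text above doesn't name a flavor. Python is one of the flavors that throws: it reads \10 as a reference to group 10 and refuses to compile the pattern. The helper name compiles is made up for this example.

```python
import re

# The naive pattern from above: one capturing group, then \10.
NAIVE = r"[0-9]([-/ ])[a-z]\10"


def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Python parses \10 as "group 10", which does not exist, so compilation fails.
naive_ok = compiles(NAIVE)
print(naive_ok)
```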

There are several ways to avoid this problem. One is to use named groups (and named backreferences):

[0-9](?<sep>[-/ ])[a-z]\k<sep>0
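Named-group syntax varies between flavors. As a sketch, assuming Python's re module, the group is written (?P<sep>...) and the backreference (?P=sep), rather than the \k<sep> form shown above. The variable names here are illustrative only.

```python
import re

# Python spelling of the named-group pattern:
# (?P<sep>...) defines the group, (?P=sep) refers back to it.
NAMED = re.compile(r"[0-9](?P<sep>[-/ ])[a-z](?P=sep)0")

match = NAMED.fullmatch("4 g 0")      # same separator twice
mismatch = NAMED.fullmatch("4-g/0")   # two different separators

print(match is not None, mismatch is not None)
```

Because the backreference is addressed by name, the literal 0 that follows can't be mistaken for part of a group number.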

If your regex flavor supports it, the \g{n} syntax (where n is the group number) encloses the backreference number in curly brackets, separating it from any digits that follow:

[0-9]([-/ ])[a-z]\g{1}0

Another way is to use extended (free-spacing) mode, separating the elements with insignificant whitespace. In Java you'll need to escape the space inside the brackets, because its free-spacing mode ignores whitespace even within a character class:

(?x) [0-9] ([-/ ]) [a-z] \1 0
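A minimal sketch of the same idea, assuming Python's re module, where the re.VERBOSE flag plays the role of (?x). Python keeps whitespace inside a character class even in verbose mode, so the space in the brackets needs no escaping there. The constant name SPACED is illustrative.

```python
import re

# In verbose mode the space between \1 and 0 is ignored by the parser,
# but it still ends the backreference number, so \1 stays a reference to group 1.
SPACED = re.compile(r"[0-9] ([-/ ]) [a-z] \1 0", re.VERBOSE)

space_sep = SPACED.fullmatch("4 g 0")
dash_sep = SPACED.fullmatch("4-g-0")
mixed_sep = SPACED.fullmatch("4-g/0")

print(bool(space_sep), bool(dash_sep), bool(mixed_sep))
```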

If your regex flavor doesn't support those features, you can add unnecessary but harmless syntax, like a non-capturing group:

[0-9]([-/ ])[a-z](?:\1)0

...or a dummy quantifier (this is possibly the only circumstance in which {1} is useful):

[0-9]([-/ ])[a-z]\1{1}0
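Both workarounds can be checked quickly. The sketch below assumes Python's re module, and the sample strings are made up for the test. It compiles each variant and records which inputs match in full.

```python
import re

# Either form keeps the parser from reading the 0 as part of the group number.
WRAPPED = re.compile(r"[0-9]([-/ ])[a-z](?:\1)0")    # backreference inside a non-capturing group
QUANTIFIED = re.compile(r"[0-9]([-/ ])[a-z]\1{1}0")  # backreference followed by a {1} quantifier

samples = ["4 g 0", "4-g-0", "4/g/0", "4-g/0", "4 g 1"]

# Map each sample to a pair: (matched by WRAPPED, matched by QUANTIFIED).
results = {
    s: (bool(WRAPPED.fullmatch(s)), bool(QUANTIFIED.fullmatch(s)))
    for s in samples
}

for sample, outcome in results.items():
    print(repr(sample), outcome)
```

The two patterns should agree on every input. Only the strings that repeat the same separator and end in 0 are expected to match.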