Regular Expressions Named Capture Groups


Example

Some regular expression flavors support named capture groups. Instead of referring to a group by its numerical index, you can refer to it by name: in backreferences, in the replacement pattern, and in the program code that processes the match.

Numerical indexes change whenever groups are added, removed, or rearranged, so code that relies on them is more brittle.

For example, to match a word (\w+) enclosed in either single or double quotes (['"]), we could use:

(?<quote>['"])\w+\k{quote}

This is equivalent to:

(['"])\w+\1

In a simple case like this, a regular numbered capturing group has no drawbacks.

In more complex situations, named groups make the structure of the expression more apparent to the reader, which improves maintainability.
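
As a minimal sketch, the quoted-word pattern above can be tried in Python. Note that Python's re module uses the (?P<name>...) and (?P=name) spellings rather than (?<name>...) and \k{name}. The names QUOTED_WORD and is_quoted_word are illustrative only.

```python
import re

# Python's `re` module spells the named group (?P<name>...) and the
# named backreference (?P=name).
QUOTED_WORD = re.compile(r"""(?P<quote>['"])\w+(?P=quote)""")


def is_quoted_word(text: str) -> bool:
    """Return True if text is one word wrapped in matching quotes."""
    return QUOTED_WORD.fullmatch(text) is not None


# The captured quote character can be read by name or by index.
match = QUOTED_WORD.search("say 'hello' now")
opening_quote = match.group("quote")  # same value as match.group(1)
```

Because the backreference must repeat whatever the quote group captured, a word that opens with a single quote and closes with a double quote does not match.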

Log file parsing is an example of a more complex situation that benefits from group names. This is the Apache Common Log Format (CLF):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

The following expression captures the parts into named groups:

(?<ip>\S+) (?<logname>\S+) (?<user>\S+) (?<time>\[[^]]+\]) (?<request>"[^"]+") (?<status>\S+) (?<bytes>\S+)
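
A sketch of how this expression might be used from Python, assuming the standard re module. The group syntax is changed to Python's (?P<name>...) form and the closing bracket inside the character class is escaped for clarity. CLF_PATTERN and fields are illustrative names.

```python
import re

# Same structure as the expression above, in Python's named-group syntax.
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'(?P<time>\[[^\]]+\]) (?P<request>"[^"]+") '
    r'(?P<status>\S+) (?P<bytes>\S+)'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

# groupdict() returns every named group keyed by its name, so the calling
# code never depends on the position of a group in the expression.
fields = CLF_PATTERN.fullmatch(line).groupdict()
```

Adding another group to the expression later would not disturb lookups such as fields["status"], whereas index-based lookups would have to be renumbered.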

The syntax depends on the flavor. Common forms for defining a named group are:

  • (?<name>...)
  • (?'name'...)
  • (?P<name>...)

Backreferences:

  • \k<name>
  • \k{name}
  • \k'name'
  • \g{name}
  • (?P=name)
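
Replacement patterns often use yet another spelling. As a sketch, Python's re module refers to a named group in a replacement string as \g<name>. The date pattern and the function name to_day_first below are hypothetical, chosen only to illustrate the reference.

```python
import re

# Hypothetical pattern: an ISO-style date with three named groups.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")


def to_day_first(text: str) -> str:
    """Rewrite YYYY-MM-DD dates as DD/MM/YYYY using named references."""
    return DATE.sub(r"\g<day>/\g<month>/\g<year>", text)


converted = to_day_first("released 2000-10-09")
```

The replacement string stays valid even if the groups in the pattern are reordered, which would not be true of \1, \2, and \3.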

In the .NET flavor, several groups can share the same name; their captures are collected on the same capture stack.

In PCRE, duplicate names must be enabled explicitly, either with the (?J) modifier (PCRE_DUPNAMES) or by using a branch reset group (?|...). Only the last captured value is accessible, though.

(?J)(?<a>...)(?<a>...)
(?|(?<a>...)|(?<a>...))
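
Not every flavor accepts duplicate names. As a point of contrast, the following sketch shows that Python's standard re module rejects a pattern that defines the same group name twice. The helper name compiles is illustrative.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Reusing the name `a` is a compile-time error in Python's re module.
duplicate_ok = compiles(r"(?P<a>x)|(?P<a>y)")
# Distinct names compile without complaint.
distinct_ok = compiles(r"(?P<a>x)|(?P<b>y)")
```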