Regular Expressions: Basic Capture Groups


A group is a section of a regular expression enclosed in parentheses (). It is commonly called a "sub-expression" and serves two purposes:

  • It turns the sub-expression into a single unit, i.e. it will match, fail or repeat as a whole.
  • The portion of text it matched is accessible in the remainder of the expression and the rest of the program.

Groups are numbered in regex engines, starting with 1. Traditionally, the maximum group number is 9, but many modern regex flavors support higher group counts. Group 0 always holds the entire match, the same way surrounding the entire regex with parentheses would.

The ordinal number increases with each opening parenthesis, regardless of whether the groups are placed one-after-another or nested:

foo(bar(baz)?) (qux)+|(bla)
   1   2       3      4

groups and their numbers
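
As a minimal sketch of this numbering, the snippet below runs the expression above through Python's standard re module. The input string "foobarbaz qux" and the variable names are assumptions chosen for illustration; other regex flavors expose groups through different APIs.

```python
import re

# The expression from above; it contains four opening parentheses.
pattern = re.compile(r"foo(bar(baz)?) (qux)+|(bla)")

# Number of capture groups, not counting group 0.
group_count = pattern.groups

# Hypothetical input that takes the first alternative.
m = pattern.search("foobarbaz qux")

# Collect every group by its number, including group 0 (the whole match).
numbered = {n: m.group(n) for n in range(group_count + 1)}

for n, text in numbered.items():
    print(n, repr(text))
```

Group 1 contains the nested group 2, so its text ("barbaz") includes the text of group 2 ("baz"). The numbers follow the order of the opening parentheses, not the nesting depth.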

After an expression achieves an overall match, all of its groups are present in the result, whether or not a particular group actually matched anything.

A group can be optional, like (baz)? above, or in an alternative part of the expression that was not used in the match, like (bla) above. In these cases, non-matching groups simply won't contain any information.
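
How an empty group is reported depends on the regex flavor. The following sketch assumes Python's re module, which reports a group that did not take part in the match as None; the two input strings are made up for illustration.

```python
import re

pattern = re.compile(r"foo(bar(baz)?) (qux)+|(bla)")

# First alternative, with the optional (baz)? skipped.
first = pattern.fullmatch("foobar qux")
first_groups = first.groups()

# Second alternative: groups 1 to 3 never take part in the match.
second = pattern.fullmatch("bla")
second_groups = second.groups()

# groups() accepts a default value to substitute for groups that did not match.
second_with_default = second.groups(default="")

print(first_groups)         # ('bar', None, 'qux', None)
print(second_groups)        # (None, None, None, 'bla')
print(second_with_default)  # ('', '', '', 'bla')
```

In both cases the result still has four group slots, which is why code that reads a group should be prepared for an empty value.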

If a quantifier is placed after a group, like in (qux)+ above, the overall group count of the expression stays the same. If a group matches more than once, its content will be that of the last occurrence. However, some regex flavors (.NET, for example) allow access to all sub-match occurrences.
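
The "last occurrence" behaviour can be seen in a small sketch using Python's standard re module, which keeps only the final occurrence of a repeated group. The pattern (\d)+ and the input "123" are assumptions picked so that each repetition captures different text. The second half shows one possible workaround, not the only one: capture the whole repeated run once, then scan it separately.

```python
import re

# A quantified group: one group number, three repetitions.
repeated = re.compile(r"(\d)+")
m = repeated.fullmatch("123")

group_count = repeated.groups  # still a single group
last = m.group(1)              # only the last repetition is kept

# Workaround sketch: capture the whole run, then find each item in it.
run = re.fullmatch(r"(\d+)", "123").group(1)
all_digits = re.findall(r"\d", run)

print(group_count, last, all_digits)  # 1 3 ['1', '2', '3']
```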

If you wished to retrieve the date and error level of a log entry like this one:

2012-06-06 12:12.014 ERROR: Failed to connect to remote end

You could use something like this:

^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}\.\d{3} (\w*): .*$

This would extract the date of the log entry 2012-06-06 as capture group 1 and the error level ERROR as capture group 2.
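
A minimal sketch of this extraction in Python follows. It assumes the re module and escapes the dot before the milliseconds so that it matches only a literal period; the names LOG_LINE and LOG_PATTERN are hypothetical.

```python
import re

LOG_LINE = "2012-06-06 12:12.014 ERROR: Failed to connect to remote end"

# Group 1: the date. Group 2: the error level.
LOG_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) \d{2}:\d{2}\.\d{3} (\w*): .*$")

m = LOG_PATTERN.match(LOG_LINE)
if m is None:
    raise ValueError("line does not look like a log entry")

date = m.group(1)
level = m.group(2)
whole = m.group(0)  # group 0 is the entire matched line

print(date, level)  # 2012-06-06 ERROR
```

Checking the match result before reading the groups matters here, because a line in a different format produces no match at all rather than empty groups.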