1. Character Class
A character class is denoted by square brackets, []. Each character inside the brackets is treated as a separate alternative, and the class as a whole matches exactly one character. For example, suppose we use
[12345]
In the example above, the class matches 1 or 2 or 3 or 4 or 5. In simple words, it can be understood as an OR condition for single characters (stress on single character).
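As a small illustration of the point above, here is a sketch using Python's standard re module. The choice of Python and the sample string are assumptions made for this example; the text itself is engine-agnostic.

```python
import re

# [12345] matches exactly one character that is 1, 2, 3, 4 or 5.
digit_class = re.compile(r"[12345]")

# findall returns every single-character match, in order of appearance.
matches = digit_class.findall("a1b22c9d5")
print(matches)  # ['1', '2', '2', '5']
```

Note that 22 is reported as two separate matches, because the class consumes one character at a time.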
1.1 Word of caution
A character class is not a word, so don't use it to match words. Suppose we use [cat]; it does not mean that it should match the word cat literally, but that it should match either c or a or t. This is a very common misunderstanding among people who are new to regex.
Sometimes people also use | (alternation) inside a character class, thinking it will act as an OR condition, which is wrong. For example, [a|b] actually means match a or | (literally) or b.
2. Range in character class
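Both cautions can be checked with a short sketch in Python's re module. The sample strings and variable names are made up for this illustration.

```python
import re

text = "cat act c|t"

# The plain pattern cat matches the whole word only.
word_hits = re.findall(r"cat", text)
print(word_hits)  # ['cat']

# [cat] matches one character at a time: every c, a or t in the text.
char_hits = re.findall(r"[cat]", text)
print(char_hits)  # ['c', 'a', 't', 'a', 'c', 't', 'c', 't']

# Inside a class, | is an ordinary character, not alternation.
pipe_hits = re.findall(r"[a|b]", "a|b c")
print(pipe_hits)  # ['a', '|', 'b']
```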
A range in a character class is denoted using the - sign. Suppose we want to find any character within the English alphabet from A to Z. This can be done by using the following character class:
[A-Z]
This can be done for any valid ASCII or Unicode range. The most commonly used ranges are [A-Z], [a-z] and [0-9]. Moreover, these ranges can be combined in a single character class as
[A-Za-z0-9]
This matches any character in the range A to Z or a to z or 0 to 9. The ranges can be listed in any order, so the above is equivalent to [a-zA-Z0-9], as long as each range you define is correct.
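A minimal sketch of combined ranges, again assuming Python's re module and an invented sample string, shows that the order of the ranges does not change the result:

```python
import re

# Two spellings of the same class, with the ranges in a different order.
alnum_a = re.compile(r"[A-Za-z0-9]")
alnum_b = re.compile(r"[a-zA-Z0-9]")

sample = "Hi, R2-D2!"
found_a = alnum_a.findall(sample)
found_b = alnum_b.findall(sample)

print(found_a)             # ['H', 'i', 'R', '2', 'D', '2']
print(found_a == found_b)  # True
```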
2.1 Word of caution
Sometimes, when writing a range for A to Z, people write it as [A-z]. This is wrong in most cases because it uses a lowercase z instead of Z. It matches any character in the ASCII range from 65 (A) to 122 (z), which includes unintended characters between 90 (Z) and 97 (a): [, \, ], ^, _ and `.
However, [A-z] can be used to match all [a-zA-Z] letters in POSIX-style regex when collation is set for a particular language. For example,
[[ "ABCEDEF[]_abcdef" =~ ([A-z]+) ]] && echo "${BASH_REMATCH[1]}"
on Cygwin with LC_COLLATE="en_US.UTF-8" yields ABCEDEF.
If you set LC_COLLATE to C (on Cygwin, done with export), it will give the expected ABCEDEF[]_abcdef.
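The code-point behaviour can be reproduced with a sketch in Python. Python's re module compares code points and does not consult LC_COLLATE for ranges, so it corresponds to the C-locale result above, not the collation-dependent one. The variable names are illustrative only.

```python
import re

sample = "ABCEDEF[]_abcdef"

# [A-z] spans code points 65 to 122, so it runs through [, ] and _ as well.
too_wide = re.match(r"[A-z]+", sample).group(0)
print(too_wide)      # ABCEDEF[]_abcdef

# Two explicit ranges stop at the first non-letter.
letters_only = re.match(r"[A-Za-z]+", sample).group(0)
print(letters_only)  # ABCEDEF

# The non-letters that sit between Z (90) and a (97).
extras = "".join(chr(c) for c in range(65, 123) if not chr(c).isalpha())
print(extras)        # [\]^_`
```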
The meaning of - inside a character class is special: it denotes a range, as explained above. What if we want to match the - character literally? We can't put it just anywhere, because it denotes a range whenever it sits between two characters. In that case, put - at the start of the character class, like [-A-Z], or at the end, like [A-Z-], or escape it if you want to use it in the middle, like [A-Z\-a-z].
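The three placements of a literal hyphen can be compared in a short Python sketch. The sample string is invented for this illustration, and other engines may treat edge cases differently.

```python
import re

sample = "A-z_B"

# Hyphen first: literal - plus the range A-Z.
leading = re.findall(r"[-A-Z]", sample)
print(leading)   # ['A', '-', 'B']

# Hyphen last: same set of characters.
trailing = re.findall(r"[A-Z-]", sample)
print(trailing)  # ['A', '-', 'B']

# Escaped hyphen in the middle, between the ranges A-Z and a-z.
escaped = re.findall(r"[A-Z\-a-z]", sample)
print(escaped)   # ['A', '-', 'z', 'B']
```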
3. Negated character class
A negated character class is denoted by [^..]. The caret sign ^ means match any single character except those present in the character class. For example,
[^cat]
means match any character except c or a or t.
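A negated class can be tried out with a minimal Python sketch. The sample text is an assumption chosen for this example.

```python
import re

# [^cat] matches any single character that is not c, a or t.
not_cat = re.compile(r"[^cat]")

leftovers = not_cat.findall("cat in hat")
print(leftovers)  # [' ', 'i', 'n', ' ', 'h']
```

Spaces are matched too, since a negated class accepts every character that is not listed.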
3.1 Word of caution
^ denotes negation only if it is at the start of the character class. Anywhere else in the character class, it is treated as a literal caret character without any special meaning.
Consider the empty negated class [^]. In most regex engines, this gives an error, because a ^ in the starting position must be followed by at least one character to negate. In JavaScript, though, this is a valid construct matching anything but nothing, i.e. it matches any possible symbol (but diacritics, at least in ES5).
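Both points can be checked in Python, which is one of the engines that rejects the empty negated class. This is only a sketch: the exact error message depends on the engine and version, and the variable names are made up for the example.

```python
import re

# A caret that is not in first position is a literal caret.
literal_caret = re.findall(r"[a^b]", "a^b c")
print(literal_caret)  # ['a', '^', 'b']

# [^] on its own does not compile in Python's re module.
try:
    re.compile(r"[^]")
    empty_negation_ok = True
except re.error as exc:
    empty_negation_ok = False
    print("rejected:", exc)
```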