Let's say we want to eliminate consecutive duplicated elements from a delimited string (an element may be repeated more than once). For example:
2,14,14,14,19
and convert it into:
2,14,19
Using gsub, we can achieve it:
> gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19")
[1] "2,14,19"
It also works when several different elements are repeated, for example:
> gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19,19,20,21")
[1] "2,14,19,20,21"
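As a cross-check on the two results above, the same consecutive de-duplication can be sketched without a regular expression, using base R's strsplit and rle. The helper name dedup_runs and its sep parameter are hypothetical, introduced only for this illustration:

```r
# Hypothetical helper (not from the original text): collapse runs of
# identical consecutive elements in a delimited string, without a regex.
dedup_runs <- function(x, sep = ",") {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]  # split into elements
  paste(rle(parts)$values, collapse = sep)      # keep one value per run
}

dedup_runs("2,14,14,14,19")
# [1] "2,14,19"
dedup_runs("2,14,14,14,19,19,20,21")
# [1] "2,14,19,20,21"
```

Unlike gsub, this sketch handles a single string at a time; it is meant as a sanity check rather than a replacement for the regex approach.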
Let's explain the regular expression:
(\\d+): Group 1, delimited by (), which matches one or more digits. We need the double backslash (\\) because in an R string literal the backslash is itself the escape character (as in \" or \'), so \\ is how a single backslash reaches the regex engine. \d is equivalent to [0-9].
,: The delimiter, here a comma (it could include spaces or be any other delimiter).
\\1: A string identical to the one captured by group 1, i.e. the repeated number. If it is not there, the pattern does not match.
(,\\1)+: Group 2, which matches the delimiter followed by the repeated number, one or more times.
The replacement \\1 then substitutes the whole match with a single copy of group 1.
Let's try a similar situation: eliminate consecutive repeated words:
one,two,two,three,four,four,five,six
Then, just replace \d by \w, where \w matches any word character: any letter, digit or underscore. It is equivalent to [a-zA-Z0-9_]:
> gsub("(\\w+)(,\\1)+", "\\1", "one,two,two,three,four,four,five,six")
[1] "one,two,three,four,five,six"
Since \w also matches digits, this pattern covers the duplicated-digits case as a particular case.
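One caveat worth checking: because the patterns above are not anchored, group 1 can match only part of an element, so "12,2" or "2,14,141" may be altered even though they contain no duplicated element. A possible hardening, sketched here with the names loose and strict chosen for illustration and with perl = TRUE as an assumption (to use PCRE's word-boundary handling), is to wrap the element in \\b word boundaries:

```r
# Pattern from the text, and a variant anchored with word boundaries (\\b).
loose  <- "(\\d+)(,\\1)+"
strict <- "\\b(\\d+)(,\\1\\b)+"

x <- c("2,14,14,14,19", "12,2", "2,14,141")

# The unanchored pattern also matches inside longer numbers.
gsub(loose, "\\1", x, perl = TRUE)
# [1] "2,14,19" "12"      "2,141"

# With word boundaries, only whole repeated elements are collapsed.
gsub(strict, "\\1", x, perl = TRUE)
# [1] "2,14,19"  "12,2"     "2,14,141"
```

The same anchoring should apply to the \\w version of the pattern, since \\b is defined in terms of word characters.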