Let's say we want to eliminate consecutive duplicated elements from a delimited string (an element may be repeated more than once). For example:
2,14,14,14,19
and convert it into:
2,14,19
Using gsub, we can achieve it:
> gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19")
[1] "2,14,19"
It also works when several different elements are repeated, for example:
> gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19,19,20,21")
[1] "2,14,19,20,21"
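To see what the pattern actually captures before it is replaced, it can help to list the matches themselves. The following is a small sketch using base R's regmatches() and gregexpr(); the variable names x, pattern, runs and result are only illustrative.

```r
x <- "2,14,14,14,19,19,20,21"
pattern <- "(\\d+)(,\\1)+"

# List every substring the pattern matches; each one is a run of repeats
runs <- regmatches(x, gregexpr(pattern, x))[[1]]
print(runs)

# gsub() replaces each run with group 1, i.e. a single copy of the number
result <- gsub(pattern, "\\1", x)
print(result)
```

Each matched run is collapsed to one copy of its number, while the elements that are not repeated are left untouched.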
Let's explain the regular expression:
(\\d+): Group 1, delimited by (), matches one or more digits. We need the double backslash (\\) because, inside an R string literal, the backslash is itself the escape character (as in \" or \'), so a literal backslash for the regular expression must be written as \\. \d is equivalent to [0-9].
,: The delimiter, here a comma (it could be a space or any other delimiter).
\\1: A backreference: a string identical to whatever group 1 matched, i.e. the repeated number. If it is not there, the pattern doesn't match.
(,\\1)+: The delimiter followed by the repeated number, one or more times, so a run of any length is matched. The whole run is then replaced by \\1, a single copy of the number.
Let's try a similar situation: eliminating consecutive repeated words:
one,two,two,three,four,four,five,six
To do this, just replace \d with \w, which matches any word character: a letter, a digit or an underscore. For ASCII text it is equivalent to [a-zA-Z0-9_]:
> gsub("(\\w+)(,\\1)+", "\\1", "one,two,two,three,four,four,five,six")
[1] "one,two,three,four,five,six"
Since \w also matches digits, this pattern covers the duplicated-number case above as a special case.
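One caveat, offered as a sketch rather than as part of the recipe above: neither pattern anchors the match at element boundaries, so an element that merely ends or begins with the same characters as its neighbour can be treated as a repeat. Adding the word-boundary assertion \b is one possible guard. The names loose and strict and the sample input are illustrative, and perl = TRUE is an assumption made here so that \b follows PCRE semantics.

```r
x <- "12,2,14,141,7,7"

# Original pattern: "2,2" is found inside "12,2" and "14,14" inside "14,141",
# so distinct elements get merged by mistake
loose <- gsub("(\\w+)(,\\1)+", "\\1", x)

# Variant with word boundaries (\\b), so only whole elements are compared
strict <- gsub("\\b(\\w+)(,\\1\\b)+", "\\1", x, perl = TRUE)

print(loose)
print(strict)  # only the genuine repeat "7,7" is collapsed
```

For inputs like the ones shown earlier, where no element is a prefix or suffix of its neighbour, both versions give the same result.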