R Language Modifying strings by substitution Eliminate duplicated consecutive elements


Let's say we want to eliminate duplicated subsequence element from a string (it can be more than one). For example:


and convert it into:


Using gsub, we can achieve it:

gsub("(\\d+)(,\\1)+","\\1", "2,14,14,14,19")
[1] "2,14,19"

It works also for more than one different repetition, for example:

 > gsub("(\\d+)(,\\1)+", "\\1", "2,14,14,14,19,19,20,21")
[1] "2,14,19,20,21"

Let's explain the regular expression:

  1. (\\d+): A group 1 delimited by () and finds any digit (at least one). Remember we need to use the double backslash (\\) here because for a character variable a backslash represents special escape character for literal string delimiters (\" or \'). \d\ is equivalent to: [0-9].
  2. ,: A punctuation sign: , (we can include spaces or any other delimiter)
  3. \\1: An identical string to the group 1, i.e.: the repeated number. If that doesn't happen, then the pattern doesn't match.

Let's try a similar situation: eliminate consecutive repeated words:


Then, just replace \d by \w, where \w matches any word character, including: any letter, digit or underscore. It is equivalent to [a-zA-Z0-9_]:

> gsub("(\\w+)(,\\1)+", "\\1", "one,two,two,three,four,four,five,six")
[1] "one,two,three,four,five,six"

Then, the above pattern includes as a particular case duplicated digits case.