Guava Strings: Removing unwanted characters from a string


Example

The example "Checking a string for unwanted characters" describes how to test and reject strings that don't meet certain criteria. Rejecting input outright is not always possible, however, and sometimes you just have to make do with what you receive. In these cases, a cautious developer will sanitize the provided strings by removing any characters that might trip up further processing.

To remove, trim, and replace unwanted characters, the weapon of choice will again be Guava's CharMatcher class.

Removing characters

The two CharMatcher methods of interest in this section are:

  • String retainFrom(CharSequence sequence)
    Returns a string containing all the characters that matched the CharMatcher instance.

  • String removeFrom(CharSequence sequence)
    Returns a string containing all the characters that did not match the CharMatcher instance.

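As an example, we'll use CharMatcher.digit(), a predefined CharMatcher instance that, unsurprisingly, only matches digits. Note that recent Guava releases deprecate digit(); if you only need to match ASCII digits, CharMatcher.inRange('0', '9') does the job.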

String rock = "1, 2, 3 o'clock, 4 o'clock rock!";

CharMatcher.digit().retainFrom(rock); // "1234"
CharMatcher.digit().removeFrom(rock); // ", ,  o'clock,  o'clock rock!"
CharMatcher.digit().negate().removeFrom(rock); // "1234"

The last line in this example illustrates that removeFrom is actually the inverse operation of retainFrom. Invoking retainFrom on a CharMatcher has the same effect as invoking removeFrom on a negated version of that CharMatcher.
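To show the two methods in a complete, compilable form, here is a minimal sketch. The class name DigitFilter and its helper methods are hypothetical, and the sketch assumes that matching ASCII digits only (via CharMatcher.inRange('0', '9')) is what the application wants.

```java
import com.google.common.base.CharMatcher;

public class DigitFilter {
    // Assumption: only ASCII digits matter, so an explicit range is used.
    private static final CharMatcher ASCII_DIGIT = CharMatcher.inRange('0', '9');

    /** Hypothetical helper: keeps only the ASCII digits of the input. */
    public static String digitsOnly(String input) {
        return ASCII_DIGIT.retainFrom(input);
    }

    /** Hypothetical helper: drops every ASCII digit from the input. */
    public static String withoutDigits(String input) {
        return ASCII_DIGIT.removeFrom(input);
    }

    public static void main(String[] args) {
        String phone = "+1 (555) 010-9999";
        System.out.println(digitsOnly(phone));    // 15550109999
        System.out.println(withoutDigits(phone)); // + () -
    }
}
```

Storing the matcher in a constant keeps the intent readable and avoids rebuilding it on every call.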

Trimming leading and trailing characters

Removing leading and trailing characters is a very common operation, most frequently used to trim whitespace from strings. Guava's CharMatcher offers these trimming methods:

  • String trimLeadingFrom(CharSequence sequence)
    Removes all leading characters that match the CharMatcher instance.

  • String trimTrailingFrom(CharSequence sequence)
    Removes all trailing characters that match the CharMatcher instance.

  • String trimFrom(CharSequence sequence)
    Removes all leading and trailing characters that match the CharMatcher instance.

When used with CharMatcher.whitespace(), these methods will effectively take care of all your whitespace trimming needs:

CharMatcher.whitespace().trimFrom("   Too much space   "); // returns "Too much space"
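The one-sided variants work with any matcher, not just whitespace. The following sketch illustrates this with single-character matchers; the class TrimDemo and its helper methods are made up for the purpose of this example.

```java
import com.google.common.base.CharMatcher;

public class TrimDemo {
    /** Hypothetical helper: removes leading '0' characters only. */
    public static String stripLeadingZeros(String s) {
        return CharMatcher.is('0').trimLeadingFrom(s);
    }

    /** Hypothetical helper: removes trailing '/' characters only. */
    public static String stripTrailingSlashes(String s) {
        return CharMatcher.is('/').trimTrailingFrom(s);
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingZeros("000420"));           // 420
        System.out.println(stripTrailingSlashes("/var/log///"));   // /var/log
        // Trimming never touches matching characters in the middle of the string.
        System.out.println(CharMatcher.whitespace().trimFrom("  a  b  ")); // a  b
    }
}
```

Keep in mind that a string consisting only of matching characters is trimmed down to the empty string, so stripLeadingZeros("000") returns "".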

Replacing characters

Often, applications will replace characters that are not allowed in a certain situation with a placeholder character. To replace characters in a string, CharMatcher's API provides the following methods:

  • String replaceFrom(CharSequence sequence, char replacement)
    Replaces all occurrences of characters that match the CharMatcher instance with the provided replacement character.

  • String replaceFrom(CharSequence sequence, CharSequence replacement)
    Replaces all occurrences of characters that match the CharMatcher instance with the provided replacement character sequence (string).

  • String collapseFrom(CharSequence sequence, char replacement)
    Replaces groups of consecutive characters that match the CharMatcher instance with a single instance of the provided replacement character.

  • String trimAndCollapseFrom(CharSequence sequence, char replacement)
    Behaves the same as collapseFrom, but matching groups at the start and the end are removed rather than replaced.
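As a quick illustration of the string-based overload and of collapsing, consider the sketch below. The class ReplaceDemo and its two helper methods are hypothetical names chosen for this example.

```java
import com.google.common.base.CharMatcher;

public class ReplaceDemo {
    /** Hypothetical helper: replaces each '&' with the word "and". */
    public static String spellOutAmpersands(String text) {
        // The CharSequence overload allows a replacement longer than one character.
        return CharMatcher.is('&').replaceFrom(text, "and");
    }

    /** Hypothetical helper: trims the ends and squeezes inner whitespace runs to one space. */
    public static String normalizeWhitespace(String text) {
        return CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
    }

    public static void main(String[] args) {
        System.out.println(spellOutAmpersands("Tom & Jerry & Co"));
        // Tom and Jerry and Co
        System.out.println(normalizeWhitespace("  too \t many\n\nspaces "));
        // too many spaces
    }
}
```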

Let's look at an example that demonstrates how the behavior of these methods differs. Say that we're creating an application that lets the user specify output filenames. To sanitize the input provided by the user, we create a CharMatcher instance that is a combination of the predefined whitespace CharMatcher and a custom CharMatcher that specifies a set of characters that we would rather avoid in our filenames.

CharMatcher illegal = CharMatcher.whitespace().or(CharMatcher.anyOf("<>:|?*\"/\\"));

Now we invoke the replacement methods discussed above on a filename that is in dire need of cleanup:

String filename = "<A::12> first draft???";

System.out.println(illegal.replaceFrom(filename, '_'));
System.out.println(illegal.collapseFrom(filename, '_'));
System.out.println(illegal.trimAndCollapseFrom(filename, '_'));

We'll see the output below in our console.

_A__12__first_draft___
_A_12_first_draft_
A_12_first_draft
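In an application, the matcher would typically live in a constant and the cleanup in a small helper. Here is one possible sketch; the class FilenameSanitizer and the method sanitize are hypothetical, and choosing trimAndCollapseFrom as the cleanup strategy is an assumption about what the application wants.

```java
import com.google.common.base.CharMatcher;

public class FilenameSanitizer {
    // Same matcher as in the example: whitespace plus a set of characters to avoid.
    private static final CharMatcher ILLEGAL =
            CharMatcher.whitespace().or(CharMatcher.anyOf("<>:|?*\"/\\"));

    /** Hypothetical helper: collapses illegal runs to '_' and drops them at both ends. */
    public static String sanitize(String rawName) {
        return ILLEGAL.trimAndCollapseFrom(rawName, '_');
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<A::12> first draft???"));    // A_12_first_draft
        System.out.println(sanitize("  report: Q3/Q4 *final*  ")); // report_Q3_Q4_final
        // Input made up entirely of illegal characters leaves nothing behind.
        System.out.println(sanitize("???").isEmpty());             // true
    }
}
```

Because the result can be empty, callers of such a helper would probably want to check for that case and fall back to a default filename.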