Guava: Checking a string for unwanted characters


Example

As a developer, you frequently find yourself dealing with strings that are not created by your own code.

These are often supplied by third-party libraries, external systems, or even end users. Validating strings of unclear provenance is one of the hallmarks of defensive programming, and in most cases you will want to reject string input that does not meet your expectations.

A fairly common requirement is to allow only alphanumeric characters in an input string, so we'll use that as an example. In plain Java, either of the following two methods can do the job:

public static boolean isAlphanumeric(String s) {
    for (char c : s.toCharArray()) {
        if (!Character.isLetterOrDigit(c)) {
            return false;
        }
    }

    return true;
}

public static boolean isAlphanumeric(String s) {
    return s.matches("^[0-9a-zA-Z]*$");
}

The first version converts the string to a character array, and then uses the Character class's static isLetterOrDigit method to determine whether each character in the array is alphanumeric. This approach is predictable and readable, albeit a little verbose.

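The second version uses a regular expression to achieve much the same purpose. It is more concise, but can be somewhat enigmatic to developers with limited or no knowledge of regular expressions. It is also stricter: the character class only admits the ASCII ranges 0-9, a-z and A-Z, whereas Character.isLetterOrDigit accepts any character that Java classifies as a letter or digit.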
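To see how the two plain-Java variants behave on a few inputs, here is a minimal sketch that puts them side by side. The class name AlphanumericChecks, the renamed methods and the sample strings are made up for this illustration; only Character.isLetterOrDigit and String.matches come from the JDK.

```java
// Sketch only: the two plain-Java variants, renamed so they can share a class.
final class AlphanumericChecks {

    // Variant 1: accepts every character Java classifies as a letter or digit.
    static boolean isAlphanumericByCharacter(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false;
            }
        }
        return true;
    }

    // Variant 2: accepts only the ASCII ranges 0-9, a-z and A-Z.
    // String.matches already requires a full match, so no anchors are needed.
    static boolean isAlphanumericByRegex(String s) {
        return s.matches("[0-9a-zA-Z]*");
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs; "caf\u00e9" ends in an accented letter.
        String[] samples = {"abc123", "", "caf\u00e9", "a b"};
        for (String sample : samples) {
            System.out.printf("\"%s\": character=%b, regex=%b%n",
                    sample,
                    isAlphanumericByCharacter(sample),
                    isAlphanumericByRegex(sample));
        }
    }
}
```

Both variants accept the empty string and reject the string containing a space. They disagree on the accented sample, which the character-based variant accepts and the regular expression rejects.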

Guava introduces the CharMatcher class to deal with these types of situations. Our alphanumeric test, using Guava, would look as follows:

import static com.google.common.base.CharMatcher.javaLetterOrDigit;

/* ... */

public static boolean isAlphanumeric(String s) {
    return javaLetterOrDigit().matchesAllOf(s);
}

The method body contains only one line, but there's actually a lot going on here, so let's break things down a little bit further.

If you take a look at the API of Guava's CharMatcher class, you'll notice that it implements the Predicate<Character> interface. If you were to create a class that implements Predicate<Character> yourself, it could look something like this:

import com.google.common.base.Predicate;

public class AlphanumericPredicate implements Predicate<Character> {
    @Override
    public boolean apply(Character c) {
        return Character.isLetterOrDigit(c);
    }
}

In Guava, as in a number of other programming languages and libraries that cater to a functional style of programming, a predicate is a construct that evaluates a given input to either true or false. In Guava's Predicate<T> interface, this is made evident by its sole abstract method, boolean apply(T t). The CharMatcher class is built on this concept: it evaluates a character, or a sequence of characters, against the criteria laid out by the CharMatcher instance in use.
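The relationship between CharMatcher and Predicate<Character> can be sketched as follows, assuming Guava is on the classpath. The class name CharMatcherAsPredicate and the two matchers are invented for this illustration; CharMatcher.anyOf, CharMatcher.forPredicate, matches and matchesAllOf are existing Guava methods.

```java
import com.google.common.base.CharMatcher;
import com.google.common.base.Predicate;

// Sketch only: a CharMatcher used as a predicate, and a predicate used as a CharMatcher.
final class CharMatcherAsPredicate {

    // Hypothetical matcher for illustration: lower-case ASCII vowels.
    static final CharMatcher VOWEL = CharMatcher.anyOf("aeiou");

    // Because CharMatcher implements Predicate<Character>, this assignment compiles.
    static final Predicate<Character> VOWEL_PREDICATE = VOWEL;

    // The other direction: wrap a hand-written predicate in a CharMatcher.
    static final CharMatcher LETTER_OR_DIGIT =
            CharMatcher.forPredicate(new Predicate<Character>() {
                @Override
                public boolean apply(Character c) {
                    return Character.isLetterOrDigit(c);
                }
            });

    public static void main(String[] args) {
        System.out.println(VOWEL.matches('e'));                    // single character
        System.out.println(VOWEL_PREDICATE.apply('x'));            // same test via the interface
        System.out.println(LETTER_OR_DIGIT.matchesAllOf("abc123")); // whole sequence
    }
}
```

When you hold a CharMatcher directly, matches(char) is the more natural call, since it avoids boxing each character into a Character object.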

Guava currently provides the following predefined character matchers:

Matcher               Description
any()                 Matches any character.
none()                Matches no characters.
javaDigit()           Matches digits, according to Java's definition.
javaUpperCase()       Matches any upper case character, according to Java's definition.
javaLowerCase()       Matches any lower case character, according to Java's definition.
javaLetter()          Matches any letter, according to Java's definition.
javaLetterOrDigit()   Matches any letter or digit, according to Java's definition.
javaIsoControl()      Matches any ISO control character, according to Java's definition.
ascii()               Matches any character in the ASCII character set.
invisible()           Matches characters that are not visible, according to the Unicode standard.
digit()               Matches any digit, according to the Unicode specification.
whitespace()          Matches any whitespace character, according to the Unicode specification.
breakingWhitespace()  Matches any breaking whitespace character, according to the Unicode specification.
singleWidth()         Matches any single-width character.

If you have read through the table above, you've undoubtedly noticed how much definition and specification is involved in determining which characters belong to a certain category. Guava's approach, so far, has been twofold. It provides CharMatcher wrappers for a number of the character categories defined by Java, and you can consult the API of Java's Character class to get more information about these categories. It also supplies a number of CharMatcher instances that aim to be in line with the current Unicode specification. For the nitty-gritty details, consult the CharMatcher API documentation.
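A small sketch can show why the distinction between Java's definitions and the Unicode specification matters. It assumes Guava is on the classpath and uses the no-break space (U+00A0) as a sample character; the class and method names are invented for this illustration.

```java
import com.google.common.base.CharMatcher;

// Sketch only: one character, three different answers to "is this whitespace?".
final class WhitespaceDefinitions {

    // Sample character chosen for illustration: the no-break space.
    static final char NO_BREAK_SPACE = '\u00A0';

    // Java's own definition, which excludes non-breaking spaces.
    static boolean isJavaWhitespace(char c) {
        return Character.isWhitespace(c);
    }

    // Guava's matcher that follows the Unicode whitespace definition.
    static boolean isUnicodeWhitespace(char c) {
        return CharMatcher.whitespace().matches(c);
    }

    // Guava's matcher for whitespace at which a line may be broken.
    static boolean isBreakingWhitespace(char c) {
        return CharMatcher.breakingWhitespace().matches(c);
    }

    public static void main(String[] args) {
        System.out.println("Character.isWhitespace:  " + isJavaWhitespace(NO_BREAK_SPACE));
        System.out.println("whitespace():            " + isUnicodeWhitespace(NO_BREAK_SPACE));
        System.out.println("breakingWhitespace():    " + isBreakingWhitespace(NO_BREAK_SPACE));
    }
}
```

As far as we can tell from the respective API documentation, only whitespace() treats the no-break space as a match, so the choice of matcher determines whether a string that contains it passes a whitespace check.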

Getting back to our example of checking a string for unwanted characters, the following CharMatcher methods provide the capabilities you need to check whether a given string's character usage meets your requirements:

  • boolean matchesNoneOf(CharSequence sequence)
    Returns true if none of the characters in the argument string match the CharMatcher instance.

  • boolean matchesAnyOf(CharSequence sequence)
    Returns true if at least one character in the argument string matches the CharMatcher instance.

  • boolean matchesAllOf(CharSequence sequence)
    Returns true if all of the characters in the argument string match the CharMatcher instance.
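The three methods above could be put to work as in the following sketch, which assumes Guava is on the classpath. The class name InputChecks, its method names and the ASCII_ALPHANUMERIC constant are hypothetical. The matcher is composed from CharMatcher.inRange and or() rather than javaLetterOrDigit(), partly because recent Guava releases mark javaLetterOrDigit() as deprecated and partly to show how a custom character set can be built.

```java
import com.google.common.base.CharMatcher;

// Sketch only: three input checks, one per matchesXxxOf method.
final class InputChecks {

    // ASCII letters and digits only, composed from range matchers.
    static final CharMatcher ASCII_ALPHANUMERIC =
            CharMatcher.inRange('0', '9')
                    .or(CharMatcher.inRange('a', 'z'))
                    .or(CharMatcher.inRange('A', 'Z'));

    // True if every character is an ASCII letter or digit.
    static boolean isAlphanumeric(String s) {
        return ASCII_ALPHANUMERIC.matchesAllOf(s);
    }

    // True if at least one character is whitespace.
    static boolean containsWhitespace(String s) {
        return CharMatcher.whitespace().matchesAnyOf(s);
    }

    // True if no character is an ISO control character.
    static boolean hasNoControlCharacters(String s) {
        return CharMatcher.javaIsoControl().matchesNoneOf(s);
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("abc123"));
        System.out.println(containsWhitespace("abc 123"));
        System.out.println(hasNoControlCharacters("abc\t123"));
    }
}
```

Keep the empty string in mind when choosing between these methods: matchesAllOf and matchesNoneOf both return true for it, while matchesAnyOf returns false, so a separate emptiness check may be needed if empty input should be rejected.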