guava Finding and counting characters in a string


Example

To help you find and count characters in a string, CharMatcher provides the following methods:

  • int indexIn(CharSequence sequence)
    Returns the index of the first character that matches the CharMatcher instance. Returns -1 if no character matches.

  • int indexIn(CharSequence sequence, int start)
    Returns the index of the first character at or after the specified start position that matches the CharMatcher instance. Returns -1 if no character matches.

  • int lastIndexIn(CharSequence sequence)
    Returns the index of the last character that matches the CharMatcher instance. Returns -1 if no character matches.

  • int countIn(CharSequence sequence)
    Returns the number of characters that match the CharMatcher instance.
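
As a quick illustration of the four methods, the sketch below applies a digit matcher to a sample string. The class name CharMatcherBasics, the DIGIT constant, and the sample text are made up for this example; CharMatcher.inRange is used only as a convenient matcher to demonstrate with.

```java
import com.google.common.base.CharMatcher;

public class CharMatcherBasics {
    // Hypothetical matcher for this sketch: the ASCII digits '0' through '9'.
    static final CharMatcher DIGIT = CharMatcher.inRange('0', '9');

    public static void main(String[] args) {
        String text = "Room 42, floor 7";

        System.out.println(DIGIT.indexIn(text));         // first match: 5
        System.out.println(DIGIT.indexIn(text, 7));      // first match at or after index 7: 15
        System.out.println(DIGIT.lastIndexIn(text));     // last match: 15
        System.out.println(DIGIT.countIn(text));         // number of matches: 3
        System.out.println(DIGIT.indexIn("no digits"));  // no match: -1
    }
}
```

Note that the start argument is inclusive: calling DIGIT.indexIn(text, 5) returns 5, because the character at index 5 is itself a match.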

The following console application, NonAsciiFinder, uses these methods. It takes a string as its first command-line argument, prints the total number of non-ASCII characters in that string, and then prints each non-ASCII character it encounters together with its Unicode escape. Here's the code:

import com.google.common.base.CharMatcher;

public class NonAsciiFinder {
    private static final CharMatcher NON_ASCII = CharMatcher.ascii().negate();

    public static void main(String[] args) {
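        // Without an argument there is nothing to scan.
        if (args.length == 0) {
            echo("Usage: java NonAsciiFinder <string>");
            return;
        }

        String input = args[0];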
        int nonAsciiCount = NON_ASCII.countIn(input);

        echo("Non-ASCII characters found: %d", nonAsciiCount);

        if (nonAsciiCount > 0) {
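            // Index of the final match; the loop stops once it has been printed.
            int lastPosition = NON_ASCII.lastIndexIn(input);
            int position = -1;

            while (position != lastPosition) {
                // Resume the search just past the previous match.
                position = NON_ASCII.indexIn(input, position + 1);
                char character = input.charAt(position);

                echo("%s => \\u%04x", character, (int) character);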
            }
        }
    }

    private static void echo(String s, Object... args) {
        System.out.println(String.format(s, args));
    }
}
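
The loop can also be driven purely by the -1 return value of indexIn, which avoids both the separate count check and the call to lastIndexIn. The following is a sketch of that variation, not a replacement for the program above; the class name NonAsciiPositions and the helper positionsOf are hypothetical.

```java
import com.google.common.base.CharMatcher;
import java.util.ArrayList;
import java.util.List;

public class NonAsciiPositions {
    private static final CharMatcher NON_ASCII = CharMatcher.ascii().negate();

    // Hypothetical helper: collects the index of every matching character.
    static List<Integer> positionsOf(CharMatcher matcher, CharSequence input) {
        List<Integer> positions = new ArrayList<>();
        // indexIn returns -1 once no further match exists, which ends the loop.
        for (int i = matcher.indexIn(input); i != -1; i = matcher.indexIn(input, i + 1)) {
            positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // The sample text is written with Unicode escapes to stay encoding-safe.
        System.out.println(positionsOf(NON_ASCII, "Ma\u00eetre perch\u00e9")); // [2, 12]
    }
}
```

Passing i + 1 as the start position is safe even when the last match is the final character, because indexIn accepts a start equal to the length of the sequence and then simply returns -1.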

Note how NonAsciiFinder inverts a CharMatcher simply by calling its negate method. Similarly, the CharMatcher below matches all double-width characters and is created by negating the predefined CharMatcher for single-width characters.

static final CharMatcher DOUBLE_WIDTH = CharMatcher.singleWidth().negate();
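
To make the effect of negate concrete: a matcher and its negation split a string between them, so their counts add up to the string's length. The sketch below checks this for the ASCII matcher; the class name NegateDemo and the sample string are assumptions chosen for illustration.

```java
import com.google.common.base.CharMatcher;

public class NegateDemo {
    static final CharMatcher ASCII = CharMatcher.ascii();
    static final CharMatcher NON_ASCII = ASCII.negate();

    public static void main(String[] args) {
        // "perché", with the accented letter written as an escape to stay encoding-safe.
        String text = "perch\u00e9";

        int ascii = ASCII.countIn(text);        // 5
        int nonAscii = NON_ASCII.countIn(text); // 1

        // Every char is matched by exactly one of the two matchers.
        System.out.println(ascii + nonAscii == text.length()); // true
    }
}
```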

Running the NonAsciiFinder application produces the following output:

$> java NonAsciiFinder "Maître Corbeau, sur un arbre perché"
Non-ASCII characters found: 2
î => \u00ee
é => \u00e9
$> java NonAsciiFinder "古池や蛙飛び込む水の音"
Non-ASCII characters found: 11
古 => \u53e4
池 => \u6c60
や => \u3084
蛙 => \u86d9
飛 => \u98db
び => \u3073
込 => \u8fbc
む => \u3080
水 => \u6c34
の => \u306e
音 => \u97f3
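
One caveat when reading these counts: CharMatcher works on Java char values, that is, UTF-16 code units. Every character in the runs above fits in a single char, so the counts match what a reader sees. A character outside the Basic Multilingual Plane is stored as two chars and is therefore counted twice. The sketch below illustrates this; the class name SupplementaryCount and the sample character are assumptions for illustration.

```java
import com.google.common.base.CharMatcher;

public class SupplementaryCount {
    static final CharMatcher NON_ASCII = CharMatcher.ascii().negate();

    public static void main(String[] args) {
        // U+1F600 (a grinning-face emoji), written as its two UTF-16 surrogates.
        String emoji = "\uD83D\uDE00";

        System.out.println(NON_ASCII.countIn(emoji));                // 2 (code units)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (code point)
    }
}
```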