To help you find and count characters in a string, CharMatcher
provides the following methods:
int indexIn(CharSequence sequence)
Returns the index of the first character that matches the CharMatcher
instance. Returns -1 if no character matches.
int indexIn(CharSequence sequence, int start)
Returns the index of the first character at or after the specified start position that matches the CharMatcher
instance. Returns -1 if no character matches from that position onward.
int lastIndexIn(CharSequence sequence)
Returns the index of the last character that matches the CharMatcher
instance. Returns -1 if no character matches.
int countIn(CharSequence sequence)
Returns the number of characters that match the CharMatcher
instance.
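To make these contracts concrete, here is a minimal plain-JDK sketch of what the four methods compute. It is not Guava's implementation: the class MatcherContractSketch, its use of IntPredicate as a stand-in for a CharMatcher, and the digit matcher are all assumptions made purely for illustration.

```java
import java.util.function.IntPredicate;

// Illustrative stand-in only: mimics the documented results of the four
// CharMatcher methods with plain loops. Not Guava's actual code.
public class MatcherContractSketch {

    // First matching index at or after start, or -1 if there is none.
    static int indexIn(IntPredicate matcher, CharSequence sequence, int start) {
        for (int i = start; i < sequence.length(); i++) {
            if (matcher.test(sequence.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // First matching index in the whole sequence, or -1.
    static int indexIn(IntPredicate matcher, CharSequence sequence) {
        return indexIn(matcher, sequence, 0);
    }

    // Last matching index, found by scanning backwards, or -1.
    static int lastIndexIn(IntPredicate matcher, CharSequence sequence) {
        for (int i = sequence.length() - 1; i >= 0; i--) {
            if (matcher.test(sequence.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // Number of matching characters.
    static int countIn(IntPredicate matcher, CharSequence sequence) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            if (matcher.test(sequence.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        IntPredicate digit = c -> c >= '0' && c <= '9'; // hypothetical matcher
        String s = "a1b22c";
        System.out.println(indexIn(digit, s));      // 1
        System.out.println(indexIn(digit, s, 2));   // 3
        System.out.println(indexIn(digit, s, 3));   // 3 (the start index itself is checked)
        System.out.println(lastIndexIn(digit, s));  // 4
        System.out.println(countIn(digit, s));      // 3
    }
}
```

Because the start index itself is examined, a loop that visits every match has to resume the search at the previous match position plus one.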
Using these methods, here's a simple console application called NonAsciiFinder
that takes a string as its input argument. It first prints the total number of non-ASCII characters contained in the string.
It then prints the Unicode escape of each non-ASCII character it encounters. Here's the code:
import com.google.common.base.CharMatcher;

public class NonAsciiFinder {

    private static final CharMatcher NON_ASCII = CharMatcher.ascii().negate();

    public static void main(String[] args) {
        // Guard against a missing argument instead of failing on args[0].
        if (args.length == 0) {
            echo("Usage: java NonAsciiFinder <text>");
            return;
        }
        String input = args[0];
        int nonAsciiCount = NON_ASCII.countIn(input);
        echo("Non-ASCII characters found: %d", nonAsciiCount);
        if (nonAsciiCount > 0) {
            // Index of the last match, computed once rather than on every pass.
            int last = NON_ASCII.lastIndexIn(input);
            int position = -1;
            while (position != last) {
                // Resume the search just past the previous match.
                position = NON_ASCII.indexIn(input, position + 1);
                char character = input.charAt(position);
                echo("%s => \\u%04x", character, (int) character);
            }
        }
    }

    private static void echo(String s, Object... args) {
        System.out.println(String.format(s, args));
    }
}
Note in the above example how you can simply invert a CharMatcher by calling its negate method.
Similarly, the CharMatcher below matches all double-width characters and is created by negating
the predefined CharMatcher for single-width characters.
static final CharMatcher DOUBLE_WIDTH = CharMatcher.singleWidth().negate();
Running the NonAsciiFinder application produces the following output:
$> java NonAsciiFinder "Maître Corbeau, sur un arbre perché"
Non-ASCII characters found: 2
î => \u00ee
é => \u00e9
$> java NonAsciiFinder "古池や蛙飛び込む水の音"
Non-ASCII characters found: 11
古 => \u53e4
池 => \u6c60
や => \u3084
蛙 => \u86d9
飛 => \u98db
び => \u3073
込 => \u8fbc
む => \u3080
水 => \u6c34
の => \u306e
音 => \u97f3
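One caveat when reading output like this: CharMatcher works on Java char values, that is, UTF-16 code units. A character outside the Basic Multilingual Plane would therefore show up as two surrogate entries rather than one. The sketch below uses only the JDK to reproduce the same escape formatting and to show how such a character splits. The class name EscapeFormatSketch and the helper describe are hypothetical names introduced for illustration.

```java
public class EscapeFormatSketch {

    // Formats one UTF-16 code unit the way NonAsciiFinder prints it.
    static String describe(char c) {
        return String.format("%s => \\u%04x", c, (int) c);
    }

    public static void main(String[] args) {
        // U+00EE (i with circumflex) fits in a single char.
        System.out.println(describe((char) 0x00EE));

        // U+1F600 lies outside the BMP, so the String holds two chars
        // (a high and a low surrogate), each formatted separately.
        String supplementary = new String(Character.toChars(0x1F600));
        for (int i = 0; i < supplementary.length(); i++) {
            System.out.println(describe(supplementary.charAt(i)));
        }
    }
}
```

If whole code points matter, iterating with String.codePoints() and testing each value yourself is one option, at the cost of not using CharMatcher for that step.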