Julia Language Graphemes


Example

Julia's Char type represents a Unicode scalar value, which only in some cases corresponds to what humans perceive as a "character". For instance, one representation of the character é, as in résumé, is actually a combination of two Unicode scalar values:

julia> collect("é")
2-element Array{Char,1}:
 'e'
 '́'

The Unicode descriptions for these codepoints are "LATIN SMALL LETTER E" and "COMBINING ACUTE ACCENT". Together, they define a single "human" character, which in Unicode terms is called a grapheme. More specifically, Unicode Annex #29 motivates the definition of a grapheme cluster because:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
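To make the distinction concrete, here is a minimal sketch that builds é in two ways using explicit escapes: as a base letter plus a combining accent, and as the single precomposed code point U+00E9. The variable names are illustrative only, and the sketch assumes Julia 0.7 or later, where graphemes and normalize live in the Unicode standard library.

```julia
using Unicode

decomposed  = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

println(length(decomposed))               # number of Char values: 2
println(length(precomposed))              # number of Char values: 1
println(length(graphemes(decomposed)))    # grapheme clusters: 1
println(length(graphemes(precomposed)))   # grapheme clusters: 1

# The two strings render identically but are not equal until normalized.
println(decomposed == precomposed)                           # false
println(Unicode.normalize(decomposed, :NFC) == precomposed)  # true
```

Both strings contain one user-perceived character, which is why counting grapheme clusters gives the same answer for each while counting Char values does not.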

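Julia provides the graphemes function to iterate over the grapheme clusters in a string. In Julia 0.7 and later it lives in the Unicode standard library, so it must be loaded first:

julia> using Unicode

julia> for c in graphemes("résumé")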
           println(c)
       end
r
é
s
u
m
é

Note how each user-perceived character is printed on its own line. Iterating over the Unicode scalar values instead splits each é into a base letter and a separate combining accent:

julia> for c in "résumé"
           println(c)
       end
r
e
́
s
u
m
e
́
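The two iterations can also be compared by collecting them. The following is a rough sketch, assuming the word is stored in decomposed form (written here with explicit \u0301 escapes so the example does not depend on how the source text was encoded); the variable names are illustrative.

```julia
using Unicode

word = "re\u0301sume\u0301"   # "résumé" with combining acute accents

chars    = collect(word)             # Unicode scalar values (Char)
clusters = collect(graphemes(word))  # grapheme clusters (each one a string)

println(length(chars))     # 8
println(length(clusters))  # 6

# Each grapheme cluster is itself a string and may hold several Chars.
for g in clusters
    println(g, " => ", length(g), " Char(s)")
end
```

Note that graphemes yields strings rather than Char values, since a single cluster may span more than one code point.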

Typically, when working with characters in a user-perceived sense, it is more useful to deal with grapheme clusters than with Unicode scalar values. For instance, suppose we want to write a function to compute the length of a single word. A naïve solution would be to use

julia> wordlength(word) = length(word)
wordlength (generic function with 1 method)

The result is counterintuitive when the word includes grapheme clusters made up of more than one codepoint, because length counts Unicode scalar values:

julia> wordlength("résumé")
8

Redefining the function in terms of the graphemes function gives the expected result:

julia> wordlength(word) = length(graphemes(word))
wordlength (generic function with 1 method)

julia> wordlength("résumé")
6
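The same idea applies to other per-character operations. As one possible illustration, the sketch below reverses a word by grapheme cluster; reversegraphemes is a hypothetical helper written for this example, not a function provided by Julia, and the input is again assumed to be in decomposed form.

```julia
using Unicode

# Hypothetical helper: reverse a string one grapheme cluster at a time.
reversegraphemes(s::AbstractString) = join(reverse(collect(graphemes(s))))

word = "re\u0301sume\u0301"   # "résumé" with combining acute accents

naive   = reverse(word)           # reverses Char by Char
careful = reversegraphemes(word)  # keeps each accent with its base letter

println(first(naive) == '\u0301')         # true: a stray accent comes first
println(careful == "e\u0301muse\u0301r")  # true
```

The built-in reverse detaches each combining accent from its base letter, whereas working with grapheme clusters keeps each é intact.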