In .NET a string (System.String) is a sequence of characters (System.Char), and each character is a UTF-16 code unit. This distinction is important because the everyday notion of "character" and the definition used by .NET (and many other languages) are different. What we usually call a character is correctly called a grapheme: it is displayed as a glyph and is defined by one or more Unicode code points. Each code point is in turn encoded as a sequence of code units. It should now be clear why a single System.Char does not always represent a grapheme; let's see how they differ in the real world:
"à".Length == 2
while you may expect 1
."\u00e0" == "\u0061\u0300"
(even if "\u00e0".Length != "\u0061\u0300".Length
). This is possible because of string normalization performed by String.Normalize()
method.System.Char
("\ud840\udc8a"
This variable-width encoding is a source of countless bugs (including serious security bugs): if, for example, your application enforces a maximum length and blindly truncates strings at that limit, it may split a surrogate pair and produce an invalid string. There are many more issues in text handling; see for example "How can I perform a Unicode aware character by character comparison?" for a broader introduction and more links to related topics.
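To make the truncation bug concrete, a minimal sketch (the one-char limit is arbitrary, chosen only so that it splits the surrogate pair):

using System;

static class TruncationBugDemo
{
    static void Main()
    {
        // U+2008A is one grapheme but two code units; truncating to a
        // maximum length of one char cuts the surrogate pair in half.
        string truncated = "\ud840\udc8a".Substring(0, 1);

        // The result is not valid UTF-16: it is a lone high surrogate.
        Console.WriteLine(char.IsHighSurrogate(truncated[0])); // True
        Console.WriteLine(char.IsSurrogatePair(truncated, 0)); // False
    }
}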
In general, when dealing with international text, you can use this simple function to enumerate the text elements of a string (without breaking Unicode surrogate pairs or combining sequences):
using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions
{
    // Enumerates the text elements (graphemes) of a string, never splitting
    // surrogate pairs or combining character sequences.
    public static IEnumerable<string> EnumerateCharacters(this string s)
    {
        // An iterator cannot also use a plain 'return', so a null input
        // is handled with 'yield break' (an empty sequence).
        if (s == null)
            yield break;

        var enumerator = StringInfo.GetTextElementEnumerator(s.Normalize());
        while (enumerator.MoveNext())
            yield return (string)enumerator.Value;
    }
}
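Usage is straightforward; a minimal sketch, assuming the StringExtensions class above is in scope (the input strings are arbitrary examples):

using System;
using System.Linq;

static class Usage
{
    static void Main()
    {
        // Two graphemes: "à" (composed by Normalize() from a + combining
        // accent) and U+2008A (a surrogate pair).
        foreach (string grapheme in "\u0061\u0300\ud840\udc8a".EnumerateCharacters())
            Console.WriteLine(grapheme.Length); // 1, then 2: whole elements, not code units

        // Grapheme-safe truncation: take whole text elements instead of
        // blindly calling Substring on code units.
        string safe = string.Concat("\u00e0\ud840\udc8a".EnumerateCharacters().Take(1));
        Console.WriteLine(safe == "\u00e0"); // True: a valid, complete string
    }
}

Note that, because EnumerateCharacters normalizes its input first, the returned elements may not be substrings of the original string.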