Strings

Remarks

In .NET, strings (System.String) are sequences of characters (System.Char), and each System.Char is a single UTF-16 code unit. This distinction is important because the everyday definition of "character" differs from the definition used by .NET (and many other languages).

What people usually call a character is more correctly called a grapheme: it is displayed as a glyph and is defined by one or more Unicode code points. Each code point is in turn encoded as a sequence of one or more code units. This is why a single System.Char does not always represent a grapheme. Here is how they differ in practice:

  • One grapheme, because of combining characters, may consist of two or more code points: à can be composed of two code points, U+0061 LATIN SMALL LETTER A and U+0300 COMBINING GRAVE ACCENT. This is the most common mistake, because "\u0061\u0300".Length == 2 while you may expect 1.
  • There are duplicated characters: for example, à may be a single code point (U+00E0 LATIN SMALL LETTER A WITH GRAVE) or two code points as explained above. The two forms should be treated as the same text, but the == operator compares strings ordinally, so "\u00e0" == "\u0061\u0300" is false (and "\u00e0".Length != "\u0061\u0300".Length). To make them compare as equal, bring both strings to the same normalization form with the String.Normalize() method first, or use a culture-sensitive comparison.
  • A Unicode string may contain a composed or a decomposed sequence: for example, the character 한 (U+D55C HANGUL SYLLABLE HAN) may be a single code point (encoded as a single code unit in UTF-16) or a decomposed sequence of its jamo ᄒ (U+1112), ᅡ (U+1161) and ᆫ (U+11AB). The two forms should compare as equal, which again requires normalization.
  • One code point may be encoded as more than one code unit: the character 𠂊 (U+2008A, a CJK unified ideograph) is encoded as two System.Char values ("\ud840\udc8a") even though it is just one code point. UTF-16 is not a fixed-size encoding! This is a source of countless bugs (including serious security bugs): if, for example, your application enforces a maximum length and blindly truncates the string at that length, it may split a surrogate pair and create an invalid string.
  • Some languages have digraphs and trigraphs. In Czech, for example, ch is a standalone letter (sorted after h and before i), so when ordering a list of strings you will have fyzika before chemie.
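
The cases in the list above can be checked with a few small helpers. This is only a sketch: the class name StringMeasures and its method names are made up for illustration and are not part of .NET, while string.Length, StringInfo.LengthInTextElements, String.Normalize and StringComparer.Create are existing framework members. The result of the culture-aware sort depends on the culture data available on the machine.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class StringMeasures
{
    // Number of UTF-16 code units: this is what string.Length reports.
    public static int CodeUnits(string s) => s.Length;

    // Number of Unicode code points: a well-formed surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // Skip the low half of a surrogate pair so the pair is counted once.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                i++;
            count++;
        }
        return count;
    }

    // Number of text elements (the framework's approximation of graphemes).
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;

    // Ordinal equality after bringing both operands to normalization form C.
    public static bool EqualsNormalized(string a, string b) =>
        string.Equals(a.Normalize(NormalizationForm.FormC),
                      b.Normalize(NormalizationForm.FormC),
                      StringComparison.Ordinal);

    // Returns a sorted copy using the collation rules of the named culture.
    // The resulting order depends on the culture data installed on the system.
    public static string[] SortForCulture(string[] words, string cultureName)
    {
        var sorted = (string[])words.Clone();
        Array.Sort(sorted, StringComparer.Create(new CultureInfo(cultureName), false));
        return sorted;
    }
}
```

For the decomposed à ("\u0061\u0300"), CodeUnits and CodePoints both return 2 while TextElements returns 1. For 𠂊 ("\ud840\udc8a"), CodeUnits returns 2 while CodePoints and TextElements return 1. EqualsNormalized("\u00e0", "\u0061\u0300") and EqualsNormalized("\ud55c", "\u1112\u1161\u11ab") are both true, even though the == operator reports the operands as different. Calling SortForCulture with "cs-CZ" is how the Czech ordering described above would be requested, assuming Czech culture data is installed.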

There are many more issues with text handling; see for example "How can I perform a Unicode aware character by character comparison?" for a broader introduction and more links to related topics.

In general, when dealing with international text you can use this simple function to enumerate the text elements in a string (without splitting surrogate pairs or combining character sequences):

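using System.Collections.Generic;
using System.Globalization;

public static class StringExtensions
{
    // Enumerates the text elements of s after normalizing it (form C by default).
    public static IEnumerable<string> EnumerateCharacters(this string s)
    {
        // A null input yields an empty sequence.
        if (s == null)
            yield break;

        // Each text element is returned as its own string, for example
        // a base character with its combining marks, or a surrogate pair.
        var enumerator = StringInfo.GetTextElementEnumerator(s.Normalize());
        while (enumerator.MoveNext())
            yield return (string)enumerator.Value;
    }
}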
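
The same text-element boundaries can be used to avoid the truncation bug mentioned earlier. The following is a minimal sketch, not a definitive implementation: SafeTruncation and TruncateAtTextElement are hypothetical names, and the only framework call it relies on is StringInfo.ParseCombiningCharacters, which returns the starting index of each text element. Unlike EnumerateCharacters, it does not normalize the input, and which sequences count as one text element depends on the .NET version in use.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class SafeTruncation
{
    // Returns the longest prefix of s that has at most maxCodeUnits
    // UTF-16 code units and ends on a text-element boundary.
    public static string TruncateAtTextElement(string s, int maxCodeUnits)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (maxCodeUnits < 0) throw new ArgumentOutOfRangeException(nameof(maxCodeUnits));
        if (s.Length <= maxCodeUnits) return s;

        // Start index (in code units) of every text element in s.
        int[] starts = StringInfo.ParseCombiningCharacters(s);

        // Find the last boundary that does not exceed the limit.
        int cut = 0;
        foreach (int start in starts)
        {
            if (start > maxCodeUnits) break;
            cut = start;
        }
        return s.Substring(0, cut);
    }
}
```

For the input "a\ud840\udc8ab" and a limit of 2, s.Substring(0, 2) would keep a lone high surrogate, whereas TruncateAtTextElement returns "a". The result may be shorter than the limit, which is the price of never cutting inside a text element.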

Wednesday, February 22, 2017
Licensed under: CC-BY-SA
