Perl Language: Unicode


Remarks

A Warning on Filename Encoding


It is worth mentioning that filename encoding is not only platform-specific but also filesystem-specific.

It is usually, but never entirely, safe to assume that a filename you can encode and write to will still have the same name when you later try to open it for reading.

For instance, if you write to a filesystem such as FAT16, which doesn't support Unicode, your filenames might silently get translated into ASCII-compatible forms.

It is even less safe to assume that a file you can create, read and write by explicit name will have the same name when queried through other calls. For instance, readdir might return different bytes for your filename than the ones you passed to open.
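The check below is a minimal sketch of how one might test this on a given filesystem. The helper name filename_round_trip and the sample name are made up for illustration. It assumes the directory is empty and writable, creates a file under a UTF-8 encoded name, and compares those bytes with what readdir reports.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical helper: create a file whose name is the UTF-8 encoding of
# $name, then report the raw bytes readdir returns for that directory.
sub filename_round_trip {
    my ($dir, $name) = @_;
    my $wanted = encode('UTF-8', $name);    # the bytes we hand to open

    open my $fh, '>', "$dir/$wanted" or die "Cannot create file: $!";
    close $fh or die "Cannot close file: $!";

    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @found = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    # Count the directory entries that are byte-for-byte what we asked for.
    my $same = grep { $_ eq $wanted } @found;
    return { wanted => $wanted, found => \@found, same => $same ? 1 : 0 };
}

my $dir    = tempdir(CLEANUP => 1);
my $report = filename_round_trip($dir, "caf\x{E9}.txt");
print $report->{same}
    ? "readdir returned the same bytes\n"
    : "readdir returned different bytes\n";
```

Which branch is printed depends on the filesystem, so a test like this is only meaningful when run on each platform you intend to support.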

On some systems, such as VAX/VMS, you can't even assume that readdir will return the same filename you passed to open for names as simple as foo.bar, because the OS can mangle filename extensions.

Also, UNIX permits a very liberal set of characters in filenames, excluding only / and \0, whereas Windows forbids specific ranges of characters in filenames and reports an error if you use them.
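As a rough illustration, a conservative check could reject any name that either platform would refuse. The helper name is_portable_filename is hypothetical, and the Windows character list (< > : " / \ | ? * and control characters 0 to 31) is an assumption taken from Microsoft's file naming rules, not from this text. Reserved device names such as CON are deliberately not covered.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name avoids the characters that UNIX or
# Windows reject in a single path component.
sub is_portable_filename {
    my ($name) = @_;
    return 0 if !defined $name || $name eq '';
    return 0 if $name =~ m{[/\0]};                   # forbidden on UNIX
    return 0 if $name =~ m{[<>:"\\|?*\x00-\x1F]};    # forbidden on Windows (assumed list)
    return 1;
}

for my $candidate ('foo.bar', 'what?.txt', 'a/b') {
    printf "%-10s %s\n", $candidate,
        is_portable_filename($candidate) ? 'ok' : 'rejected';
}
```

Passing this check does not guarantee that the name survives a round trip unchanged; it only filters out names that are known to fail outright.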

Exercise great caution here: avoid fancy tricks with filenames if you have a choice, and always have tests to make sure that any fancy tricks you do use behave consistently.

Exercise twice as much caution if you're writing code intended to run on platforms outside your control, such as code intended for CPAN. Assume that at least 5% of your user base will be stuck using some ancient or broken technology, whether by choice, by accident, or by powers outside their control, and that these will conspire to create bugs for them.

:encoding(utf-8) vs :utf8


Since UTF-8 is one of the internal formats Perl uses to represent strings, the encoding/decoding step can often be skipped. If your data is already in UTF-8, you can simply use :utf8 instead of :encoding(utf-8). :utf8 is safe to use on output streams, but on input streams it can be dangerous: it does not validate the data, so invalid byte sequences cause internal inconsistency. Using :utf8 for input may also result in security breaches, so :encoding(utf-8) is advisable there.
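A minimal sketch of that advice follows, with made-up helper names write_text and read_text. It uses :utf8 when writing already-decoded text and the validating :encoding(UTF-8) layer when reading it back. How invalid input is reported on read depends on the PerlIO::encoding settings in effect, which this sketch does not configure.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write decoded text; :utf8 is fine on output.
sub write_text {
    my ($path, $text) = @_;
    open my $out, '>:utf8', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read text through the validating layer.
sub read_text {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
    local $/;               # slurp the whole file
    my $text = <$in>;
    close $in;
    return $text;
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close $tmp_fh;

write_text($tmp_path, "caf\x{E9}\n");
my $round_trip = read_text($tmp_path);
printf "read %d characters\n", length $round_trip;
```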

More details: What is the difference between :encoding and :utf8

UTF-8 vs utf8 vs UTF8


As of Perl v5.8.7, "UTF-8" (with a hyphen) means UTF-8 in its strict and security-conscious form, whereas "utf8" means UTF-8 in its liberal and loose form.

For example, "utf8" can be used for code points that don't exist in Unicode, like 0xFFFFFFFF. Correspondingly, a byte sequence that is invalid UTF-8, like "\x{FE}\x{83}\x{BF}\x{BF}\x{BF}\x{BF}\x{BF}", decodes into an invalid Unicode (but valid Perl) code point (0xFFFFFFFF) when using "utf8". The "UTF-8" encoding does not allow decoding to code points outside the range of valid Unicode and gives you a substitution character (0xFFFD) instead.
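The difference can be observed with Encode's decode function. This sketch deliberately uses a smaller out-of-range value than the one in the text: 0x110000, one past the last Unicode code point, whose UTF-8-style bytes are F4 90 80 80. That choice is an assumption made for portability, since very large code points may behave differently across Perl builds.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8-style bytes for 0x110000, just beyond the Unicode range.
my $bytes = "\xF4\x90\x80\x80";

my $lax    = decode('utf8',  $bytes);    # liberal form: keeps the code point
my $strict = decode('UTF-8', $bytes);    # strict form: substitutes U+FFFD

printf "utf8  gave: U+%X\n", ord $lax;
printf "UTF-8 gave: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $strict;
```

With the default CHECK argument, decode substitutes silently; passing Encode::FB_CROAK as a third argument makes the strict decoder die on such input instead.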

Since encoding names are case-insensitive, "UTF8" is the same as "utf8" (i.e. the non-strict variant).
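One way to see which variant a given spelling selects is to ask Encode for the canonical name of the encoding object, as in this short sketch:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each spelling to the canonical name Encode resolves it to.
my %canonical = map { $_ => find_encoding($_)->name }
                qw(UTF-8 utf-8 UTF8 utf8);

printf "%-6s => %s\n", $_, $canonical{$_} for sort keys %canonical;
```

The hyphenated spellings resolve to the strict encoding, which Encode names utf-8-strict, while both spellings without the hyphen resolve to utf8.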

More details: UTF-8 vs. utf8 vs. UTF8

More Reading


Perl's Unicode handling is described in more detail in the following sources:

Posts from stackoverflow.com (caveat: might not be up-to-date):

YouTube videos: