The utf8 pragma tells Perl to interpret the source code as UTF-8. This only works if your text editor also saves the source file as UTF-8.
With the pragma in effect, string literals can contain arbitrary Unicode characters. Identifiers can also contain Unicode, but only word-like characters (see perldata and perlrecharclass for more information):
use utf8;
use feature 'say'; # needed for say below
my $var1 = '§я§©😄'; # works fine
my $я = 4; # works since я is a word (matches \w) character
my $p§2 = 3; # does not work since § is not a word character.
say "ya" if $var1 =~ /я§/; # works fine (prints "ya")
Note: When printing text to the terminal, make sure the terminal supports UTF-8.
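The word-character rule for identifiers can be checked directly with a regex. The following is a minimal sketch rather than the exact test Perl's parser applies (perldata gives the precise identifier rules); the variable names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the literals below are decoded as characters

# \w matches letters, digits, marks and connector punctuation.
my $ya_is_word      = ('я' =~ /\w/) ? 1 : 0;   # Cyrillic letter
my $section_is_word = ('§' =~ /\w/) ? 1 : 0;   # punctuation, not a word character

binmode(STDOUT, ':encoding(UTF-8)');           # encode output for a UTF-8 terminal
print "я: $ya_is_word, §: $section_is_word\n"; # prints "я: 1, §: 0"
```

This matches the behaviour shown above: я is accepted in a variable name, § is not.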
The relationship between source encoding and output encoding can be complex and counter-intuitive. On a UTF-8 terminal, you may find that adding the utf8 pragma seems to break things:
$ perl -e 'print "Møøse\n"'
Møøse
$ perl -Mutf8 -e 'print "Møøse\n"'
M��se
$ perl -Mutf8 -CO -e 'print "Møøse\n"'
Møøse
In the first case, Perl treats the string as raw bytes and prints them unchanged. Because these bytes happen to be valid UTF-8, they look correct even though Perl does not know which characters they represent (for example, length("Møøse") returns 7, not 5). Once you add -Mutf8, Perl correctly decodes the UTF-8 source into characters, but output is in Latin-1 mode by default, and printing Latin-1 to a UTF-8 terminal does not work. Only when you also switch STDOUT to UTF-8 with -CO is the output correct.
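The byte-versus-character distinction can be made visible inside a script. The sketch below assumes the file is saved as UTF-8 and uses the core Encode module to produce the byte form; binmode on STDOUT is used here as an in-script counterpart to -CO.

```perl
use strict;
use warnings;
use utf8;                               # decode the literal below into characters
use Encode qw(encode);

my $chars = "Møøse";                    # character string
my $bytes = encode('UTF-8', $chars);    # the same text as raw UTF-8 bytes

my $char_len = length $chars;           # counts characters: 5
my $byte_len = length $bytes;           # counts bytes: 7, since each ø takes two

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters on output
print "$chars: $char_len characters, $byte_len bytes\n";
```

On a UTF-8 terminal this prints "Møøse: 5 characters, 7 bytes". Without the binmode call, the output would be garbled in the same way as the second command above.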
use utf8 affects only how the source code is read. It does not change the encoding of standard I/O or of any other file handle!
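Since the pragma leaves file handles alone, each handle needs its own encoding layer. A minimal sketch, assuming a scratch file created with the core File::Temp module purely for illustration:

```perl
use strict;
use warnings;
use utf8;
use File::Temp qw(tempfile);

my $text = "Møøse";

# Hypothetical scratch file, used only for this example.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Encode on the way out: characters become UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Decode on the way in: UTF-8 bytes become characters again.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $round_trip = <$in>;
close $in;

my $size_on_disk = -s $path;   # 7 bytes on disk for 5 characters
```

To avoid repeating the layer on every open call, the open pragma (use open qw(:std :encoding(UTF-8));) can set it as the default for the standard streams and for handles opened in its lexical scope.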