When reading UTF-8 encoded data, be aware that the data can be invalid or malformed. Your program should usually not accept such data (unless you know what you are doing). When you unexpectedly encounter malformed data, you can respond in different ways:
By default, Perl will warn you about encoding glitches, but it will not abort your program.
You can make your program abort by making UTF-8 warnings fatal, but be aware of the caveats described under "Fatal Warnings" in the warnings documentation (perldoc warnings).
The following example writes three bytes in ISO 8859-1 encoding to disk and then tries to read them back as UTF-8 encoded data. One of the bytes, 0xE5, is not valid UTF-8 here: it is the lead byte of a three-byte sequence, but the byte that follows it (0x61) is not a continuation byte:
use strict;
use warnings;
use warnings FATAL => 'utf8';
binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}\x{61}"; # 3 bytes in iso 8859-1: aåa
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
my $str = do { local $/; <$fh> };
close $fh;
print "Read string: '$str'\n";
The program will abort with a fatal warning:
utf8 "\xE5" does not map to Unicode at ./test.pl line 10.
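The line reported is the one that contains the read, my $str = do { local $/; <$fh> }; (the exact line number depends on the layout of your script file). The error occurs in the <$fh> part of that line, when Perl reads the contents of the file.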
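If aborting the whole program is too drastic, one possible alternative (not part of the example above) is to decode the bytes yourself with the core Encode module and trap the failure with eval. The following is only a minimal sketch under that assumption; the helper name decode_strict is hypothetical:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $str = eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $str;    # undef if decode() died
}

my $decoded = decode_strict("\x{61}\x{E5}\x{61}");    # same bytes as above
print defined $decoded ? "Decoded OK\n" : "Malformed UTF-8 input\n";
```

With this approach only the single decode call fails, so the caller can decide whether to skip the input, report it, or stop.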
If you don't make the warnings fatal in the example program above (that is, you remove the use warnings FATAL => 'utf8'; line), Perl will still print the warning. In this case, however, it will try to recover from the malformed byte 0xE5 by inserting the four characters \xE5 into the stream, and then continue with the next byte. As a result, the program will print:
Read string: 'a\xE5a'
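The same kind of recovery can be reproduced on an in-memory byte string. The sketch below assumes the core Encode module and its FB_PERLQQ fallback, which replaces each malformed byte with its \xHH notation while decoding; the helper name decode_lenient is hypothetical and not part of the example program:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decodes UTF-8 and replaces every malformed byte
# with the four characters \xHH instead of failing.
sub decode_lenient {
    my ($bytes) = @_;
    return Encode::decode( 'UTF-8', $bytes, Encode::FB_PERLQQ() );
}

my $str = decode_lenient("\x{61}\x{E5}\x{61}");    # same three bytes as above
printf "%d characters: '%s'\n", length $str, $str;
```

The three input bytes become six characters, because the single malformed byte is replaced by a backslash, an x, and two hexadecimal digits.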