Perl Language Unicode Handling invalid UTF-8

Help us to keep this website almost Ad Free! It takes only 10 seconds of your time:
> Step 1: Go view our video on YouTube: EF Core Bulk Insert
> Step 2: And Like the video. BONUS: You can also share it!

Example

Reading invalid UTF-8

When reading UTF-8 encoded data, it is important to be aware of the fact the UTF-8 encoded data can be invalid or malformed. Such data should usually not be accepted by your program (unless you know what you are doing). When unexpectedly encountering malformed data, different actions can be considered:

  • Print stacktrace or error message, and abort program gracefully, or
  • Insert a substitution character at the place where the malformed byte sequence appeared, print a warning message to STDERR and continue reading as nothing happened.

By default, Perl will warn you about encoding glitches, but it will not abort your program. You can make your program abort by making UTF-8 warnings fatal, but be aware of the caveats in Fatal Warnings.

The following example writes 3 bytes in encoding ISO 8859-1 to disk. It then tries to read the bytes back again as UTF-8 encoded data. One of the bytes, 0xE5, is an invalid UTF-8 one byte sequence:

use strict;
use warnings;
use warnings FATAL => 'utf8';

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';
my $bytes = "\x{61}\x{E5}\x{61}";  # 3 bytes in iso 8859-1: aåa
my $fn = 'test.txt';
open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
print $fh $bytes;
close $fh;
open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
my $str = do { local $/; <$fh> };
close $fh;
print "Read string: '$str'\n";

The program will abort with a fatal warning:

utf8 "\xE5" does not map to Unicode at ./test.pl line 10.

Line 10 is here the second last line, and the error occurs in the part of the line with <$fh> when trying to read a line from the file.

If you don't make warnings fatal in the above program, Perl will still print the warning. However, in this case it will try to recover from the malformed byte 0xE5 by inserting the four characters \xE5 into the stream, and then continue with the next byte. As a result, the program will print:

Read string: 'a\xE5a'


Got any Perl Language Question?