You should verify that every received string is valid UTF-8 before you store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it consistently. There's no way around this, because malicious clients can submit data in whatever encoding they want.
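$string = $_REQUEST['user_comment'] ?? '';
if (!mb_check_encoding($string, 'UTF-8')) {
    // The string is not valid UTF-8, so try to re-encode it.
    // mb_detect_encoding() only guesses, and returns false when it cannot decide.
    $actualEncoding = mb_detect_encoding($string);
    $string = ($actualEncoding === false)
        ? '' // unknown encoding: discard rather than store invalid bytes
        : mb_convert_encoding($string, 'UTF-8', $actualEncoding);
}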
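Since the check only helps if it is applied to every input, one option is to wrap it in a small helper and call that everywhere instead of reading the superglobals directly. The following is just a sketch, not part of the original answer: the function name ensure_utf8() and the Windows-1252 fallback are assumptions for illustration, so pick whatever legacy encoding your clients are actually likely to send.

```php
<?php
// Hypothetical helper: returns $value as valid UTF-8, or null if it is not a string.
// $fallbackEncoding is an assumption about what non-UTF-8 clients send.
function ensure_utf8($value, $fallbackEncoding = 'Windows-1252')
{
    if (!is_string($value)) {
        return null; // missing keys or array input (e.g. ?a[]=1) are rejected
    }
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, keep it untouched
    }
    // Not valid UTF-8: reinterpret the bytes as the assumed fallback encoding.
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}

$comment = ensure_utf8($_REQUEST['user_comment'] ?? null);
```

Using a fixed fallback instead of mb_detect_encoding() trades flexibility for predictability: the result is always valid UTF-8, even though it may not be the text the client intended.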
If you're using HTML5 then you can ignore this last point. You want all data sent to you by browsers to be in UTF-8. The only reliable way to do this is to add the accept-charset attribute to all of your <form> tags, like so:
<form action="somepage.php" accept-charset="UTF-8">