Ancyker
1 year ago
str_contains() doesn't always produce the expected results if the needle isn't UTF-8 but you are searching for it in a UTF-8 string. This won't be a concern for most people, but if you are mixing old and new data, especially data read from files, it could be an issue.

Here's a "mb_*"-esque function that searches the string:

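<?php
function mb_str_contains(string $haystack, string $needle, ?string $encoding = null): bool {
    // An empty needle always matches, mirroring str_contains().
    if ($needle === '') {
        return true;
    }
    // Fall back to the internal encoding when none is given.
    return mb_substr_count($haystack, $needle, $encoding ?: mb_internal_encoding()) > 0;
}
?>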

I used mb_substr_count() instead of mb_strpos() because mb_strpos() will still match partial characters, as it compares the raw bytes rather than whole characters.
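As a small illustration of what a "partial character" is, the sketch below takes the three UTF-8 bytes of ー and cuts off the last one. The variable names are only for illustration. It uses str_contains() to show the byte-wise behaviour, and makes no claim about how mb_strpos() behaves on any particular PHP version.

```php
<?php
// ー (U+30FC) is three bytes in UTF-8: e3 83 bc.
$char = hex2bin('e383bc');

// The first two bytes alone are not a complete UTF-8 sequence...
$partial = substr($char, 0, 2);
$partialIsValidUtf8 = mb_check_encoding($partial, 'UTF-8');

// ...but a byte-oriented search such as str_contains() still finds them.
$byteMatch = str_contains($char, $partial);

var_dump(bin2hex($partial), $partialIsValidUtf8, $byteMatch);
```

mb_check_encoding() reports the two-byte fragment as invalid UTF-8, while str_contains() matches it anyway because it only compares bytes.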

We can compare str_contains to the above suggested function:

<?php
// Some Unicode Kanji (漢字はユニコード)
$string = hex2bin('e6bca2e5ad97e381afe383a6e3838be382b3e383bce38389');

// Some Windows-1252 characters (ãƒ)
$contains = hex2bin('e383');
// ^ file_get_contents() produces the same data when it is saved as "ANSI" in Notepad on Windows, so this is not that unrealistic. The only reason to use hex2bin here is to mix character sets without having to use multiple files.

// A character that actually exists in our string. (ー)
$contains2 = hex2bin('e383bc');

echo " = Haystack: ".var_export($string, true)."\r\n";
echo " = Needles:\r\n";
echo " + Windows-1252 characters\r\n";
echo " - Results:\r\n";
echo " > str_contains: ".var_export(str_contains($string, $contains), true)."\r\n";
echo " > mb_str_contains: ".var_export(mb_str_contains($string, $contains), true)."\r\n";
echo " + Valid UTF-8 character\r\n";
echo " - Results:\r\n";
echo " > str_contains: ".var_export(str_contains($string, $contains2), true)."\r\n";
echo " > mb_str_contains: ".var_export(mb_str_contains($string, $contains2), true)."\r\n";
echo "\r\n";
?>

Output:

= Haystack: '漢字はユニコード'
= Needles:
+ Windows-1252 characters
- Results:
> str_contains: true
> mb_str_contains: false
+ Valid UTF-8 character
- Results:
> str_contains: true
> mb_str_contains: true

It's not completely foolproof, however. For instance, the Windows-1252 characters ãƒ‰ are stored as the bytes e3 83 89, which is exactly the UTF-8 encoding of ド, so that needle will still match ド in the above string. It's therefore still best to convert both parameters to the same encoding first. But if the character set isn't known or can't be detected, and you have no choice but to deal with dirty data, this is probably the simplest solution.
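When the needle's encoding is known, converting it first could look something like the sketch below. The helper name str_contains_converted() is hypothetical (it is not part of PHP), and the example assumes the needle really is Windows-1252 and the haystack is UTF-8.

```php
<?php
// Hypothetical helper, not a built-in: converts the needle to UTF-8
// from a known source encoding before doing the byte-wise search.
function str_contains_converted(string $haystack, string $needle, string $needleEncoding): bool
{
    $needleUtf8 = mb_convert_encoding($needle, 'UTF-8', $needleEncoding);
    return str_contains($haystack, $needleUtf8);
}

// 漢字はユニコード in UTF-8
$haystack = hex2bin('e6bca2e5ad97e381afe383a6e3838be382b3e383bce38389');

// "ãƒ‰" in Windows-1252; the same three bytes as ド in UTF-8
$needle = hex2bin('e38389');

$raw = str_contains($haystack, $needle);
$converted = str_contains_converted($haystack, $needle, 'Windows-1252');

var_dump($raw, $converted);
```

Without conversion the bytes match ド; after conversion the needle becomes the UTF-8 encoding of ãƒ‰, which does not occur in the haystack.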
