I have an interesting UTF-8 issue where I feel that the behavior of Juce has changed and is now incorrect.
It used to be that if CharPointer_UTF8::isValidString(p, size) returned false, you were guaranteed that String::fromUTF8(p, size) would fail, so whenever I convert strings (e.g. from files) I always check CharPointer_UTF8::isValidString() first.
I recently added a lot of translated text to my program, and much of it fails CharPointer_UTF8::isValidString(), yet the strings look like perfectly good UTF-8. And now, if I ignore the result of CharPointer_UTF8::isValidString(), I get a correct string!
What’s going on here?
[code]inline String str(const string& s) {
  const char* p = s.c_str();
  size_t size = s.size();
  bool valid = CharPointer_UTF8::isValidString(p, size);
  if (!valid) {
    LOG(ERROR) << "Badly encoded string |" << s << "| " << s.size();
    LOG(ERROR) << s[0] << ", " << s[1];
    valid = true;  // HACK - IGNORE THE FACT THAT THIS STRING IS BAD!
  }
  return valid ? String::fromUTF8(p, size) : "(badly encoded string)";
}[/code]
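When tracking down which bytes trip the check, a hex dump of the string tends to be more telling than logging s[0] and s[1] as raw chars. A minimal sketch using only the standard library (hexDump is a hypothetical helper name, not part of Juce):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders each byte of s as two uppercase hex digits,
// separated by spaces, so multi-byte UTF-8 sequences can be inspected.
std::string hexDump(const std::string& s)
{
    std::string out;
    char buf[4];

    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof(buf), "%02X", value);

        if (i != 0)
            out += ' ';
        out += buf;
    }

    return out;
}
```

The result could then be logged in place of the raw characters, e.g. LOG(ERROR) << hexDump(s).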
In juce_CharPointer_UTF8.h, looking at the definition of CharPointer_UTF8::isValidString(), I think ‘dataToTest’ needs to be advanced past the lead byte before the “extra values” (the continuation bytes) are checked for validity. In other words, change
[code]n &= mask;
while (--numExtraValues >= 0)
{[/code]
to
[code]n &= mask;
dataToTest++;
while (--numExtraValues >= 0)
{[/code]
That should make your simple test case work (it worked for me).
Also, the Wikipedia article on UTF-8 implies that about half of the 4-byte sequences are valid, so you might want to relax the previous check from
[code]if (bit <= 0x10)
    return false;[/code]
to
[code]if (bit <= 0x8)
    return false;[/code]
With that change, though, you may get false positives instead of false negatives. What you really need in the (bit == 0x10) case is a sanity check that the actual decoded Unicode character is <= 0x10FFFF, which I am currently too lazy to write the code for.
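For what it's worth, here is a rough sketch of what such a check could look like as standalone C++. It is not the Juce implementation: isValidUtf8 is a made-up name, and the overlong-encoding and surrogate checks are extras beyond what was discussed above. It consumes the lead byte before looking at the continuation bytes, and it decodes the code point so it can be compared against 0x10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical standalone validator (NOT the Juce code).
// Returns true if the first `size` bytes of `data` are well-formed UTF-8.
bool isValidUtf8(const char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    const unsigned char* const end = p + size;

    while (p < end)
    {
        // Consume the lead byte first, so the loop below only sees
        // continuation bytes.
        const unsigned char lead = *p++;

        if (lead < 0x80)
            continue; // plain ASCII

        int numExtraValues;
        std::uint32_t codePoint, minValue;

        if ((lead & 0xE0) == 0xC0)      { numExtraValues = 1; codePoint = lead & 0x1Fu; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { numExtraValues = 2; codePoint = lead & 0x0Fu; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { numExtraValues = 3; codePoint = lead & 0x07u; minValue = 0x10000; }
        else return false; // stray continuation byte, or lead byte 0xF8..0xFF

        if (end - p < numExtraValues)
            return false; // sequence is cut off by the end of the buffer

        while (--numExtraValues >= 0)
        {
            const unsigned char c = *p++;

            if ((c & 0xC0) != 0x80)
                return false; // not of the form 10xxxxxx

            codePoint = (codePoint << 6) | (c & 0x3Fu);
        }

        if (codePoint < minValue)
            return false; // overlong encoding
        if (codePoint > 0x10FFFF)
            return false; // beyond the Unicode range
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            return false; // UTF-16 surrogate, not a valid scalar value
    }

    return true;
}
```

Checking the decoded value rather than the lead-byte bits avoids the false positives mentioned above: lead byte 0xF4 is accepted only when the resulting character stays at or below 0x10FFFF.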
Hope this helps.
Wow, looking at that method again, it’s actually a real piece of crap - there are several stupid bugs in it apart from the ones you mentioned. Quite embarrassing really, I was obviously having a bad day when I wrote it! Thanks for the heads-up, I’ll knock it into shape today!