Strangeness with UTF8!

Hello, Jucers.

I have an interesting issue regarding UTF-8 where I feel that the behavior of Juce has changed, and is now incorrect.

It used to be that if CharPointer_UTF8::isValidString(p, size) returned false, you were guaranteed that String::fromUTF8(p, size) would fail - so whenever I construct Strings from external data (e.g. from files) I always check CharPointer_UTF8::isValidString() first.

I added a lot of translated text to my program, and a lot of it fails CharPointer_UTF8::isValidString() - yet the strings look like perfectly good UTF-8. And if I ignore CharPointer_UTF8::isValidString() and convert anyway, I get the correct string back!

What’s going on here?

[code]inline String str (const string& s)
{
    const char* p = s.c_str();
    size_t size = s.size();
    bool valid = CharPointer_UTF8::isValidString (p, size);

    if (! valid)
    {
        LOG (ERROR) << "Badly encoded string |" << s << "| " << s.size();
        LOG (ERROR) << s[0] << ", " << s[1];
        valid = true;  // HACK - IGNORE THE FACT THAT THIS STRING IS BAD!
    }

    return valid ? String::fromUTF8 (p, size) : "(badly encoded string)";
}[/code]

with result:

[code]I0819 23:36:21.582591 2956513280 Instance.cpp:267] registered
E0819 23:36:21.731360 2696910144 Juce.h:101] Badly encoded string |Öffnen Sie den letzten| 23
E0819 23:36:21.731408 2696910144 Juce.h:102] \303, \226[/code]
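For reference, \303 \226 is 0xC3 0x96, which is exactly the two-byte UTF-8 encoding of 'Ö' (U+00D6), so the data itself looks well-formed. Here's a quick standalone check I did by hand (nothing JUCE-specific, just decoding those two bytes):

[code]// Quick standalone sanity check: decode the two bytes from the log by hand
// to confirm they form a well-formed UTF-8 sequence.
#include <cstdio>

int main()
{
    const unsigned char lead = 0xC3, cont = 0x96;   // the \303, \226 from the log

    // 0xC3 is 110xxxxx (a two-byte lead), 0x96 is 10xxxxxx (a continuation),
    // so the code point is ((lead & 0x1F) << 6) | (cont & 0x3F).
    if ((lead & 0xE0) == 0xC0 && (cont & 0xC0) == 0x80)
        std::printf ("U+%04X\n", ((lead & 0x1Fu) << 6) | (cont & 0x3Fu));  // prints U+00D6, i.e. 'Ö'

    return 0;
}[/code]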

In juce_CharPointer_UTF8.h, looking at the definition of CharPointer_UTF8::isValidString(), I think 'dataToTest' needs to be bumped before the "extra values" are checked for validity. In other words, change

[code]n &= mask;

while (--numExtraValues >= 0)
{[/code]

to

[code]n &= mask;
dataToTest++;

while (--numExtraValues >= 0)
{[/code]

That should make your simple test case work (it worked for me).
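For anyone following along, here's a rough standalone sketch of the same idea - it is not the JUCE implementation, just my own illustration of why that pointer bump matters. Without stepping past the lead byte, the loop re-reads the lead byte as if it were a continuation byte (which can never match the 10xxxxxx pattern), so perfectly valid sequences like \303 \226 get rejected:

[code]// Rough standalone sketch (not the actual JUCE code): validate a UTF-8 buffer
// by classifying each lead byte, then checking its continuation bytes.
// Overlong/surrogate checks are deliberately omitted to keep it short.
#include <cstddef>

static bool isValidUTF8 (const char* data, size_t maxBytes)
{
    const unsigned char* p   = reinterpret_cast<const unsigned char*> (data);
    const unsigned char* end = p + maxBytes;

    while (p < end && *p != 0)
    {
        const unsigned char byte = *p;

        if (byte < 0x80)                        // plain ASCII
        {
            ++p;
            continue;
        }

        int numExtraValues = 0;

        if      ((byte & 0xE0) == 0xC0) numExtraValues = 1;   // 2-byte sequence
        else if ((byte & 0xF0) == 0xE0) numExtraValues = 2;   // 3-byte sequence
        else if ((byte & 0xF8) == 0xF0) numExtraValues = 3;   // 4-byte sequence
        else
            return false;                       // stray continuation or invalid lead byte

        ++p;                                    // the crucial step: move past the lead byte

        while (--numExtraValues >= 0)
        {
            if (p >= end || (*p & 0xC0) != 0x80)   // every extra byte must be 10xxxxxx
                return false;

            ++p;
        }
    }

    return true;
}[/code]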

Also, the Wikipedia article on UTF-8 implies that about half of the 4-byte sequences are valid, so you might want to relax the previous check

[code]if (bit <= 0x10) return false;[/code]

to

[code]if (bit <= 0x8) return false;[/code]

Although then you may get false positives instead of false negatives. What you really need in the case of (bit == 0x10) is to sanity-check that the actual decoded Unicode code point is <= 0x10FFFF, which I'm too lazy to turn into a proper patch, but a rough sketch of the idea is below.
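Something along these lines might do it (an untested sketch of my own, not lifted from the JUCE sources; the helper name and the assumption that the lead and continuation bytes have already been checked are mine):

[code]// Sketch of the missing range check for a 4-byte sequence: decode the code
// point and reject anything above U+10FFFF (and overlong encodings below U+10000).
static bool isValidFourByteSequence (const unsigned char* p)
{
    // Assumes the caller already verified p[0] is 11110xxx and that
    // p[1..3] are continuation bytes of the form 10xxxxxx.
    const unsigned int codepoint = ((p[0] & 0x07u) << 18)
                                 | ((p[1] & 0x3Fu) << 12)
                                 | ((p[2] & 0x3Fu) << 6)
                                 |  (p[3] & 0x3Fu);

    return codepoint >= 0x10000 && codepoint <= 0x10FFFF;
}[/code]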
Hope this helps.

Wow, looking at that method again, it’s actually a real piece of crap - there are several stupid bugs in it apart from the ones you mentioned. Quite embarrassing really, I was obviously having a bad day when I wrote it! Thanks for the heads-up, I’ll kick some shape into it today!

Well done, buck!

And don’t get too down, Jules - every diamond codebase has a little dreck somewhere in its interstices…