Tracking down a UTF8 issue


#1

Hi,

I have a problem with non-ASCII text, and I'm unsure where the problem arises because I don't know how to tell VS2012 to display my variables correctly when debugging. That is, I can't see whether a String contains the proper byte sequence of UTF8 chars or not.

The docs say that String is encoding the text as UTF8 by default and I haven't changed that.

Let's hope the forum can display UTF8 or this post will become really confusing!

 

For example, this line of code: 

String c8 = CharPointer_UTF8("järnmalm");

seems to display the characters correctly in the VS2012 debugger, even though it says it's a char* as "Type" (in the "Locals" or "Auto" panes).

However, when I use the File::findChildFiles to enumerate files, I get files on occasion that can't be represented as pure ASCII, for example a file called järnmalm.txt. When using File::getFullPathName on such a file, VS2012 displays the text in an odd way. It would seem that Juce as a whole is still able to work properly on the file represented by that string filename though. Any thoughts on that?

But it gets screwed up when I use URL::addEscapeChars on the filename string. The result is mangled beyond recognition and is quite certainly wrong.


#2

Trying to figure out if a URL can be specified in UTF8 in the first place but apparently that is not possible?


#3

.


#4

Hi Mike, you are violating an invariant of the String class when you say this:

String c8 = CharPointer_UTF8("järnmalm");

See what happens when you do something like this:

String c8 = CharPointer_UTF8("järnmalm");
AlertWindow::showMessageBox(AlertWindow::InfoIcon, "", c8); // XXX this might not end well!!!

That can cause problems! On my system (Win7), the call to AlertWindow leads to a crash! What you actually want to say is this:

String c8 = CharPointer_UTF8 ("j\xc3\xa4rnmalm");

This was created by using the utility found under Introjucer->Tools->UTF-8 String-Literal Helper.

When you say CharPointer_UTF8(...), the contents are of type CharPointer_UTF8::CharType*, which equates to char*. This is why Visual Studio can display it "correctly" (because it's not UTF-8). However, the contract for this function is that the underlying byte array corresponds to a UTF-8 encoded stream of characters. The other characters (j,r,n,m,a,l) can all be represented with a single byte, just like ASCII. However, the ä takes two bytes in UTF-8.

I believe that Visual Studio has a difficult time displaying UTF-8 in debug mode, so with my correction above, you'll probably see this in the debugger:

text = {data=0x032a0998 "järnmalm" }

You can use a hex editor to verify that data is correct:

Hex Editor Display

Note that this is the same byte sequence that Introjucer's String Literal Helper created. So even though VS has a problem with the display, the data is correct internally.
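If no hex editor is at hand, the same check can be done in code. This is only a minimal sketch in plain C++: toHexDump is a hypothetical helper, not part of JUCE, and it assumes you have already copied the String's bytes into a std::string (for example from s.toRawUTF8() and s.getNumBytesAsUTF8()).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a JUCE function): renders every byte of a buffer
// as two uppercase hex digits, separated by single spaces.
std::string toHexDump (const std::string& bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;

    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned char b = static_cast<unsigned char> (bytes[i]);

        if (i > 0)
            out += ' ';

        out += digits[b >> 4];
        out += digits[b & 15];
    }

    return out;
}
```

For the bytes "j\xc3\xa4rnmalm" this gives 6A C3 A4 72 6E 6D 61 6C 6D, so the ä shows up as the two-byte sequence C3 A4 regardless of how the debugger chooses to draw it.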

Rendering a String to a JUCE GUI Component is a good way to spot check your strings. You can use AlertWindow::showMessageBox() to display the string during testing.

Note that when you say String c8 = CharPointer_UTF8(...), you are copying data directly (no re-interpretation of encodings) to the String's internal memory--it must be proper UTF-8.
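To make "proper UTF-8" concrete, here is a rough sketch of a well-formedness check in plain C++. isValidUTF8 is a hypothetical helper written for this post, not something JUCE provides under that name. It checks lead bytes, continuation bytes, overlong encodings, surrogates and the 0x10FFFF upper limit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a JUCE function): returns true if the byte
// sequence is well-formed UTF-8.
bool isValidUTF8 (const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;

    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char> (bytes[i]);

        if (lead < 0x80) { ++i; continue; }   // plain ASCII, one byte

        std::size_t extra = 0;                // number of continuation bytes
        unsigned long cp = 0;                 // decoded code point
        unsigned long minValue = 0;           // smallest value allowed for this length

        if      ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minValue = 0x10000; }
        else return false;                    // stray continuation or invalid lead byte

        if (n - i <= extra)
            return false;                     // sequence is cut off at the end

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char> (bytes[i + k]);

            if ((c & 0xC0) != 0x80)
                return false;                 // not a continuation byte (10xxxxxx)

            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                     // overlong, out of range, or surrogate

        i += extra + 1;
    }

    return true;
}
```

With this, "j\xc3\xa4rnmalm" passes, while "j\xe4rnmalm" fails. The second form is what a literal "järnmalm" would turn into if the source file happened to be saved in a single-byte code page such as Windows-1252 (an assumption about the setup; I haven't checked how the original file was saved).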


#5

I use UTF-8 encoded Strings (the default in JUCE) for URLs all the time:

String entry = CharPointer_UTF8("K\xc3\xa4se");   // Käse (German: cheese)
String wiki = "http://de.wikipedia.org/wiki/" + entry;

URL url(wiki);    
String rawData = url.readEntireTextStream();

I personally don't use URL::addEscapeChars(), but it seems like for it to work the way I would expect, the above url should be encoded like this:

http://de.wikipedia.org/wiki/K%C3%A4se

You can paste this into a browser and read all about German cheese ;)

For a reason I don't understand either, URL::addEscapeChars won't format it like I've shown above. Looking at the source, forward slashes / and colons : are not in the list of legal characters, so they get converted into percent %xx sequences as well:

String URL::addEscapeChars (const String& s, const bool isParameter)
{
    const CharPointer_UTF8 legalChars (isParameter ? "_-.*!'()"
                                                   : ",$_-.*!'()");

    Array<char> utf8 (s.toRawUTF8(), (int) s.getNumBytesAsUTF8());

    for (int i = 0; i < utf8.size(); ++i)
    {
        const char c = utf8.getUnchecked(i);

        if (! (CharacterFunctions::isLetterOrDigit (c)
                 || legalChars.indexOf ((juce_wchar) c) >= 0))
        {
            utf8.set (i, '%');
            utf8.insert (++i, "0123456789abcdef" [((uint8) c) >> 4]);
            utf8.insert (++i, "0123456789abcdef" [c & 15]);
        }
    }

    return String::fromUTF8 (utf8.getRawDataPointer(), utf8.size());
}

I don't understand the API design on this one. Maybe it's a bug, or maybe we're both trying to use it in a way that it wasn't designed! Jules might not have the time to address this, but it would take about 5 mins to modify what he has above and work out your own conversion function.
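As a starting point for such a conversion function, here is a minimal sketch in plain C++ rather than JUCE. percentEncodeUTF8 and its extraLegalChars parameter are made-up names for illustration; this is not how JUCE does it. It assumes the input is already valid UTF-8, keeps the RFC 3986 unreserved characters (letters, digits, - . _ ~) plus whatever you pass in as extra legal characters, and escapes every other byte. Each byte goes through unsigned char first, so bytes above 0x7F are never treated as negative values.

```cpp
#include <string>

// Hypothetical helper (not the JUCE implementation): percent-encodes the
// bytes of a UTF-8 string. Characters in extraLegalChars are passed through
// unescaped in addition to the RFC 3986 unreserved set.
std::string percentEncodeUTF8 (const std::string& utf8,
                               const std::string& extraLegalChars = "")
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;

    for (char ch : utf8)
    {
        const unsigned char b = static_cast<unsigned char> (ch);

        const bool isUnreserved = (b >= 'a' && b <= 'z')
                               || (b >= 'A' && b <= 'Z')
                               || (b >= '0' && b <= '9')
                               || b == '-' || b == '.' || b == '_' || b == '~';

        if (isUnreserved || extraLegalChars.find (ch) != std::string::npos)
        {
            out += ch;
        }
        else
        {
            out += '%';
            out += digits[b >> 4];
            out += digits[b & 15];
        }
    }

    return out;
}
```

Calling percentEncodeUTF8 ("http://de.wikipedia.org/wiki/K\xc3\xa4se", ":/") yields http://de.wikipedia.org/wiki/K%C3%A4se, and percentEncodeUTF8 ("K\xc3\xa4se") on its own yields K%C3%A4se. Passing ":/" is only sensible for a whole URL; for a single path segment or parameter value you would leave it out so those characters get escaped too.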

See my post below for more details about UTF-8.


#6

Thanks matty for a very informative answer. BTW, I would normally never dare to do this:

String c8 = CharPointer_UTF8("järnmalm");

except for testing. I think the test gave me nothing but confusion.

In normal cases, I really do get the C3A4 sequence so perhaps the representation is correct after all. Thank you for that.


#7

I don't understand the API design on this one. Maybe it's a bug, or maybe we're both trying to use it in a way that it wasn't designed! Jules might not have the time to address this, but it would take about 5 mins to modify what he has above and work out your own conversion function.

I have now abandoned feeding it UTF8 text, for two reasons:

* Instead of the %00%11 notation I sometimes (!?) get lots of garbled nonsense.

* It would seem (someone correct me if I'm wrong) that a URL can't be UTF8 in the first place. I'm really uncertain here.


#8

The reason I started digging into this is that one Chinese beta tester also gets nonsense in the GUI. I enumerate filenames and convert the filenames to strings, which I add to a combo box. When I change my own Windows 7 to Chinese I get the properly displayed text on screen, but this guy doesn't. It also seems to happen on Mac, but for some reason not in every case. Very confusing. If only software problems were universally reproducible!