String, setlocale, mbstowcs, wcstombs and the filesystem


#1

The documentation says that the String class stores its data internally either as Unicode (wchar_t, which is UCS-2 or UCS-4 depending on the current OS) or as ASCII. This is controlled by the JUCE_STRINGS_ARE_UNICODE define. If the other representation is needed, it is converted from the internal one via mbstowcs or wcstombs. But these functions don’t convert between Unicode and ASCII; they convert between Unicode and the encoding of the current locale, which by default is the C locale. Only in that default case is it right to say that mbstowcs and wcstombs convert between Unicode and ASCII, because the C locale uses ASCII.
On Windows I didn’t have any problems with this, because JUCE and the filesystem part of the WinAPI communicate via Unicode. But on Linux JUCE uses these functions to produce strings, e.g. for fopen. That works until a path contains non-ASCII characters such as ü, ö, ä or ß. Then JUCE fails to handle the path correctly. It’s not JUCE’s fault after all. It took me several hours to figure out that I need to set the locale via
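To make the locale dependence concrete, here is a minimal sketch of a conversion built on wcstombs. The helper name wideToLocale is hypothetical and not part of JUCE; it only illustrates that the result depends on the current LC_CTYPE locale, not on ASCII as such.

```cpp
#include <cstdlib>   // std::wcstombs, MB_CUR_MAX
#include <string>

// Hypothetical helper (not JUCE code): converts a wide string to the
// multibyte encoding of the *current* LC_CTYPE locale.
// Returns false if a character cannot be represented in that encoding.
bool wideToLocale(const std::wstring& wide, std::string& out)
{
    // MB_CUR_MAX is the maximum number of bytes per character in the
    // current locale, so this buffer is always large enough.
    std::string buffer(wide.size() * MB_CUR_MAX + 1, '\0');

    const std::size_t written =
        std::wcstombs(&buffer[0], wide.c_str(), buffer.size());

    if (written == static_cast<std::size_t>(-1))
        return false;   // unrepresentable character for this locale

    buffer.resize(written);   // drop the unused tail and the terminator
    out = buffer;
    return true;
}
```

A program that never calls setlocale runs in the C locale, so a plain ASCII string such as L"abc" converts fine, while a string containing ü is typically rejected (the exact behaviour depends on the C library).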

    #include <locale.h>
    ...
    setlocale(LC_CTYPE, "");

to the current system locale. On my Ubuntu box the system locale uses UTF-8 as its encoding, so fopen expects a path encoded in UTF-8 rather than in ASCII. With the locale set to the system locale, wcstombs converts Unicode to UTF-8, and fopen and the related functions are happy.
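Put together, the workaround could look roughly like the sketch below. Both function names are hypothetical, and this is not how JUCE itself opens files. It only shows the order of operations: set the locale once at startup, then convert the wide path with wcstombs before handing it to fopen.

```cpp
#include <clocale>   // std::setlocale, LC_CTYPE
#include <cstdio>    // std::fopen, std::FILE
#include <cstdlib>   // std::wcstombs, MB_CUR_MAX
#include <string>

// Call once at startup. "" selects the locale from the environment
// (LC_ALL, LC_CTYPE, LANG). Returns false if that locale is unavailable.
bool useSystemCtypeLocale()
{
    return std::setlocale(LC_CTYPE, "") != nullptr;
}

// Hypothetical helper: open a file whose path is held as a wide string.
// Returns nullptr if the path cannot be encoded in the current locale
// or if fopen itself fails.
std::FILE* openWidePath(const std::wstring& path, const char* mode)
{
    std::string bytes(path.size() * MB_CUR_MAX + 1, '\0');

    const std::size_t written =
        std::wcstombs(&bytes[0], path.c_str(), bytes.size());

    if (written == static_cast<std::size_t>(-1))
        return nullptr;   // not representable in the current locale

    bytes.resize(written);
    return std::fopen(bytes.c_str(), mode);
}
```

With a UTF-8 system locale, calling useSystemCtypeLocale() first and then openWidePath(L"Grüße.txt", "r") would pass a UTF-8 encoded path to fopen. Without the setlocale call, the conversion typically fails on the ü before fopen is even reached.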

This leads to two suggestions I would make:

  1. The documentation and the code should not refer to strings in locale encoding as ASCII.
  2. The documentation should give a hint to set the locale to the system locale via setlocale(LC_CTYPE, "") if problems with paths or anything related occur on Linux.

#2

Thanks for digging that up for me - I’d missed it on Linux, which I tend not to use very heavily.

Ideally it’d be better for JUCE to set the locale to UTF-8 and always use that for filenames (all filenames on the Mac are done as UTF-8). I’ll brush up on my knowledge of locales and see what I can sort out.

(And you’re quite right in saying I shouldn’t say “ASCII” - that’s a throwback to old code and I’ll do a search and tidy it up…)
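One way to read the "always use UTF-8 for filenames" idea is to encode the wide string by hand, so the result no longer depends on the process locale at all. The sketch below is only an illustration of that approach, not what JUCE does. The function name toUtf8 is hypothetical, and it assumes each wchar_t holds a whole code point, as with the 32-bit wchar_t on Linux (it does not combine UTF-16 surrogate pairs).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode a wide string as UTF-8 without consulting
// the locale. Assumes one wchar_t per Unicode code point.
std::string toUtf8(const std::wstring& wide)
{
    std::string out;

    for (wchar_t wc : wide)
    {
        const std::uint32_t c = static_cast<std::uint32_t>(wc);

        if (c < 0x80)                       // 1 byte: 0xxxxxxx
        {
            out += static_cast<char>(c);
        }
        else if (c < 0x800)                 // 2 bytes: 110xxxxx 10xxxxxx
        {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)               // 3 bytes
        {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
        else                                // 4 bytes
        {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }

    return out;
}
```

The trade-off is that the filesystem is then assumed to expect UTF-8 regardless of the user's locale, which holds on most current Linux systems but is not guaranteed.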


#3