Conversion to multi-byte character strings is not correct


#1

When using a real multi-byte character set, the conversion from wide character strings to multi-byte character strings is not correctly implemented, e.g. in the “String::operator const char *”.
This is because the conversion is based on the assumption that the length in bytes of the multi-byte string will be the same as the string length of the wide character string. This does not hold, when a non-ascii character is really represented as more than one byte. The resulting multi-byte string is truncated and too short.
I have that situation especially in Linux (Ubuntu), where I want to use german umlauts (you have to call setlocale(LC_CTYPE, ""); at program startup to be able to use umlauts etc.), because the character set in the locale is a UTF-8 character set. Which meens, that umlauts are represented as 2 bytes. And I have to pass my string to other modules, which do not support wide characters, so I have to convert them.
But this would also be true, if you have a real multi-byte character set on other platforms.

To fix that, I added the following functions in the class “CharacterFunctions” to get the target length before the conversion:

    static int lengthConverted (const char* const s) throw();
    static int lengthConverted (const juce_wchar* const s) throw();
...
int CharacterFunctions::lengthConverted (const char* const s) throw()
{
    return (int) mbstowcs (0, s, 0);
}

int CharacterFunctions::lengthConverted (const juce_wchar* const s) throw()
{
    return (int) wcstombs (0, s, 0);
}

and I changed the following functions in the class “String”. To be symetrical, I also changed the conversion functions from multi-byte strings to wide character strings, although the only problem in that case might be that some unneeded memory is allocated.

String::String (const char* const t) throw()
{
    if (t != 0 && *t != 0)
    {
#if JUCE_STRINGS_ARE_UNICODE
        const int len = CharacterFunctions::lengthConverted (t);
        createInternal (len);

        CharacterFunctions::copy (text->text, t, len + 1);
#else
        const int len = CharacterFunctions::length (t);
        createInternal (len);

        memcpy (text->text, t, len + 1);
#endif
    }
    else
    {
        text = &emptyString;
        emptyString.refCount = safeEmptyStringRefCount;
    }
}

String::String (const juce_wchar* const t) throw()
{
    if (t != 0 && *t != 0)
    {
#if JUCE_STRINGS_ARE_UNICODE
        const int len = CharacterFunctions::length (t);
        createInternal (len);

        memcpy (text->text, t, (len + 1) * sizeof (tchar));
#else
        const int len = CharacterFunctions::lengthConverted (t);
        createInternal (len);

        CharacterFunctions::copy (text->text, t, len + 1);
#endif
    }
    else
    {
        text = &emptyString;
        emptyString.refCount = safeEmptyStringRefCount;
    }
}

....

#if JUCE_STRINGS_ARE_UNICODE
String::operator const char*() const throw()
{
    if (isEmpty())
    {
        return (const char*) emptyCharString;
    }
    else
    {
        String* const mutableThis = const_cast <String*> (this);

        mutableThis->dupeInternalIfMultiplyReferenced();
        int len = CharacterFunctions::length (text->text) + 1;
        int lenConverted = CharacterFunctions::lengthConverted (text->text) + 1;
        mutableThis->text = (InternalRefCountedStringHolder*)
                                juce_realloc (text, sizeof (InternalRefCountedStringHolder)
                                                      + (len * sizeof (juce_wchar) + lenConverted));
        char* otherCopy = (char*) (text->text + len);
        --lenConverted;

        CharacterFunctions::copy (otherCopy, text->text, lenConverted);
        otherCopy [lenConverted] = 0;
        return otherCopy;
    }
}

#else

String::operator const juce_wchar*() const throw()
{
    if (isEmpty())
    {
        return (const juce_wchar*) emptyCharString;
    }
    else
    {
        String* const mutableThis = const_cast <String*> (this);

        mutableThis->dupeInternalIfMultiplyReferenced();
        int len = CharacterFunctions::length (text->text) + 1;
        int lenConverted = CharacterFunctions::lengthConverted (text->text) + 1;
        mutableThis->text = (InternalRefCountedStringHolder*)
                                juce_realloc (text, sizeof (InternalRefCountedStringHolder)
                                                  + (lenConverted * sizeof (juce_wchar) + len));

        juce_wchar* otherCopy = (juce_wchar*) (text->text + len);
        --lenConverted;

        CharacterFunctions::copy (otherCopy, text->text, lenConverted);
        otherCopy [lenConverted] = 0;
        return otherCopy;
    }
}

#endif

void String::copyToBuffer (char* const destBuffer,
                           const int maxCharsToCopy) const throw()
{
#if JUCE_STRINGS_ARE_UNICODE
    const int len = jmin (maxCharsToCopy, CharacterFunctions::lengthConverted(text->text));
    CharacterFunctions::copy (destBuffer, text->text, len);
#else
    const int len = jmin (maxCharsToCopy, length());
    memcpy (destBuffer, text->text, len * sizeof (tchar));
#endif

    destBuffer [len] = 0;
}

void String::copyToBuffer (juce_wchar* const destBuffer,
                           const int maxCharsToCopy) const throw()
{
#if JUCE_STRINGS_ARE_UNICODE
    const int len = jmin (maxCharsToCopy, length());
    memcpy (destBuffer, text->text, len * sizeof (juce_wchar));
#else
    const int len = jmin (maxCharsToCopy, CharacterFunctions::lengthConverted(text->text));
    CharacterFunctions::copy (destBuffer, text->text, len);
#endif

    destBuffer [len] = 0;
}

Also the constructors String::String (const char* const t, const int maxChars) throw(); String::String (const juce_wchar* const t, const int maxChars) throw();
need to be changed. I did not do that yet, because especially for the first one has to be very carefull what maxChars really means. Does it mean the number of bytes? In the second case it should be straight forward.

Regards,
Andreas


#2

What about static functions fromUTF8() / toUTF8() ?
Juce’s strings are UCS-2 encoded (unless you set something strange with your preprocessor macros).

Everytime you pass a char* in, Juce can’t deduce if it’s UTF8 or plain ASCII 8 bits. That’s why you have the fromUTF8() static constructor for such things.

I use them a lot (for very different languages, like Chinese, Hindi, and Persian), and they work very well.


#3

I known the functions fromUTF8()/toUTF8(). They work very well, but that is not the topic.

Juce’s strings are UCS-2 encoded, this is very good. Everytime I pass a “char*” in, or retrieve a “char*” from a Juce string, Juce converts to or from the current locale, since it uses the locale dependent functions “mbstowcs” und “wcstombs”. Which is also very good, because that is exactly what I need.

I am writing applications for german and for multiple platforms. I have to use the german locale, otherwise string functions as “toUpperCase” won’t work correctly for the umlauts. And this locale is on Windows the single byte code page 1252 (identical to ISO-8859-1 for the important things) and on Linux (Ubuntu) this is e.g. “de_AT.UTF-8”, which means 2 bytes for a single umlaut. And this case is not handled correctly by Juce’s string functions. That’s my point.

Do you expect an application to know which platform it is running on, and if the current locale is an UTF-8 locale and codes around every conversion from or to “char*”? I expect the framework to handle that, and I think that my post solves that.

Andreas


#4

Thanks andreas - this is really useful stuff, and I’ll get onto it asap!


#5

Ok, I’ve checked in a fix for this now. I’ve not done anything to the way multi-bytes are converted to unicode because I think it’s better to let that waste a couple of bytes rather than slow down the process by measuring the length twice. Let me know if there’s any problem with the changes.


#6

Ok, it works as I need it. But I want to add the following remarks:

  • If you or somebody else looks at this code some time in the future, not having in mind the reason for the change now, he/she might wonder why these similar cases are not handled in a similar way. And then might change it again in the wrong direction. So it should be noted somewhere what is the reason for doing it this way.
  • The argument, that a waste of a couple of bytes is better than the extra length computation is certainly true for a language like german where rather few characters are not ascii characters. It might not be so clear, if you consider languages that consist mostly or only of non-ascii characters.

#7

TBH for the future, I’d like to refactor the string class to make it more immutable, and possibly store the data internally in a range of formats rather than just unicode, so problems like this will disappear.