BR: String::toUpperCase is wrong

With the latest Apple Clang/Xcode – test code:

const String source (CharPointer_UTF8 ("\xc3\x8b, \xc3\xab"));  // "Ë, ë"
const auto upperCase = source.toUpperCase();
jassert (upperCase == CharPointer_UTF8 ("\xc3\x8b, \xc3\x8b")); // Expected: "Ë, Ë"
jassert (upperCase.getCharPointer().isUpperCase());             // Expected "true"

Presumably this has to do with CharacterFunctions::toUpperCase and subsequently towupper… but couldn’t say how or why.


Edit: it seems that towupper isn’t really going to do anything expected as it’s predicated around the C++ locale stuff and is implemented defined/has no standard – which is generally pretty bogus.

Strangely, towupper works for me on Godbolt with GCC and Clang, and doesn’t with MSVC. Fun stuff. Some test code:

#include <iostream>
#include <cwctype>
#include <clocale>
 
int main()
{
    wchar_t c = L'\u00eb';
 
    std::cout << std::hex << std::showbase;
    std::cout << "in the default locale, towupper(" << (std::wint_t)c << ") = "
              << std::towupper(c) << '\n';
    std::setlocale(LC_ALL, "en_US.utf8");
    std::cout << "in Unicode locale, towupper(" << (std::wint_t)c << ") = "
              << std::towupper(c) << '\n';
}

Can someone look at this possible and platform agnostic solution to JUCE?

1 Like

+1 this looks like it was a tedious PR, and I have had issues with string conversions over API boundaries, this PR would really help out a lot with making an audio device selection API I’ve written a lot better the next time it gets used in a project.

It really wasn’t. :slight_smile: Google Sheets to tidy up the source data + copy/paste from there + multiline selection in whatever IDE I used at the time to format.

1 Like

I see a potential problem with that implementation:

toLowerCase() becomes a linear search, thus it always has complexity O(n), while toUpperCase() only has the complexity of a search in the unordered_map, which is:

Constant on average, worst case linear in the size of the container.

I think this asymmetry between toLowerCase() and toUpperCase() is unexpected for most of users, which would be unaware that one has better performance than the other.

The point above also affects isUpperCase(), which in the non-Windows case delegates the check to toLowerCase()

Totally fair to bring up, and fwiw I’m not particular pleased by the approach for that reason. I can’t give this deep thought right this moment, but so far I think my options are to change the data structure to something else (but what…), use an array of pairs and apply linear search at all times (not great…), or have two copies of the data in a map of some form (one instance for upper to lower, and another for lower to upper).

I’d do it like this:

  1. one free function (getMapLowerToUpper) that returns the map that maps lower → upper (the one you already have) as a singleton

  2. one free function (invert) that takes a map and returns an “inverted” map, obtained flipping keys and values from the map passed as argument

  3. another free function (getMapUpperToLower) that returns the map that maps upper → lower, obtained at runtime and only the first time it is invoked, by inverting the map that maps lower → upper.

I changed the global variable you had with a free function returning a reference to it, to stay well away from the static initialization fiasco. Doing it this way guarantees that you only have to manually fill the data for one of the two maps, then both maps are initialized only once, and exactly when needed, not before and most importantly not after! :laughing:

(code below may not compile, I just wrote it as example of my idea. Bonus points for making invert() a function template that accepts whatever map type one throws at it)

const auto& getMapLowerToUpper ()
{
    static const std::unordered_map<juce_wchar, juce_wchar> map =
    {
        { 0x0061, 0x0041 },
        { 0x0062, 0x0042 },
        { 0x0063, 0x0043 },
        // lower -> upper
    };

    return map;
}

const std::unordered_map<juce_wchar, juce_wchar> invert (const std::unordered_map<juce_wchar, juce_wchar>& map)
{
    const std::unordered_map<juce_wchar, juce_wchar> invertedMap;
    for (const auto& pair : map)
        invertedMap [pair.second] = pair.first;

    return invertedMap;
}


const auto& getMapUpperToLower ()
{
    static const std::unordered_map<juce_wchar, juce_wchar> map = invert (getMapLowerToUpper());

    return map;
}

Caveat: the above only works if there are no distinct lowercase chars that map to the same uppercase char. But that’s also an issue in the implementation you proposed and I’m not sure if there are some corner cases like that

AFAIK or understand so far, these funcs were always intended to be 1:1 for conversions. I hope someone else can confirm/deny!

QString seems to go through an assortment of hoops and other things, but avoids the C style functions and STL’s locale (because they have a QLocale, because of course): qstring.cpp « tools « corelib « src - qt/qtbase.git - Qt Base (Core, Gui, Widgets, Network, ...)

To be fair, the goal for what I have up in PR is to get something slightly more consistent and useful than what’s there, and not the ultimate well rounded and all encompassing solution.

1 Like

If it were my code, I’d add a simple jassert in the “invert” function that checks whether the “key” it’s going to insert in the inverted map is already there. If that gets triggered, it means there are two lowercase chars in the initial map that go to the same uppercase char. At least you would be warned when that happens, so you can decide what to do.