How to compare UTF-8 strings?

I’m wondering how to deal with UTF-8 string comparison in JUCE in the most efficient way. Is the only solution to normalize strings using some external library like ICU or utf8proc and compare them after?

juce::String already does this work for you. By default, its internals work with UTF-8. This is also configurable: the String class can internally store data as either UTF-16 or UTF-32 if you need that (which is pretty rare!).

I should add that you can convert between UTF8/16/32 using the toUTFn functions.

e.g. juce::String::toUTF8()

But it would be nice to have a String compare that optionally matches Müller and Mueller…

Yeah, that would mean having language databases like what ICU employs, and break iterators to go along with it. It’s pretty complicated stuff!

On OSX there’s a String::convertToPrecomposedUnicode() method, but unfortunately there’s no practical way for us to embed cross-platform functionality to do that kind of operation. There are Unicode libraries that do it, but they’re enormous and complicated to build.

Enormous and complicated also applies to using and maintaining them. :slight_smile:

Take ICU for example; if you don’t want to include the entire data library and all of ICU’s features, it’s very difficult to make that work correctly on all systems.

  • You have to maintain data files containing subsets
  • You have to make sure that the data works with specific versions of ICU (due to poor backward and forward compatibility… speaking from experience here!)
  • You have to suffer the overhead of representing 2 separate data packages when dealing with LE and BE systems.

Lots can go wrong very easily… It’s unfortunate that the design and architecture for ICU is so poor.

I absolutely agree, it is a can of worms, and not really in the focus of an audio framework.

I get horrified just thinking about how many spellings I have seen for Tschaikowski, Tchaikovsky and so on… So we Germans, with 3 umlauts and one special ß, are the minor problem :wink:

Thanks for the answers!

I need to compare, for example, Müller and Myller, and right now String::compare does not give the expected result… I understand these strings should be normalized first and then compared… For example, Qt has the QString::normalized() method for this purpose. I’d like to avoid using ICU for the reasons already mentioned.
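To illustrate why normalization matters here, a minimal sketch in plain C++ (no JUCE, Qt or ICU involved): the same visible text can be encoded with a precomposed ü (U+00FC) or as u followed by a combining diaeresis (U+0308), and a byte-wise comparison treats the two as different strings. `composeDiaeresis` and `precomposedSecondByte` are hypothetical helpers made up for this example, and the six-entry table is an assumption that covers only the German umlauts. It is nowhere near full NFC normalization.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lookup: second UTF-8 byte of the precomposed letter
// (the first byte is always 0xC3 for these six), or 0 if the base
// letter is not in this tiny hand-written table.
unsigned char precomposedSecondByte (char base)
{
    switch (base)
    {
        case 'a': return 0xA4; // ä, U+00E4
        case 'o': return 0xB6; // ö, U+00F6
        case 'u': return 0xBC; // ü, U+00FC
        case 'A': return 0x84; // Ä, U+00C4
        case 'O': return 0x96; // Ö, U+00D6
        case 'U': return 0x9C; // Ü, U+00DC
        default:  return 0;
    }
}

// Hypothetical helper: rewrites <letter> + U+0308 (bytes CC 88) as the
// precomposed letter, for the letters in the table above only.
std::string composeDiaeresis (const std::string& utf8)
{
    std::string out;
    out.reserve (utf8.size());

    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const bool hasDiaeresis = i + 2 < utf8.size()
                               && static_cast<unsigned char> (utf8[i + 1]) == 0xCC
                               && static_cast<unsigned char> (utf8[i + 2]) == 0x88;

        const unsigned char second = hasDiaeresis ? precomposedSecondByte (utf8[i]) : 0;

        if (second != 0)
        {
            out += '\xC3';
            out += static_cast<char> (second);
            i += 2; // skip the two bytes of the combining mark
        }
        else
        {
            out += utf8[i];
        }
    }

    return out;
}
```

With this, `std::string ("Mu") + "\xCC\x88" + "ller"` and `std::string ("M") + "\xC3\xBC" + "ller"` compare unequal as raw bytes, but equal once both have gone through `composeDiaeresis`.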

Of course I am not suggesting here that JUCE should be extended with some heavyweight classes with all national subsets. I’d rather find a simple workaround; I don’t need to match ü with ue or ß with ss.
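One possible shape for such a workaround, sketched in plain C++ under some loud assumptions: the input is valid UTF-8, accented letters arrive precomposed, and only the Latin-1 Supplement block (U+00C0 to U+00FF) matters. `foldLatin1Accents` and `compareFolded` are hypothetical names, not JUCE or library functions. The idea is to build a comparison key in which accented letters are replaced by their base letters, then compare the keys instead of the raw bytes (in raw byte order, ü sorts after every ASCII letter, so Müller lands after Myller).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds a comparison key by replacing accented
// letters in U+00C0..U+00FF with their unaccented base letters.
std::string foldLatin1Accents (const std::string& utf8)
{
    // One entry per code point from U+00C0 to U+00FF.
    // '.' means "no base letter here, keep the original bytes"
    // (for example Æ, Ø, ß, × and ÷).
    static const char table[] = "AAAAAA.CEEEEIIII"
                                ".NOOOOO..UUUUY.."
                                "aaaaaa.ceeeeiiii"
                                ".nooooo..uuuuy.y";

    std::string key;
    key.reserve (utf8.size());

    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const auto lead = static_cast<unsigned char> (utf8[i]);

        // In UTF-8, U+00C0..U+00FF is the lead byte 0xC3 followed by
        // one continuation byte in the range 0x80..0xBF.
        if (lead == 0xC3 && i + 1 < utf8.size())
        {
            const auto trail = static_cast<unsigned char> (utf8[i + 1]);

            if (trail >= 0x80 && trail <= 0xBF && table[trail - 0x80] != '.')
            {
                key += table[trail - 0x80];
                ++i; // consume the continuation byte
                continue;
            }
        }

        key += utf8[i];
    }

    return key;
}

// Orders two UTF-8 strings by their folded keys:
// negative, zero or positive, like std::string::compare.
int compareFolded (const std::string& a, const std::string& b)
{
    return foldLatin1Accents (a).compare (foldLatin1Accents (b));
}
```

With this, Müller folds to Muller and therefore orders before Myller. The sketch does no case folding, ignores decomposed input (u followed by a combining mark) and knows nothing about language-specific rules, which is exactly the part the big libraries exist for.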

In fairness, JUCE wraps the iOS and macOS APIs, which do the job well enough (like Jules mentioned above - String::convertToPrecomposedUnicode()).

Its API could be extended to wrap the remaining native functions in a semi-consistent way:

I’m pretty sure in Linux’s case you would need to use ICU. Don’t quote me on that - research is coming up dry for POSIX APIs.

Definitely a can of worms that would involve many unit tests, imo.

Thanks, that’s a good starting point.

Unfortunately I need to have it on Linux in the first place (embedded devices). I’ll try to check some less heavy alternatives to ICU first.

If anybody is interested in comparing UTF-8 strings according to the Unicode standard, with working uppercase/lowercase conversion, the utf8proc library works as expected on all platforms and (at least for me) is a sufficient replacement for ICU. It adds only 300 KB to release builds.