How to compare UTF-8 strings?

MBO · May 21, 2019, 9:22am

I’m wondering how to deal with UTF-8 string comparison in JUCE in the most efficient way. Is the only solution to normalize strings using some external library like ICU or utf8proc and compare them after?

jrlanglois · May 21, 2019, 9:42am

juce::String already does this work for you. By default, its internals work with UTF-8. This is configurable, though: the String class can instead store its data internally as UTF-16 or UTF-32 if you need that (which is pretty rare!).

jrlanglois · May 21, 2019, 9:46am

I should add that you can convert between UTF8/16/32 using the toUTFn functions.

eg: juce::String::toUTF8()

daniel · May 21, 2019, 9:57am

But it would be nice to have a String compare that optionally matches Müller and Mueller…

jrlanglois · May 21, 2019, 10:06am

Yeah, that would mean having language databases like what ICU employs, and break iterators to go along with it. It’s pretty complicated stuff!
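For the narrow German case only, the Müller/Mueller matching can be sketched without any language database. This is a minimal sketch, not a JUCE API: foldGermanUmlauts is a hypothetical helper, it only recognises the precomposed UTF-8 forms of ä, ö, ü, Ä, Ö, Ü and ß, and the ue/ae/oe convention is wrong for other languages, which is exactly why a general solution needs locale data.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not part of JUCE): rewrites precomposed German
// umlauts and ß in a UTF-8 string to their ASCII transliterations.
std::string foldGermanUmlauts (const std::string& utf8)
{
    struct Fold { const char* from; const char* to; };

    // UTF-8 byte sequences for ä ö ü Ä Ö Ü ß and their replacements.
    static const Fold table[] = {
        { "\xC3\xA4", "ae" }, { "\xC3\xB6", "oe" }, { "\xC3\xBC", "ue" },
        { "\xC3\x84", "Ae" }, { "\xC3\x96", "Oe" }, { "\xC3\x9C", "Ue" },
        { "\xC3\x9F", "ss" }
    };

    std::string out;

    for (std::size_t i = 0; i < utf8.size();)
    {
        bool replaced = false;

        for (const auto& entry : table)
        {
            const std::size_t n = std::strlen (entry.from);

            // 0xC3 is always a lead byte, so a match cannot start
            // in the middle of another character.
            if (utf8.compare (i, n, entry.from) == 0)
            {
                out += entry.to;
                i += n;
                replaced = true;
                break;
            }
        }

        if (! replaced)
            out += utf8[i++];
    }

    return out;
}
```

Comparing foldGermanUmlauts (a) with foldGermanUmlauts (b) then treats Müller and Mueller as equal. Decomposed input (u followed by a combining diaeresis) passes through unchanged, so it would need composing first.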

jules · May 21, 2019, 10:07am

On OSX there’s a String::convertToPrecomposedUnicode() method, but unfortunately there’s no practical way for us to embed cross-platform functionality to do that kind of operation. There are unicode libraries that do it, but they’re enormous and complicated to build.

jrlanglois · May 21, 2019, 10:15am

Enormous and complicated also applies to using and maintaining them.

Take ICU for example; if you don’t want the entire data library and all of ICU’s features, it’s very difficult to make that work correctly on all systems.

You have to maintain data files containing subsets
You have to make sure that the data works with specific versions of ICU (due to poor backward and forward compatibility… speaking from experience here!)
You have to suffer the overhead of representing 2 separate data packages when dealing with LE and BE systems.

Lots can go wrong very easily… It’s unfortunate that the design and architecture for ICU is so poor.

daniel · May 21, 2019, 10:19am

I absolutely agree, it is a can of worms, and not really in the focus of an audio framework.

I get horrified just thinking about how many spellings I have seen for Tschaikowski, Tchaikovsky and so on… So we Germans, with 3 umlauts and one special ß, are the minor problem.

MBO · May 21, 2019, 11:26am

Thanks for the answers!

I need to compare, for example, Müller and Myller, and right now String::compare does not give the expected result… I understand these strings should be normalized first and then compared… In Qt, for example, there is the QString::normalized method for this purpose. I’d like to avoid using ICU for the reasons already mentioned.

Of course I am not suggesting here that JUCE should be extended with some heavyweight classes with all national subsets. I’d rather find a simple workaround; I don’t need to compare ü and ue or ß and ss.
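To illustrate why normalizing before comparing matters: the same visible ü can be stored as one precomposed code point (U+00FC) or as u followed by a combining diaeresis (U+0308), and a byte-wise compare sees two different strings. The sketch below is only an illustration under loud assumptions: decodeUtf8, composeUmlauts and equalAfterComposition are hypothetical helpers, the decoder assumes well-formed input, and the composition step covers six letters only, so it is not real NFC normalization. A proper library is needed for the general case.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. Assumes well-formed input: it does not
// validate continuation bytes, overlong forms or truncated sequences.
std::u32string decodeUtf8 (const std::string& utf8)
{
    std::u32string out;

    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = static_cast<unsigned char> (utf8[i]);
        char32_t cp = 0;
        std::size_t length = 1;

        if (lead < 0x80)      { cp = lead;        length = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; length = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; length = 3; }
        else                  { cp = lead & 0x07; length = 4; }

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < length && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char> (utf8[i + k]) & 0x3F);

        out.push_back (cp);
        i += length;
    }

    return out;
}

// Merges "base letter + U+0308" into the precomposed letter,
// for a, o, u, A, O, U only. Everything else is copied unchanged.
std::u32string composeUmlauts (const std::u32string& in)
{
    std::u32string out;

    for (char32_t cp : in)
    {
        if (cp == 0x0308 && ! out.empty())
        {
            switch (out.back())
            {
                case U'a': out.back() = 0x00E4; continue;
                case U'o': out.back() = 0x00F6; continue;
                case U'u': out.back() = 0x00FC; continue;
                case U'A': out.back() = 0x00C4; continue;
                case U'O': out.back() = 0x00D6; continue;
                case U'U': out.back() = 0x00DC; continue;
                default:   break;
            }
        }

        out.push_back (cp);
    }

    return out;
}

// Compares two UTF-8 strings after the limited composition step above.
bool equalAfterComposition (const std::string& a, const std::string& b)
{
    return composeUmlauts (decodeUtf8 (a)) == composeUmlauts (decodeUtf8 (b));
}
```

With this, the precomposed and decomposed spellings of Müller compare equal, while Müller and Muller stay different.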

jrlanglois · May 21, 2019, 12:56pm

In fairness, JUCE wraps the iOS and macOS APIs, which do the job well enough (like Jules mentioned above - String::convertToPrecomposedUnicode()).

Its API could be extended to wrap the remaining native functions in a semi-consistent way:

I’m pretty sure in Linux’s case you would need to use ICU. Don’t quote me on that - research is coming up dry for POSIX APIs.

Definitely a can of worms that would involve many unit tests, imo.

MBO · May 21, 2019, 1:33pm

Thanks, that’s a good starting point.

Unfortunately I need to have it on Linux in the first place (embedded devices). I’ll try to check some less heavy alternatives to ICU first.

MBO · November 1, 2019, 5:19pm

If anybody is interested in comparing UTF-8 strings according to the Unicode standard, with uppercase/lowercase operations that work correctly, the utf8proc library works as expected on all platforms and (at least for me) is a sufficient replacement for ICU. It adds only 300KB to release builds.
