String representation

Just to make sure I am understanding things right:

  • String stores its data in Unicode format

  • The character width (wchar_t) is 16 bits on Windows, 32 bits on Mac/Linux

  • The encoding is UTF-16 on Windows, UTF-32 on Mac/Linux

  • Calling String::toUTF8() will cause the String to increase in size, since it puts the converted buffer at the end (don’t worry, I can work around this by making a copy of the original String before converting it, then discarding it).

With UTF-32 there is a 1:1 mapping of code points to char values, but with UTF-16 two char values may be needed to encode a single code point. So how can any code that indexes directly into the string by position work for those code points?
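
To make the problem concrete (a hypothetical snippet, assuming a 16-bit wchar_t as on Windows):

// With a 16-bit wchar_t, a code point outside the Basic Multilingual
// Plane occupies two array slots, so indexing by position hands back
// half of a surrogate pair instead of a whole character.
const wchar_t text[] = L"\U00010400abc";   // U+10400 followed by "abc"

// text[0] == 0xD801 (high surrogate), text[1] == 0xDC00 (low surrogate);
// the 'a' is at index 2, not index 1, so position-based code breaks.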

That’s about right. I’m treating a wchar_t as being a 1:1 Unicode character, and if win32 wants to make a wchar_t too small to actually hold a Unicode character, I’m treating that as win32’s problem/fault. Haven’t had any problems with it yet, though. I’ve considered moving the string class to use UTF-8 internally - that’s something I may do in the future.

Funny that you mention it, because that is exactly what I was thinking. UTF-8 would have the same issue with indexing, though. My app has literally hundreds of thousands of strings, and keeping them in 32-bit wide chars is simply not an option! What I was going to do was perhaps make a copy of the juce::String, strip out everything except the storage, modify it to keep the data as UTF-8, keep all my data in this new UTF-8 string, and then, when I need to work with a regular Juce String (for example, drawing a label or a row in a listbox), convert the UTF-8 back to juce::String.
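
Something like this minimal sketch is what I have in mind (Utf8Record is a hypothetical name; it assumes String::toUTF8() yields a null-terminated char buffer and String::fromUTF8() accepts one - a cast may be needed depending on the exact signatures):

#include <string>

// Hypothetical wrapper: keeps only the UTF-8 bytes, converting to and
// from juce::String at the edges of the program.
class Utf8Record
{
public:
    explicit Utf8Record (const String& s)
        : utf8 (s.toUTF8())                       // snapshot the UTF-8 bytes
    {
    }

    String toJuceString() const                   // e.g. for drawing a label
    {
        return String::fromUTF8 (utf8.c_str());
    }

    size_t sizeInBytes() const    { return utf8.size(); }

private:
    std::string utf8;   // one byte per ASCII character, more only where needed
};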

Most definitely a non-trivial change: any code that indexes into strings has to be made aware of the encoding - for example Graphics::drawText(), the routines which justify text, and in particular the TextEditor, which I imagine would be a nightmare. I know this because I did something similar in my home-brew framework.

Yes, it’s just the indexing that puts me off using UTF-8 - it’d probably mean getting rid of operator[] and replacing it with an iterator, just to stop people wondering why seemingly innocent bits of code have suddenly become ridiculously slow.

Another option would be to use a technique like I did for the var class, to allow the string to be stored internally using different formats, so you could choose the best one for whatever you’re doing (e.g. use a fixed-width format to do some random-access operations, then switch to UTF-8 to conserve memory). But of course this would add a virtual method call to each operation and an extra pointer to each string.
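
Something along these lines, perhaps (a rough sketch with made-up names, just to show the shape of it - not actual juce code):

// Each internal format is a concrete subclass; every operation pays one
// virtual call, and every string carries one extra pointer.
struct StringStorage
{
    virtual ~StringStorage() {}
    virtual juce_wchar getCharAt (int index) const = 0;   // O(1) for UTF-32, O(n) for UTF-8
    virtual int length() const = 0;
    virtual StringStorage* toUTF8Storage() const = 0;     // compact, slow indexing
    virtual StringStorage* toUTF32Storage() const = 0;    // big, fast indexing
};

class FlexibleString
{
public:
    juce_wchar operator[] (int i) const   { return storage->getCharAt (i); }

    void makeCompact()        { changeStorage (storage->toUTF8Storage()); }   // before bulk storage
    void makeRandomAccess()   { changeStorage (storage->toUTF32Storage()); }  // before heavy editing

private:
    StringStorage* storage;   // the extra pointer mentioned above

    void changeStorage (StringStorage* newStorage)
    {
        delete storage;
        storage = newStorage;
    }
};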

For my use case it is only important for me to store, display, sort, and search for my strings. Since there are so many of them, UTF8 is the preferred way to store them. These strings get changed and manipulated rarely, and for those cases I am completely happy with converting them into UTF32 or UTF16, performing the operations, and then converting them back.

In fact if all of Juce used 32-bit wchar_t even on Windows, it would not bother me in the slightest as long as I had a Utf-8 version of String that was freely convertible.

I don’t get it. Why would operator[] be slower than an iterator?
Usually, in the other String classes with UTF-8 content (Bstr, old Qt, wxString), operator[] accesses the i-th byte, but they have a getCharacterAtIndex(i) method that does the UTF-8 length decoding (which is not that slow, since it’s only a one-bit-per-byte test, about as fast as the “< length” test). In 99% of string operations you don’t care about the characters themselves at all; since the UTF-8 encoding is exact (well, sort of), you can usually compare UTF-8 strings with a plain memcmp.
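
For reference, the length-decoding step looks something like this (a hypothetical helper, not code from any of those libraries; it assumes well-formed UTF-8 input):

#include <cstdint>

// Decodes the code point starting at p and advances p past it. The high
// bits of the first byte give the sequence length, so the scan costs
// only one extra test per byte.
static uint32_t decodeUtf8 (const uint8_t*& p)
{
    uint32_t c = *p++;

    if (c < 0x80)                                     // 1 byte: plain ASCII
        return c;

    int extra = (c < 0xe0) ? 1 : (c < 0xf0) ? 2 : 3;  // 2, 3 or 4 bytes in total
    c &= (0x3f >> extra);                             // keep the payload bits of the lead byte

    while (extra-- > 0)
        c = (c << 6) | (*p++ & 0x3f);                 // fold in each continuation byte

    return c;
}

getCharacterAtIndex(i) is then just a matter of calling this from the start of the buffer until i code points have been skipped - linear, but each step is cheap.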

Since writing a truly international-character-aware string class is a challenge (collation, tolower, character ordering, whitespace, left-to-right/right-to-left, etc.), none of the string classes I know about do this.
If you ever attempt it, you’ll face a mountain, for little or no benefit anyway. You can have a look at ICU, or iconv.

I see juce_win32_Files.cpp as running into trouble here. I’m not concerned with the internal representation inside the String class, just that the String class doesn’t support surrogate pairs, so it’s possible to pass String::fromUTF8 a UTF-8 encoded filename containing a character that needs a surrogate pair, and then have CreateFile open some other file.

I have code for handling surrogate pairs, as well as some other handling for invalid UTF-16 etc., that I’d like to contribute. Before I submit a patch, I’d like to know if you’re interested in it, and if so, how you’d like it structured: where to put the #define that identifies Windows as having a 2-byte wchar_t, and then whether to use it in juce_String.cpp or to have some code in src/native/windows, or what.

Thanks for your help.

-DB

Thanks, always interested to see what mods you’ve got, though I’d want to keep the String class free from knowing anything about win32. Have to admit that I’ve never heard of a surrogate pair, though, so you may have thought it through more deeply than me! Feel free to email me directly if you want to send me stuff!

Surrogate pair:

Short answer: it solves the problem of representing a code point that doesn’t fit in a single 16-bit integer. A flag in the first value indicates that a second one follows, and the two are combined to produce the code point. (UTF-8 solves the same problem with multi-byte sequences, where the first byte says how many continuation bytes follow.)
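
Concretely, for the UTF-16 case (a hypothetical helper for illustration; it doesn’t validate its input):

#include <cstdint>

// Encodes one code point as UTF-16. Anything above U+FFFF doesn't fit
// in a single 16-bit unit, so it's split into a high/low surrogate pair.
static int encodeUtf16 (uint32_t codePoint, uint16_t* out)
{
    if (codePoint <= 0xffff)                   // BMP: one unit is enough
    {
        out[0] = (uint16_t) codePoint;
        return 1;
    }

    codePoint -= 0x10000;                                // 20 bits remain
    out[0] = (uint16_t) (0xd800 + (codePoint >> 10));    // high surrogate: top 10 bits
    out[1] = (uint16_t) (0xdc00 + (codePoint & 0x3ff));  // low surrogate: bottom 10 bits
    return 2;
}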

I agree that the String class shouldn’t have to know about Win32. But it would be really cool to have something like this:

typedef char AsciiChar;
typedef unsigned char Utf8Char;
typedef uint16 Utf16Char; // or wchar_t, depending on preference and environment
typedef uint32 Utf32Char; // or wchar_t, depending on preference and environment

template<typename CharType>
class UtfEncodedString
{
  /*...fill this part in Jules!...*/
};

And then these handy typedefs:

typedef UtfEncodedString<AsciiChar> AsciiString;
typedef UtfEncodedString<Utf8Char> Utf8String;
typedef UtfEncodedString<Utf16Char> Utf16String;
typedef UtfEncodedString<Utf32Char> Utf32String;

Template specializations should provide conversions between all of the types (throwing an exception if a UTF-encoded string can’t be represented as ASCII).
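
The ASCII case might look like this (a sketch only, with hypothetical names; the surrounding UtfEncodedString machinery is left out):

#include <stdexcept>
#include <string>

// Narrowing from a wider encoding has to fail loudly when a code point
// is outside the 7-bit ASCII range.
static std::string utf32ToAscii (const std::u32string& in)
{
    std::string out;
    out.reserve (in.size());

    for (char32_t c : in)
    {
        if (c > 0x7f)
            throw std::range_error ("string is not representable as ASCII");

        out += (char) c;
    }

    return out;
}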

If UtfEncodedString has almost all the functionality of the existing String, then we can replace the existing String with a typedef, allowing the user of the library to determine how strings are stored:

typedef Utf8String String; // everything stored as Utf8

If you have those classes, then at least on Windows you can change every line of code that calls a Windows function to explicitly call the wide Unicode version (the Win32 API routines with a W suffix). For example:

juce_win32_Windowing.cpp

    void Win32ComponentPeer::setTitle (const String& title)
    {
        //SetWindowText (hwnd, title);  /* ascii or unicode depending on compile settings... we don't want this */
        SetWindowTextW (hwnd, Utf16String (title));  // better - now we don't care how the title is encoded
    }

With something like the UtfEncodedString template class, a developer can achieve true complete mastery over all strings! Any encoding, freely convertible and assignable, passed through functions, etc… Of course there would need to be a character iterator framework to replace the direct indexing using the array operator, but we knew that (TextEditor would be the toughest).

Come to think of it, TextEditor or any other piece of difficult code can just work with Utf32 strings, converting everything internally. And Utf32 and Ascii encoded strings can keep the array indexing operator (would have to use some template/SFINAE magic to make operator[] available only for those two types).
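
The SFINAE part could look something like this (a minimal sketch; char32_t and char stand in for Utf32Char and AsciiChar):

#include <type_traits>

// Trait marking which encodings are fixed-width and therefore safe to
// index into directly.
template <typename CharType> struct isFixedWidth           : std::false_type {};
template <>                  struct isFixedWidth<char32_t> : std::true_type  {};  // UTF-32
template <>                  struct isFixedWidth<char>     : std::true_type  {};  // ASCII

template <typename CharType>
class UtfEncodedString
{
public:
    // operator[] only takes part in overload resolution for fixed-width
    // encodings; UTF-8/UTF-16 strings get a compile error and have to use
    // an iterator instead.
    template <typename T = CharType,
              typename std::enable_if<isFixedWidth<T>::value, int>::type = 0>
    CharType operator[] (int index) const   { return data [index]; }

private:
    const CharType* data = nullptr;
};

So UtfEncodedString<char32_t> allows s[0], while the same line with a UTF-8 string is rejected at compile time.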

These are just some ideas I have played around with and done some experimenting with, don’t get offended if it’s not your style!

If you don’t know about Unicode encodings, this is most definitely the place to start! It’s the definitive introduction:

[size=150]The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)[/size]
by Joel Spolsky

(http://www.joelonsoftware.com/articles/Unicode.html)

Thanks for the enormously over-the-top reply! I do actually understand Unicode and know how the encodings work - I’d just forgotten/never heard the term “surrogate pair”, but I do understand the concept!

I’ve also got a few files with scribbled ideas and experiments for ways to structure string encodings; I’ll figure out something neat with them when I get a chance.

Heh… you’re welcome. I love any excuse to stop working and be a forum junkie. When I lie awake at night and fantasize about UTF-encoded strings, that is what I imagine. I can’t think of anything to do with Unicode strings that couldn’t be done with something like that, so I figured I would throw it up there for fun.

[quote=“jules”]Thanks for the enormously over-the-top reply! I do actually understand Unicode and know how the encodings work - I’d just forgotten/never heard the term “surrogate pair”, but I do understand the concept!

I’ve also got a few files with scribbled ideas and experiments for ways to structure string encodings; I’ll figure out something neat with them when I get a chance.[/quote]
Let me know if I can help, perhaps by sending working routines that you can dissect / reassemble as you wish? They pass all the tests mentioned here: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt.

-DB

Considering the vast amount of memory in today’s computers, why not just use a 32-bit character representation for String and avoid all the hassle that’s involved with surrogates? I bet you won’t see any difference in speed either, because on today’s CPUs byte access is typically implemented in microcode, so bytes are often actually slower to access than int32s.

Hmm, 800 gigabytes of MP3s with metadata, and all the strings in 32-bit instead of 8-bit UTF? I think I’ll pass on that one.

Very good example, TheVinn. That’s about 200,000 MP3s (if one MP3 is 4 MB on average). Let’s say each MP3 has about 250 characters of metadata on average: that’s about 1 KB of UTF-32 string data per file, so roughly 200,000 KB = 200 MB of metadata. That’s OK for such a vast amount of MP3s. The current implementation would use about 100 MB on Windows, and I don’t think it’s a problem to use 100 MB more with such a big collection. With such a big collection, the most important thing is to be able to quickly search, index, and sort the strings, and the need to deal with surrogates would be a major performance bottleneck here in some cases! So if it were me, yeah, I’d absolutely love to give away 1/40 of my 4 GB of RAM for better performance when dealing with my 200,000-tune collection!
BTW, 200,000 MP3s at $1/MP3 = $200,000 :wink:

Let’s not fool ourselves: performance comes from fitting the data into the CPU cache (http://www.1024cores.net/home/lock-free-algorithms/first-things-first).

Larger data set = poorer performance, period.

The overhead of dealing with surrogates is well worth the gains of shoving far fewer bytes through the cache lines of a modern processor. And I know this from testing… my original implementation was all UTF-32. Now I keep it all as UTF-8.

Did I say 800 GB? I meant a terabyte. Anyway, your numbers are off:
[size=150]
iDJPool - Servicing DJ’s Since 1985
[/size]

$50/month and you get B sides, remixes, and prereleases.

:wink: back.

OK, if you say you’ve tested UTF-32 vs. UTF-8 performance, I believe you. Of course transferring 4x more memory takes 4x more time; we don’t need to argue about that. But somehow (from experience) I have the impression that the most straightforward implementations are often the best choice on computers, and UTF-32 just seems elegant and straightforward - especially when you look at the ever-growing number of Asian computer users.
I’ll send you a PM about that website you just mentioned.