Surrogate pair:
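Short answer: a surrogate pair solves the problem of how to represent a Unicode code point above U+FFFF in UTF-16, using two 16-bit code units. The first (high) surrogate falls in a reserved range, 0xD800 to 0xDBFF, which signals that a second (low) surrogate, 0xDC00 to 0xDFFF, follows and must be combined with it to produce the code point. UTF-8 solves the same problem differently: the lead byte indicates how many continuation bytes follow, for one to four 8-bit code units in total.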
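To make the arithmetic concrete, here is a minimal sketch of splitting a code point into a surrogate pair and joining it back. The function names toSurrogatePair and fromSurrogatePair are made up for illustration, and the code assumes its input is already in the valid range, so there is no error checking.

```cpp
#include <cstdint>
#include <utility>

// Splits a code point above U+FFFF into a UTF-16 surrogate pair.
// Assumption: 0x10000 <= codePoint <= 0x10FFFF (not validated here).
inline std::pair<std::uint16_t, std::uint16_t> toSurrogatePair (std::uint32_t codePoint)
{
    const std::uint32_t v = codePoint - 0x10000; // leaves a 20-bit value
    const std::uint16_t high = static_cast<std::uint16_t> (0xD800 + (v >> 10));   // top 10 bits
    const std::uint16_t low  = static_cast<std::uint16_t> (0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return { high, low };
}

// Recombines a high/low surrogate pair into the original code point.
// Assumption: high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
inline std::uint32_t fromSurrogatePair (std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t> (high) - 0xD800) << 10)
         + (static_cast<std::uint32_t> (low) - 0xDC00);
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, and feeding that pair to fromSurrogatePair gives back 0x1F600.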
I agree that the String class shouldn’t have to know about Win32. But it would be really cool to have something like this:
typedef char AsciiChar;
typedef unsigned char Utf8Char;
typedef uint16 Utf16Char; // or wchar_t, depending on preference and environment
typedef uint32 Utf32Char; // or wchar_t, depending on preference and environment
template<typename CharType>
class UtfEncodedString
{
/*...fill this part in Jules!...*/
};
And then these handy typedefs:
typedef UtfEncodedString<AsciiChar> AsciiString;
typedef UtfEncodedString<Utf8Char> Utf8String;
typedef UtfEncodedString<Utf16Char> Utf16String;
typedef UtfEncodedString<Utf32Char> Utf32String;
Template specializations should provide conversions between all of the types (throwing an exception if a utf-encoded string can’t be represented as Ascii).
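As a rough sketch of what the throwing conversion could look like, here is a free function going from UTF-32 to ASCII. It uses std::u32string and std::string as stand-ins for the proposed classes, and the name toAscii and the choice of std::range_error are my own assumptions, not anything that exists in the library.

```cpp
#include <stdexcept>
#include <string>

// Converts UTF-32 text to ASCII, throwing if any code point does not fit.
// Hypothetical helper: std::u32string stands in for Utf32String here.
inline std::string toAscii (const std::u32string& text)
{
    std::string result;
    result.reserve (text.size());

    for (char32_t c : text)
    {
        if (c > 0x7F) // ASCII only covers U+0000 to U+007F
            throw std::range_error ("string contains non-ASCII code points");

        result.push_back (static_cast<char> (c));
    }

    return result;
}
```

The conversions in the other direction (ASCII to anything, or between the UTF encodings) can never fail on well-formed input, so only the narrowing ones would need to throw.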
If UtfEncodedString has almost all the functionality of the existing String, then we can replace the existing String with a typedef, allowing the user of the library to determine how strings are stored:
typedef Utf8String String; // everything stored as Utf8
If you have those classes, then at least on Windows you can change every line of code that calls a Windows function to explicitly call the wide Unicode version (the Win32 API routines with the letter W appended to their names). For example:
juce_win32_Windowing.cpp
void Win32ComponentPeer::setTitle (const String& title)
{
//SetWindowText (hwnd, title); /* ANSI or Unicode depending on compile settings... we don't want this */
SetWindowTextW (hwnd, Utf16String (title)); // better, now we don't care how title is encoded.
}
With something like the UtfEncodedString template class, a developer can achieve true complete mastery over all strings! Any encoding, freely convertible and assignable, passed through functions, etc… Of course there would need to be a character iterator framework to replace the direct indexing using the array operator, but we knew that (TextEditor would be the toughest).
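The core of such a character iterator for UTF-8 is a function that decodes one code point and steps past it. Here is a minimal sketch; the name nextCodePoint is hypothetical, it works on a plain std::string rather than the proposed class, and it assumes the input is well-formed UTF-8 (no validation of continuation bytes, overlong forms or truncated sequences).

```cpp
#include <cstddef>
#include <string>

// Decodes the code point starting at text[pos] and advances pos past it.
// Assumption: text holds well-formed UTF-8 and pos is on a lead byte.
inline char32_t nextCodePoint (const std::string& text, std::size_t& pos)
{
    const unsigned char lead = static_cast<unsigned char> (text[pos++]);

    if (lead < 0x80)
        return lead; // single-byte (ASCII) character

    // The lead byte tells us how many continuation bytes follow.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;

    // Keep only the payload bits of the lead byte (5, 4 or 3 of them).
    char32_t codePoint = lead & (0x3F >> extra);

    // Each continuation byte contributes its low 6 bits.
    while (extra-- > 0)
        codePoint = (codePoint << 6) | (static_cast<unsigned char> (text[pos++]) & 0x3F);

    return codePoint;
}
```

Calling it in a loop until pos reaches text.size() walks the string one character at a time, which is what code like TextEditor would do instead of indexing with the array operator.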
Come to think of it, TextEditor or any other piece of difficult code can just work with Utf32 strings, converting everything internally. And Utf32 and Ascii encoded strings can keep the array indexing operator (would have to use some template/SFINAE magic to make operator[] available only for those two types).
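The template/SFINAE part could look roughly like this. It is only a sketch: EncodedString and IsFixedWidth are hypothetical names, char and char32_t stand in for the ASCII and UTF-32 character types, and a std::vector is used for storage just to keep the example short.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Trait marking the encodings where one code unit is always one character.
template <typename CharType>
struct IsFixedWidth : std::false_type {};

template <> struct IsFixedWidth<char>     : std::true_type {}; // ASCII stand-in
template <> struct IsFixedWidth<char32_t> : std::true_type {}; // UTF-32 stand-in

template <typename CharType>
class EncodedString
{
public:
    explicit EncodedString (std::vector<CharType> units) : data (std::move (units)) {}

    // Only takes part in overload resolution for fixed-width encodings,
    // so indexing a UTF-8 or UTF-16 string fails to compile.
    template <typename C = CharType,
              typename std::enable_if<IsFixedWidth<C>::value, int>::type = 0>
    CharType operator[] (std::size_t index) const { return data[index]; }

private:
    std::vector<CharType> data;
};
```

With this, EncodedString<char32_t> can be indexed directly, while writing s[0] on an EncodedString<char16_t> is a compile error, which pushes the caller towards the iterator approach for the variable-width encodings.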
These are just some ideas I have played around with and done some experimenting with, don’t get offended if it’s not your style!