How to convert text strings from old MAC-IS to UTF8?

I’m reading in some old text data (as C-strings) that was apparently saved in MAC-IS format. (Maybe that’s also known as Roman?) I need to add these strings (preset names) to a StringArray so I can add an item list to a ComboBox.

I am encountering an ellipsis stored as the single byte ‘\xc9’, and of course when I try to add it to a StringArray I hit this assertion:

“you’re trying to create a string from 8-bit data that contains values greater than 127.”

I found this online:

<code_set_name> MAC-IS
<comment_char> %
<escape_char> /
% version: 1.0
CHARMAP

[…snip…]
<U00AB>     /xc7         LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
<U00BB>     /xc8         RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

<U2026>     /xc9         HORIZONTAL ELLIPSIS

<U00A0>     /xca         NO-BREAK SPACE
<U00C0>     /xcb         LATIN CAPITAL LETTER A WITH GRAVE
<U00C3>     /xcc         LATIN CAPITAL LETTER A WITH TILDE
[…snip…]

How can I convert this automatically into UTF8 or some other format that juce::String will accept? I’ve tried various things but nothing’s working. I should mention I’m working on Mac right now…

If ‘MAC-IS’ is UTF16, you can use the juce::String UTF16 constructor.

    String (CharPointer_UTF16 (data));

or

    // If your string is not null-terminated
    String (CharPointer_UTF16 (data), CharPointer_UTF16 (data + dataSize));

You might have to manually preprocess your string if it doesn’t meet Unicode spec.

@oli1 - thanks, that doesn’t seem to work.

What I ended up doing, not sure if it’s a great solution or bulletproof, was to manually convert the ellipsis to a 3-byte UTF8 ellipsis and handle any other weird characters by replacing with a space (so I can maybe handle them as well if I have to):


    // variables:
    char theTempName[256];
    StringArray menuList;

[snip...]

        if (CharPointer_ASCII::isValidString (theTempName, std::numeric_limits<int>::max()))
            menuList.add(theTempName);
        else
        {
            char* c;
            size_t n, len = strlen (theTempName);
            for (n = 0, c = theTempName; n < len; n++, c++)
            {
                if ((unsigned char) *c == 0xc9)  // ellipsis
                {
                    // shift everything after this byte (including the null terminator) up by 2 bytes
                    jassert(len < sizeof(theTempName) - 2);  // enough room?
                    memmove(theTempName + n + 3, theTempName + n + 1, len - n);
                    // replace the 3 bytes with the UTF8 code for ellipsis
                    *c = '\xe2'; c++;
                    *c = '\x80'; c++;
                    *c = '\xa6';

                    n += 2;     // keep n in step with c, which just advanced 2 bytes
                    len += 2;   // we "inserted" 2 bytes
                }
                else if ((unsigned char) *c > 127)
                {
                    *c = ' ';   // replace others with a space
                }
            }
            menuList.add(CharPointer_UTF8(theTempName));
        }

Trying this:

                menuList.add(CharPointer_UTF16(theTempName));

… just gives me an error. I tried different things but couldn’t make it work:

No matching conversion for functional-style cast from 'char [256]' to 'juce::CharPointer_UTF16'

CharPointer_UTF16 expects the source to be 16 bits wide (uint16_t). I assumed UTF16 from your snippet, but it looks like it could be a non-standard UTF8 variant.

Unicode is a bit annoying; any of the Unicode encodings can be stored in a char array, so the type alone doesn’t tell you which one you have.

Your method is probably the best way forward.

It’s most likely an old 8-bit format where 0-127 is ASCII and 128-255 is whatever they felt like.

UTF-8 only matches ASCII for 0-127; everything above that is encoded as a multi-byte sequence. You can either translate the values >= 128 from the original format into their UTF-8 equivalents, or strip them out.
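
To make the "translate into UTF-8 equivalents" part a bit more concrete, here's a rough sketch of encoding one Unicode code point as UTF-8 bytes in plain C++. `appendUtf8` is a made-up name, not a JUCE function, and it only covers code points up to U+FFFF with no surrogate checking, which should be enough for the table entries quoted above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of JUCE): appends the UTF-8 encoding of a
// Unicode code point to `out`. Assumes cp <= 0xFFFF and is not a surrogate.
inline void appendUtf8 (std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
    {
        // 0-127: one byte, identical to ASCII
        out += static_cast<char> (cp);
    }
    else if (cp < 0x800)
    {
        // U+0080..U+07FF: two bytes, 110xxxxx 10xxxxxx
        out += static_cast<char> (0xC0 | (cp >> 6));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
    else
    {
        // U+0800..U+FFFF: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char> (0xE0 | (cp >> 12));
        out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char> (0x80 | (cp & 0x3F));
    }
}
```

For U+2026 (the ellipsis) this gives the same three bytes, E2 80 A6, that the manual replacement earlier in the thread writes by hand.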

That table you found actually has the mappings from MAC-IS to Unicode code points (the <Uxxxx> column), so you’re already halfway there: you just need to implement that table and encode each code point as UTF-8.
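
Something along these lines might work as a starting point. It's only a sketch: `legacyByteToUtf8` and `legacyToUtf8` are names I made up, the table contains just the six entries visible in the snippet quoted above (the snipped rows would have to be filled in from the full charmap), and any unmapped high byte falls back to a space, same as the workaround earlier in the thread.

```cpp
#include <string>

// Hypothetical helper: UTF-8 bytes for one high byte of the legacy encoding.
// Only the six entries shown in the quoted charmap are filled in here.
inline const char* legacyByteToUtf8 (unsigned char b)
{
    switch (b)
    {
        case 0xC7: return "\xC2\xAB";      // U+00AB left-pointing double angle quotation mark
        case 0xC8: return "\xC2\xBB";      // U+00BB right-pointing double angle quotation mark
        case 0xC9: return "\xE2\x80\xA6";  // U+2026 horizontal ellipsis
        case 0xCA: return "\xC2\xA0";      // U+00A0 no-break space
        case 0xCB: return "\xC3\x80";      // U+00C0 A with grave
        case 0xCC: return "\xC3\x83";      // U+00C3 A with tilde
        default:   return " ";             // not in this partial table: use a space
    }
}

// Converts a null-terminated legacy 8-bit string into a new UTF-8 std::string.
// Building a new string avoids the in-place memmove bookkeeping.
inline std::string legacyToUtf8 (const char* text)
{
    std::string out;

    for (const char* p = text; *p != 0; ++p)
    {
        const auto b = static_cast<unsigned char> (*p);

        if (b < 0x80)
            out += *p;                     // plain ASCII passes through unchanged
        else
            out += legacyByteToUtf8 (b);   // high byte: look up its UTF-8 sequence
    }

    return out;
}
```

On the JUCE side, the result could then be added with something like `menuList.add (String::fromUTF8 (legacyToUtf8 (theTempName).c_str()));`, assuming `theTempName` and `menuList` are as in the code above.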