utf8



I'm getting a repeatable attempt to write to 0x0000 from this bit of code in juce_mac_Fonts.mm.

It appears to happen for some strings and not others.  A string I can use to repeat the crash is shown below. 

    float getStringWidth (const String& text) override
    {
        float x = 0;

        if (ctFontRef != nullptr && text.isNotEmpty())
        {
            CFStringRef cfText = text.toCFString();
            CFAttributedStringRef attribString = CFAttributedStringCreate (kCFAllocatorDefault, cfText, attributedStringAtts);

What appears to happen is that cfText is null.   This is a 64-bit debug build without optimisation. 

CFAttributedStringCreate is called, and that then calls CFStringCreateCopy with two null arguments.  The first is presumably the default allocator, but the second ought to point to the string.

(lldb) p cfText

(CFStringRef) $4 = 0x0000000000000000

However, calling it from the debugger returns a perfectly good string:

(lldb) p text.toCFString()

(CFStringRef) $3 = 0x000063000006d900 @"Gold Fr\tquence 3 - www.frequence3.fr"

Which is unhelpful.

____ Update. 

Right.  After some work today looking at a crash it turns out that I was creating strings without valid UTF8 in them.  Don't trust the internet. 

But it looks like there is a series of minor problems:

  • fromUTF8 only checks for UTF8 errors if a string length is passed to it.  If you just let it carry on to the null terminator, it behaves differently.
  • Apple happily returns a null ptr from CFAttributedString if it doesn't get valid UTF8. 
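
One way to sidestep both problems is to validate the bytes yourself, with an explicit length, before building a string from them. A minimal sketch in plain C++ with no JUCE dependency: `isValidUTF8` is a made-up helper name, not a JUCE function, and I've assumed you want the strict rules (overlong forms, surrogates and values above U+10FFFF are all rejected).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of JUCE): returns true if the first
// numBytes bytes of data are well-formed UTF-8.
bool isValidUTF8 (const char* data, std::size_t numBytes)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*> (data);
    const unsigned char* const end = p + numBytes;

    while (p < end)
    {
        const unsigned char lead = *p++;

        if (lead < 0x80)
            continue; // plain ASCII

        int extra = 0;
        std::uint32_t value = 0, minimum = 0;

        if      ((lead & 0xE0) == 0xC0) { extra = 1; value = lead & 0x1Fu; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; value = lead & 0x0Fu; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; value = lead & 0x07u; minimum = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        for (int i = 0; i < extra; ++i)
        {
            if (p == end || (*p & 0xC0) != 0x80)
                return false; // sequence cut short

            value = (value << 6) | (*p++ & 0x3Fu);
        }

        if (value < minimum)                     return false; // overlong encoding
        if (value >= 0xD800 && value <= 0xDFFF)  return false; // UTF-16 surrogate
        if (value > 0x10FFFF)                    return false; // beyond Unicode
    }

    return true;
}
```

If this returns false you can refuse the data, or clean it up first, rather than finding out later from a null CFStringRef.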

So I modified some code from CharPointer_UTF8 to generate validated and error-corrected UTF8 strings.  It might be useful if you are getting data from sources where it might be corrupt and/or untrusted: 

/**
 @param source - a null-terminated string containing data with possibly dubious UTF8

 @return - a JUCE String containing valid UTF8.

 Takes a UTF8 string, replacing occurrences of bad codes with the
 Unicode replacement character U+FFFD.

 See http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
 for the UTF stress test.  This passes perfectly.
 */
String fromUTF8WithValidation (const char* source)
{
    String cleanString;

    while (*source != 0)
    {
        const signed char byte = (signed char) *source;

        if (byte >= 0) /* Plain ASCII. */
        {
            cleanString += (juce_wchar) (uint8) byte;
            source++;
            continue;
        }

        if ((byte & 0xC0) != 0xC0) /* A continuation byte with no lead byte in front of it. */
        {
            cleanString += String::fromUTF8 ("\xEF\xBF\xBD");
            source++;
            continue;
        }

        source++; /* Step past the lead byte so the loop below reads the bytes that follow it. */

        uint32 n = (uint32) (uint8) byte;
        uint32 mask = 0x7f;
        uint32 bit = 0x40;
        size_t numExtraValues = 0;

        /* Stops at 0x8 rather than 0x10 so that a four-byte lead gets three continuation bytes. */
        while ((n & bit) != 0 && bit > 0x8)
        {
            mask >>= 1;
            ++numExtraValues;
            bit >>= 1;
        }

        n &= mask;

        bool validCharacter = true;

        /* Remaining possibilities for failure: too many or too few continuation bytes, including the
         possibility of running out of the buffer. */
        for (size_t i = 1; i <= numExtraValues; ++i)
        {
            /* Only peek: if this is a null or the start of the next character it must not be consumed. */
            const uint8 nextByte = (uint8) *source;

            if ((nextByte & 0xc0) != 0x80)
            {
                validCharacter = false;
                break; /* We might have a null so need to get out of this loop. */
            }

            source++;
            n <<= 6;
            n |= (nextByte & 0x3f);
        }

        /* Check we haven't got continuation bytes going on...and on... */
        while ((*source & 0xc0) == 0x80)
        {
            validCharacter = false;
            source++;
        }

        if (validCharacter)
            cleanString += (juce_wchar) n;
        else
            cleanString += String::fromUTF8 ("\xEF\xBF\xBD");
    }

    return cleanString;
}
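
If you want to poke at the behaviour without pulling in JUCE, here is a rough standalone port of the same idea. It is a sketch under assumptions rather than the code I actually ship: `sanitiseUTF8` is a made-up name, and it returns a `std::u32string` of code points instead of a JUCE String. It steps past the lead byte before reading continuation bytes, only peeks at each candidate continuation byte so that a null or the next character is never swallowed, and allows three continuation bytes after a four-byte lead.

```cpp
#include <string>

// Hypothetical JUCE-free port for experimenting: decodes a null-terminated
// buffer to UTF-32, substituting U+FFFD for each malformed sequence.
std::u32string sanitiseUTF8 (const char* source)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;

    while (*source != 0)
    {
        const unsigned char lead = (unsigned char) *source++;

        if (lead < 0x80)            { out += (char32_t) lead; continue; } // ASCII
        if ((lead & 0xC0) != 0xC0)  { out += replacement;     continue; } // stray continuation

        char32_t n = lead, mask = 0x7f, bit = 0x40;
        int numExtra = 0;

        // Count the extra bytes promised by the lead byte (at most three).
        while ((n & bit) != 0 && bit > 0x8)
        {
            mask >>= 1;
            ++numExtra;
            bit >>= 1;
        }

        n &= mask;
        bool valid = true;

        for (int i = 0; i < numExtra; ++i)
        {
            const unsigned char next = (unsigned char) *source; // peek only

            if ((next & 0xC0) != 0x80) { valid = false; break; }

            ++source;
            n = (n << 6) | (next & 0x3Fu);
        }

        // Any further continuation bytes mean the sequence was too long.
        while (((unsigned char) *source & 0xC0) == 0x80)
        {
            valid = false;
            ++source;
        }

        out += valid ? n : replacement;
    }

    return out;
}
```

Feeding it a name with a lone Latin-1 0xE9 byte in the middle, such as "Fr\xE9quence", gives back nine code points with U+FFFD in the third position and the following "quence" intact. Note that, like the function above, it only checks the byte structure; it does not reject overlong encodings or surrogates.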

I found a UTF-8 stress test file and this now passes with flying colours. 

However, it looks like there are other problems, including some slightly strange behaviour when creating StringArrays from Files containing broken UTF8.  It stops on nulls rather than at the end of the file for a start, but it also seems to stop on some other types of corrupt characters.  I don't know if that's likely to be a problem, but it might be worth mentioning in the docs. 
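
For the stopping-on-nulls part, the workaround I'd try is to read the whole file as raw bytes and split it by length, so that an embedded null is just another byte in the line. A minimal sketch in plain C++, assuming the file contents are already in a `std::string`; `splitLinesKeepingNulls` is a made-up helper name, not a JUCE call.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits a buffer of known length into lines on '\n',
// keeping any embedded null bytes instead of treating them as the end.
std::vector<std::string> splitLinesKeepingNulls (const std::string& buffer)
{
    std::vector<std::string> lines;
    std::size_t start = 0;

    while (start < buffer.size())
    {
        std::size_t end = buffer.find ('\n', start);

        if (end == std::string::npos)
            end = buffer.size(); // last line has no trailing newline

        lines.emplace_back (buffer, start, end - start);
        start = end + 1;
    }

    return lines;
}
```

Each line then carries its own length, so it can be validated or cleaned up before it gets anywhere near a string class that stops at the first null.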

cheers! Jim. 

 

PS. Never trust a debugger.