Character encoding in WavAudioFormat


#1

I've been adding some code to read the ListInfoChunk in WavAudioFormat, since the current version in Juce only seems to write this chunk. Whilst testing it with a wav file generated by SoundForge, I noticed some character encoding issues. In my particular example, a copyright symbol '©' appears in the reader's metadata StringArray as a close bracket ')'.

The RIFF/Wav specification indicates that the default encoding for text should be ISO 8859/1 unless overridden by the presence of a CSET chunk. In my case, the CSET chunk is not present, but even if it were, it can only specify alternative code pages (and not UTF-8, for example). I doubt many applications bother to write it anyway.
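For anyone who does want to honour CSET, here is a rough sketch of looking for it in plain C++ (no Juce types). It assumes the chunk sits at the top level of the RIFF form and that its payload starts with a little-endian wCodePage WORD, which is my reading of the RIFF spec; findCsetCodePage and the two read helpers are made-up names, not anything that exists in Juce.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Little-endian reads, as used throughout RIFF.
static std::uint16_t readLE16 (const std::uint8_t* p)
{
    return static_cast<std::uint16_t> (p[0] | (p[1] << 8));
}

static std::uint32_t readLE32 (const std::uint8_t* p)
{
    return static_cast<std::uint32_t> (p[0])
         | (static_cast<std::uint32_t> (p[1]) << 8)
         | (static_cast<std::uint32_t> (p[2]) << 16)
         | (static_cast<std::uint32_t> (p[3]) << 24);
}

// Hypothetical helper: returns the wCodePage field of a top-level CSET chunk,
// or std::nullopt if the buffer is not RIFF or contains no CSET chunk.
std::optional<std::uint16_t> findCsetCodePage (const std::vector<std::uint8_t>& file)
{
    if (file.size() < 12 || std::memcmp (file.data(), "RIFF", 4) != 0)
        return std::nullopt;

    std::size_t pos = 12; // skip "RIFF", the RIFF size and the form type

    while (pos + 8 <= file.size())
    {
        const std::uint32_t chunkSize = readLE32 (file.data() + pos + 4);
        const std::size_t payload = pos + 8;

        if (chunkSize > file.size() - payload)
            break; // truncated chunk, give up

        if (std::memcmp (file.data() + pos, "CSET", 4) == 0 && chunkSize >= 2)
            return readLE16 (file.data() + payload);

        pos = payload + chunkSize + (chunkSize & 1); // chunks are word-aligned
    }

    return std::nullopt;
}
```

An empty result would mean falling back to ISO 8859/1; as far as I can tell the spec says a code page of zero means the same thing.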

The problem is that the WavAudioFormat class treats most text from the wav file as UTF-8 since it tends to call MemoryBlock::toString().

I've kludged in some dirtiness (below) for the time being, on the basis that assuming ISO 8859/1 (i.e. not bothering to check for a CSET chunk) is better than a kick up the proverbials. I've still to handle going back the other way (UTF-8 -> ISO 8859/1 in the file writer), and I'm not sure what the deal is with AIFF files yet.

Anyway, this is mainly to bring it to your attention - not suggesting it should go into Juce in this manner, probably best in the String classes or tucked away in the AudioFormat class where such dirt can do no harm!

 

// Attempts to parse the contents of the block as a zero-terminated ISO-8859-1 string.
// The returned String is built from the UTF-8 conversion of that text.
String MemoryBlock::toStringFromISO8859() const
{
    // Create a second memory block for converting to UTF-8 (worst case,
    // this may have to be up to twice the size of the ISO-8859-1 string,
    // plus one byte for the terminator)
    MemoryBlock UTF8Text (size * 2 + 1);
    const uint8* pIn = static_cast<const uint8*> (getData());   // ISO-8859-1 in
    uint8* pOut = static_cast<uint8*> (UTF8Text.getData());     // UTF-8 out

    // Check the bounds before dereferencing, so that a block with no zero
    // terminator is never read past its end
    for (size_t i = 0; (pIn != nullptr) && (i < size) && (*pIn != 0); ++i)
    {
        if (*pIn < 128)
        {
            // ASCII passes through unchanged
            *pOut++ = *pIn++;
        }
        else
        {
            // 0x80..0xff become two bytes: 0xc2 or 0xc3, then 10xxxxxx
            *pOut++ = static_cast<uint8> (0xc2 + (*pIn > 0xbf));
            *pOut++ = static_cast<uint8> ((*pIn++ & 0x3f) + 0x80);
        }
    }

    // Null terminate
    *pOut = 0;
    return String (CharPointer_UTF8 (UTF8Text.data), UTF8Text.size);
}
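
For the reverse direction (UTF-8 -> ISO 8859/1 in the file writer), something along these lines might do as a starting point. It is only a sketch in plain C++ using std::string rather than Juce's String; utf8ToLatin1 and its replacement parameter are made-up names, and swapping anything outside Latin-1 for '?' is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Juce: converts UTF-8 to ISO-8859-1.
// Code points above U+00FF, and malformed sequences, become 'replacement'.
std::string utf8ToLatin1 (const std::string& utf8, char replacement = '?')
{
    std::string out;
    out.reserve (utf8.size());

    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char c = static_cast<unsigned char> (utf8[i]);

        if (c < 0x80)
        {
            // Plain ASCII is identical in both encodings
            out.push_back (static_cast<char> (c));
            ++i;
        }
        else if ((c == 0xc2 || c == 0xc3) && i + 1 < utf8.size()
                 && (static_cast<unsigned char> (utf8[i + 1]) & 0xc0) == 0x80)
        {
            // Two-byte sequence for U+0080..U+00FF: 110000xx 10yyyyyy
            const unsigned char next = static_cast<unsigned char> (utf8[i + 1]);
            out.push_back (static_cast<char> (((c & 0x03) << 6) | (next & 0x3f)));
            i += 2;
        }
        else
        {
            // Not representable in Latin-1 (or malformed): emit one replacement
            // and skip the lead byte plus any continuation bytes after it
            out.push_back (replacement);
            ++i;

            while (i < utf8.size() && (static_cast<unsigned char> (utf8[i]) & 0xc0) == 0x80)
                ++i;
        }
    }

    return out;
}
```

With this, the UTF-8 bytes 0xc2 0xa9 for '©' come back out as the single byte 0xa9, which is what the reader code above expects to find in the file.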


#2

Yeesh.. Thanks for the heads-up, we'll take a look at that!


#3

I've had a look at the spec for AIFF and it only mentions plain ASCII. I haven't been able to test this with a SoundForge-generated file, because I can't find a way of entering any string metadata in SoundForge that it's willing to save in an AIFF.