BUG: MP3Reader incorrectly calculates number of frames

I’ve spent some time just trying to get mp3’s playing properly but I seem to keep running into issues. CoreAudioReader has performance problems when opening VBR mp3 files (https://forum.juce.com/t/coreaudioreader-long-delay-before-streaming-mp3/18358/2) so I thought I’d take a look at Juce’s mp3 implementation as an alternative.

I’ve been comparing the number of sample frames calculated by each implementation. It seems to be a black art. For a given CBR mp3 (which should be the simplest to calculate) of ~6 min I get:

CoreAudio: 15303168
SoundForge: 15303344
**Juce MP3: 15300864**

So none of them agree, but the Juce implementation disagrees more than the others.

VBR’s are a bigger problem. Here’s the same file encoded with VBR at 44.1kHz:

CoreAudio: 15305472
SoundForge: 15304271
**Juce MP3: 13366656**

… and VBR at 48kHz:

CoreAudio: 16660224
SoundForge: 16659023
**Juce MP3: 25804800**

Looking at the implementation of MP3Reader::findLength(), it only seems to inspect the first MP3 frame header. For VBR files I’ve learnt that it needs to look at every frame header, as the bit rate varies from frame to frame.

So what I’m really asking is, what are the known limitations of the Juce MP3 implementation, and is there any chance the above will get fixed soon? I’ll do it myself if necessary but if I’m honest I’m beginning to get a bit exasperated with the amount of library testing and investigation I’ve already had to do.

The juce implementation is based on the libmpeg algorithm… I don’t claim to be an mp3 expert, or that the algorithm it uses is the best one, but as you can see, it’s one of those annoying formats where there’s no such thing as “correct” behaviour.

Best plan with mp3s (or most compressed audio formats) is to ignore the estimated length and just read them start-to-finish until they stop. In Tracktion for example, we always decompress audio files to temporary wavs before attempting to play them, which is very quick on modern machines.

I’ll be fully decompressing files for most operations (eg. editing) but I still ideally want to avoid it in some situations if possible - it’s a mobile application so decompressing a lot of files fully comes at a price in terms of ‘disk’ space.

What I can’t get a handle on with regards to ‘open ended’ streaming is that the Juce code I’ve come across always seems to expect a finite length and doesn’t have a concept of ‘length unknown’.

For example, the BufferingAudioFormat reader begins by copying the length (in sample frames) from the source reader and provides no means of saying it’s unknown. If I hack this value to zero then it plays silence, and if I hack it to s64 max then it plays forever (with silence at the end) with no means of detecting EOF.

Yes, none of the audio format classes were designed for infinite streams - I think that may be best handled as an entirely separate set of classes, but it’s not something I’ve looked into deeply.

OK fair enough Jules, I thought I might be missing something. I’ll go for a two pronged approach of making changes to support streaming, and seeing if the mp3 reader’s frame calculation can be improved without too much work.

Life was easy when I thought I could just use CoreAudio, and that neatly avoided all mp3 patent related ambiguities too. It looks to me though like it would be possible to separate MP3Player’s decoding code from its frame parsing code, so I could just use MP3Player to work out the length (faster than CoreAudio), and then use CoreAudio to do the actual playing (no patent issues).

I’ve learnt a bit more about mp3 files than I wanted to and here’s what I cooked up. Sorry, the code’s not quite Juce but should be fairly trivial to adapt and incorporate if you wish. For CBR files, it now comes up with the same answer as the CoreAudio reader, but does it in 5ms rather than the 10s I started with (https://forum.juce.com/t/coreaudioreader-long-delay-before-streaming-mp3/18358/2)

A quick explanation is that mp3 files contain chunks called ‘frames’. Each mp3 frame begins with a 4 byte header followed by a few hundred audio frames. An mp3 frame always represents a fixed period of time, so the actual number of audio frames within it depends upon the sample rate. CBR files always have the same sample rate, but VBR files will most likely have a different sample rate for each mp3 frame. It’s sometimes necessary to scan VBR file entirely and examine every header to get an accurate count, but if the encoder has been a good citizen then it should have added an extra ‘xing’ header with a TOC to make this unnecessary. CBR’s should be simpler but frames sometimes contain an extra padding slot to align sample rate multiples. Again, this means there’s no choice but to scan the whole file for a completely accurate count.

Note the implementation below doesn’t use the TOC for VBR files, as I’m intending on using CoreAudio for those.

Added to MP3Decoder::MP3Stream:

	// Calculates the total number of sample frames in the entire MP3 file.  Each MP3 frame within the file may (or may not) contain a padding 'slot', so
	// the only way to obtain a fully accurate count is to scan the entire file.  However, it is possible to only partically scan the first part of the
	// file and then infer the rest based upon the ratio of MP3 frames discovered with padding vs. frames without.  This method will be more accurate for
	// CBR (constant bit rate) files rather than VBR.
	// Note: This function may change the current stream position.
	// zAccuracy - If 0.0f, only the minimum number of MP3 frames will be scanned.  If 1.0f, the entire file is always scanned.
    s64 CalculateNumSampleFrames(f32 zAccuracy = 0.5f)
    {
		// Initial plausibility check on stream size - we'll need at least one mp3 frame
		const s64 StreamSize = stream.getTotalLength();
		if (StreamSize < 64)
			return 0;

		// zAccuracy determines the fraction of the stream we physically scan, and then we'll infer the rest based upon the statistics obtained
		const s64 ScanEndPos = (s64)(StreamSize * zAccuracy);

		// Always scan a minimum number of mp3 frames so we can obtain reasonable statistics - this is still fairly quick and means we're more likely to be accurate for small files.
                const s32 MinFramesToScan = 64;

		// Locate the first frame (from the current stream position)
		u32 FrameHeader = seekNextFrame(nullptr);
		if (FrameHeader == 0)
			return 0;

		// Location of the first valid frame (any data before this shouldn't be included in the calculation)
		const s64 FirstFrameStreamPos = stream.getPosition();

		MP3Frame Frame;
		s32 NumFrames = 0;

		// For each mp3 frame
		do
		{
			// Decode the header so we can obtain the frame's data size
			Frame.decodeHeader(FrameHeader);

			// Add to our frame count
			++NumFrames;

			// Skip past the frame's audio data
			stream.setPosition(stream.getPosition() + Frame.frameSize);

			// Search for the next frame matching the properties of the current frame
			FrameHeader = seekNextFrame(&Frame);
		}
		while ((FrameHeader != 0) && ((NumFrames < MinFramesToScan) || (stream.getPosition() < ScanEndPos)));

		// Found some valid frames?
		if (NumFrames <= 0)
			return 0;

		// Did we stop before we got to the end of the stream?
		s64 StreamPos = stream.getPosition();
		s64 NumBytesRemaining = StreamSize - StreamPos;
		if (NumBytesRemaining > 0)
		{
			// Estimate the number of frames remaining based upon the frames we scanned
			f32 MeanFramesPerByte = (f32)NumFrames / (StreamPos - FirstFrameStreamPos);
			s32 EstimatedRemainingFrames = RoundToNearestInt(NumBytesRemaining * MeanFramesPerByte);

			// Add the estimated part
			NumFrames += EstimatedRemainingFrames;
		}

		return NumFrames * 1152;
    }

	// Helper for CalculateNumSampleFrames().  Based upon scanForNextFrameHeader(), but doesn't change any state.
	// Seeks forward from the current stream position to find the start of the next mp3 header.
	// If zpLastFrame is nullptr, the function returns the first header found.  Otherwise, it will return
	// the first header found which matches the properties of the specified header.
	// Returns a frame header value, or zero if the next frame could not be found.  Valid frame headers are always non-zero.
    u32 seekNextFrame(const MP3Frame* zpLastFrame) noexcept
    {
        const s64 streamStartPos = stream.getPosition();

        int offset = -3;
        u32 header = 0;

        for (;;)
        {
            if (stream.isExhausted() || stream.getPosition() > streamStartPos + 32768)
            {
                return 0;
            }

            header = (header << 8) | (u8) stream.readByte();

            if (offset >= 0 && isValidHeader (header, (zpLastFrame == nullptr) ? -1 : zpLastFrame->layer))
            {
                if (zpLastFrame == nullptr)
                    return header;

                const bool mpeg25         = (header & (1 << 20)) == 0;
                const u32 lsf             = mpeg25 ? 1 : ((header & (1 << 19)) ? 0 : 1);
                const u32 sampleRateIndex = mpeg25 ? (6 + ((header >> 10) & 3)) : (((header >> 10) & 3) + (lsf * 3));
                const u32 mode            = (header >> 6) & 3;
                const u32 numChannels     = (mode == 3) ? 1 : 2;

                if (numChannels == (u32)zpLastFrame->numChannels && 
					lsf == (u32)zpLastFrame->lsf && 
					mpeg25 == zpLastFrame->mpeg25 && 
					sampleRateIndex == (u32)zpLastFrame->sampleRateIndex)
				{
                    return header;
				}
            }

            ++offset;
        }

        return 0;
    }

… used in MP3Reader constructor like this:

MP3Reader (…)
{
skipID3();

	// Calculate the number of frames
const s64 streamPos = stream.stream.getPosition();
	s64 NumSampleFrames = stream.CalculateNumSampleFrames(0.0f);

	// Restore the original stream position
stream.stream.setPosition (streamPos);

	if ((NumSampleFrames > 0) && readNextBlock())
{
		mFormat = ABF_F32;
    mSampleRate = stream.frame.getFrequency();
    mNumChannels = (unsigned int) stream.frame.numChannels;
}

}