This is an offshoot from the other ongoing discussion about mp3 readers.
I’ve been doing much of my digital signal processing recently from within Juce, so it was interesting to look at a completely different way of doing it, namely the mpg123 people’s solution.
Juce’s layout gives each track its own separate block of contiguous memory, whereas mpg123 interleaves the stereo samples. If you mapped these to arrays in memory, you’d get definitions that look like this:[code]
SampleType* parallel_samples[CHANNELS];
SampleType interleaved_samples[SAMPLE_COUNT][CHANNELS];
[/code]Note that the difference isn’t just that there are two arrays, one being the transposition of the other - these are fundamentally different data structures. The Juce structure is a short 1-dimensional classic C array, one entry per channel, each entry a pointer into heap memory; the mpg123 structure is a long 2-dimensional classic C array, a “rectangular” block of memory.
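To make that concrete, here’s a minimal sketch of how each might be set up - the allocation details are mine, not from either library, and I’m assuming 16-bit samples for concreteness:[code]
typedef short SampleType;
enum { CHANNELS = 2, SAMPLE_COUNT = 44100 };

// Juce-style "parallel" layout: a tiny array of pointers, each one
// pointing at its own heap block of contiguous samples.
SampleType* parallel_samples[CHANNELS] = {
    new SampleType[SAMPLE_COUNT],   // left track
    new SampleType[SAMPLE_COUNT],   // right track
};

// mpg123-style "interleaved" layout: one rectangular block, frames
// laid out side by side in memory: L R L R L R ...
SampleType interleaved_samples[SAMPLE_COUNT][CHANNELS];
[/code]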
One of the things that makes modern processors so fast is their caches - memory is so slow that the processor can often speed things up dramatically by preloading data it anticipates it will soon need into the cache.
Another way to get dramatic speed increases is to use processor extensions like SSE, which basically give you a free vector/parallel processor - if the compiler is able to figure out how to use it.
And we all hate disk thrashing, cause of many a “wait”.
Let me introduce a key concept when you’re trying to do fast digital signal processing, and that is locality - something that’s deeply implicated in all three of the above issues - particularly sequential locality and spatial locality.
Let’s start with the processor’s cache. The value of your cache is how often it saves a trip to memory. There are only two ways a piece of data could already be in the cache - either it’s recently been requested, or the processor has prefetched it.
There’s only so much you can get out of caching previously visited values - and the compiler is already trying its best to keep those values in registers, which is better still, but makes that caching strategy even less important.
Far more effective is for the processor to optimistically prefetch data. Memory is slow, but it reads quite a bit faster in sequence, so it makes sense for the processor to detect that you’re reading a sequential block of memory and fetch the next big chunk of it - you lose nothing if that data is never used, and if it is, you score a big win.
“Sequential locality” is just a fancy term for “having the data you process mostly in sequential order”. I hope the explanation of the cache makes it clearer how sequential locality makes your cache perform better.
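If you want to convince yourself, here’s a toy experiment - entirely illustrative, the exact numbers depend on your machine. Both functions below sum every element exactly once, but on a buffer much larger than the cache the strided version defeats the prefetcher and typically runs noticeably slower:[code]
// Same arithmetic, different locality. The sequential version lets the
// prefetcher stream memory in; the strided version (given a big enough
// stride) touches one element per cache line and wastes the rest of it.
long sum_sequential(const short* data, long n)
{
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

long sum_strided(const short* data, long n, long stride)
{
    long sum = 0;
    for (long start = 0; start < stride; ++start)
        for (long i = start; i < n; i += stride)
            sum += data[i];
    return sum;
}
[/code]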
Now, you never know whether the compiler is going to be smart enough to pop out MMX or SSE or whatever, or whether the processor your code runs on will support it - but you know this does actually work for at least game developers, who clearly reap serious benefits from the extended instruction sets.
The advantage of modern compilers like GCC is that they’ll work this all out for you, if they can determine that it’s possible. However, they’re going to be much better at doing this if your code is, yes, sequentially local - if the compiler can see that you’re simply stepping through memory sequentially, performing the same operations, it can parallelize your code.
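As an aside, GCC will even tell you whether it managed it. A sketch of the friendliest possible loop shape, with the relevant flags in the comments (check your compiler’s docs for the exact wording of the report):[code]
// A loop of this shape is about as vectorizer-friendly as it gets:
// two pointers stepping through memory in strict sequence.
// With GCC, compiling at -O3 (or -O2 -ftree-vectorize) together with
// -fopt-info-vec prints a note for each loop it vectorizes;
// loops it can't prove safe stay silent.
void halve(short* out, const short* in, long n)
{
    for (long i = 0; i < n; ++i)
        out[i] = short(in[i] / 2);
}
[/code]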
The mpg123 developers are clearly very interested in getting performance out of the extended instruction sets - they even have machine language code for them at points.
And again, sequential locality and spatial locality reduce disk thrashing due to memory paging. If you access your data sequentially, it’s almost certain that any page you fetch will be used in full; and your page-use pattern is very simple - you request a page, use it continuously for a while, then never touch it again - which is very good for the virtual memory system.
So now we have all this under our belts, let’s look at the two formats again and compare them on a simple job: mixing stereo down to mono.[code]
// parallel (Juce-style) layout
for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    outSamples[s] = int((parallel_samples[0][s] + long(parallel_samples[1][s])) / 2);

// interleaved (mpg123-style) layout
for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    outSamples[s] = int((interleaved_samples[s][0] + long(interleaved_samples[s][1])) / 2);
[/code]These are equivalent to the following constructions:[code]
parallel_samples[0][s] == *( *parallel_samples + s);
parallel_samples[1][s] == *( *(parallel_samples + 1) + s);
interleaved_samples[s][0] == *(*interleaved_samples + s * 2);      // with CHANNELS == 2
interleaved_samples[s][1] == *(*interleaved_samples + s * 2 + 1);[/code]
The code looks very similar - but in fact there’s a lot more going on in the first loop. There are actually three places in memory being read: the block called parallel_samples, which is an array of just two pointers; parallel_samples[0], a block of memory containing the left track; and parallel_samples[1], a third block containing the right track.
In the second loop, there’s only one block of memory being read: interleaved_samples, a contiguous array of samples. There’s only one hotspot in memory, and it’s twice as long, so optimistic cache reads and page fetches are twice as effective.
Worse, consider the quandary the poor optimizer faces in the first case. Think of the big gains it could make if it could extract parallel_samples[0] and parallel_samples[1] into registers and simply increment them each time! But unless the compiler has some way to know that no one else is touching parallel_samples[0] and parallel_samples[1], this is an incorrect optimization - it has to perform the dereference each time. The issue is called aliasing - there is a way to tell the compiler that memory is unaliased (C99’s restrict keyword, or the __restrict extension many C++ compilers offer) - but I’ll bet most compilers can’t do the double optimization so well anyway.
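Here’s a sketch of what that promise looks like in practice - the function and names are mine, and __restrict is the common C++ spelling of C99’s restrict:[code]
// The __restrict qualifiers promise the compiler that out, left and
// right never overlap, so it may keep them in registers and vectorize.
// Break the promise and you get silently wrong answers, so take care.
void mix_to_mono(int* __restrict out,
                 const short* __restrict left,
                 const short* __restrict right,
                 long n)
{
    for (long s = 0; s < n; ++s)
        out[s] = int((left[s] + long(right[s])) / 2);
}
[/code]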
Now, consider how much happier the optimizer is in the second case. The compiler can easily extract the common term interleaved_samples[s] into a single register and increment it, and extract outSamples into another register and increment that.
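Hand-writing what the optimizer can effectively do - purely to illustrate, and assuming outSamples is an array of int, as the casts in the loop above suggest:[code]
// Roughly the pointer-bumping shape the optimizer can reduce the
// interleaved loop to:
const SampleType* in = &interleaved_samples[0][0];  // one base pointer
int* out = outSamples;
for (int64 s = 0; s < SAMPLE_COUNT; ++s)
{
    *out++ = int((in[0] + long(in[1])) / 2);
    in += 2;  // advance one stereo frame
}
[/code]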
And once it’s done that, is this not an obvious candidate for optimizing for extended instruction sets? Again, you’re never sure if you’re going to benefit from these optimizations, but practice shows that if you compile for these you see a definite performance increase on quite a lot of machines. I’d think if any possible chunk of code could be optimized to be run in parallel, this code could.
In the parallel case, the aliasing issue probably prevents any such optimization from happening - and even if it didn’t, the optimization would need twice as many registers, and that’s in this case, where we’ve hard-coded stereo (the indices 0 and 1).
If there were more than two channels, or if the number of channels were not hard-coded, then this optimization and the register optimizations are of course right out for the parallel case.
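To see why, here’s the same job generalized to a channel count that’s only known at run time - a sketch, with names of my own:[code]
// Interleaved: still one pointer walking one block; the inner loop
// over channels stays sequential in memory.
void mix_interleaved(int* out, const short* in, long frames, int channels)
{
    for (long s = 0; s < frames; ++s)
    {
        long sum = 0;
        for (int c = 0; c < channels; ++c)
            sum += in[s * channels + c];
        out[s] = int(sum / channels);
    }
}

// Parallel: one live pointer per channel, and every frame hops between
// N separate heap blocks - the register and vectorization story gets
// worse with every channel you add.
void mix_parallel(int* out, short* const* tracks, long frames, int channels)
{
    for (long s = 0; s < frames; ++s)
    {
        long sum = 0;
        for (int c = 0; c < channels; ++c)
            sum += tracks[c][s];
        out[s] = int(sum / channels);
    }
}
[/code]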
There’s a final point: the heap need not be involved at all in the construction of an interleaved sample. This gives you a lovely bonus - you can actually specify a sample list as a compile-time constant, which means it gets loaded into memory with the rest of your program and plays instantly thereafter, with no extra memory consumption at all: you just copy data straight from static memory to the output stream. This is very handy for key sounds, like error alerts, that you always want to play instantly. (You’d have to write a little code to output your sample as a big C array definition, but people do that all the time…)
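For instance, the generated file could look something like this - the tool and the sample values here are made up, it’s just to show the shape:[code]
/* Sketch of what a (hypothetical) sample-to-C generator might emit.
   The values below are invented; a real file would hold your PCM data. */
static const short error_beep[] = {
    0, 3211, 6392, 9511, 12539, 15446,  /* ...the rest of the samples... */
};
static const unsigned error_beep_count =
    sizeof(error_beep) / sizeof(error_beep[0]);

/* At playback time you copy error_beep straight to the output stream -
   no heap allocation, no decode, no disk read. */
[/code]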