Best layout for samples in memory?

This is an offshoot from the other ongoing discussion about mp3 readers.

I’ve been doing much of my digital signal processing recently from within Juce, so it was interesting to look at a completely different way of doing it, that being the mpg123 people’s solutions.

Juce’s layout keeps each track in a separate block of contiguous memory, whereas mpg123 interleaves stereo samples. If you mapped these to arrays in memory, you’d get definitions that looked like this: SampleType* parallel_samples[CHANNELS]; SampleType interleaved_samples[SAMPLE_COUNT][CHANNELS]; Note that the difference isn’t just that there are two arrays, one being the transposition of the other - these are actually fundamentally different data structures. The Juce data structure is a short 1-dimensional classic C array, one entry per channel, each entry being a pointer into heap memory; the mpg123 data structure is a long 2-dimensional classic C array, a “rectangular” block of memory.
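
To make the two shapes concrete, here’s the same pair of declarations again as a minimal annotated sketch (the comments are mine, purely illustrative):[code]
// Parallel (Juce-style): a short array of per-channel pointers into heap memory.
SampleType* parallel_samples[CHANNELS];                  // parallel_samples[c][s] = sample s of channel c

// Interleaved (mpg123-style): one rectangular block, frame after frame.
SampleType interleaved_samples[SAMPLE_COUNT][CHANNELS];  // interleaved_samples[s][c]
// Memory order for stereo: L0 R0 L1 R1 L2 R2 ...[/code]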

One of the things that makes modern processors so fast is their caches - memory is so slow that the processor can often speed things up dramatically by preloading data into the cache that it anticipates that it will soon need.

Another way to get dramatic speed increases is by using processor optimizations like SSE which basically give you a free vector/parallel processor - if the compiler is able to figure out how to do it.

And we all hate disk thrashing, cause of many a “wait”.

Let me introduce a key concept when you’re trying to do fast digital signal processing, and that is locality - something that’s deeply implicated in all three of the above issues - particularly sequential locality and spatial locality.

Let’s start with the processor’s cache. The value of your cache is how often it saves a trip to memory. There are only two ways a piece of data could already be in the cache - either it’s recently been requested, or the processor has prefetched it.

There’s only so much you can get out of caching previously visited values. And the compiler is trying its best to put these values into registers, which is better, but makes this strategy less effective.

Far more effective is for the processor to optimistically pre-fetch data. Memory is slow, but it reads quite a bit faster in sequence, so it makes sense for the processor to detect that you are reading a sequential block of memory and fetch the next big chunk of it - you lose nothing if that data is never used, and if it is, you score a big win.

“Sequential locality” is just a fancy term for “having the data you process mostly in sequential order”. I hope the explanation of the cache makes it clearer how sequential locality makes your cache perform better.

Now, you never know whether the compiler is going to be smart enough to pop out MMX or SSE or whatever, or whether the target processor is going to be able to run those instructions - but you know that this does actually work for at least game developers, who clearly reap serious benefits from their use of the extended instruction sets.

The advantage of modern compilers like GCC is that they’ll work all of this out for you if they can see that it’s possible. However, they’re going to be much better at doing this if your code is, yes, sequentially local - if the compiler can see that you’re simply marching through memory sequentially performing the same operation, it can parallelize your code.

The mpg123 developers are clearly very interested in getting performance out of the extended instruction sets - they even have machine language code for them at points.

And again, sequential locality and spatial locality reduce disk thrashing due to memory paging. If you are accessing your data sequentially, you can be almost certain that if you fetch a page you will use all of it; and your page-use pattern is very simple - you request a page, use it continuously for a while, and then don’t touch it again - which is very good for the swap subsystem.

So now we have all this under our belts, let’s look at the two formats again and compare them on a simple job: processing stereo into mono.[code]
for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    outSamples[s] = int((parallel_samples[0][s] + long(parallel_samples[1][s])) / 2);

for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    outSamples[s] = int((interleaved_samples[s][0] + long(interleaved_samples[s][1])) / 2);
[/code]These are equivalent to the following constructions:[code]
parallel_samples[0][s] == *( *parallel_samples + s);
parallel_samples[1][s] == *( *(parallel_samples + 1) + s);

interleaved_samples[s][0] == *(&interleaved_samples[0][0] + s * 2);
interleaved_samples[s][1] == *(&interleaved_samples[0][0] + s * 2 + 1);[/code]

The code looks very similar - but in fact there’s a lot more going on in the first loop. There are actually three places in memory being read: the block of memory called “parallel_samples”, which is an array of just two pointers; parallel_samples[0], a block of memory containing the left track; and parallel_samples[1], a third block of memory, containing the right track.

In the second loop, there’s only one block of memory being read: interleaved_samples, which is a contiguous array of samples in memory. There’s only one hotspot in memory, and it’s twice as long. Optimistic cache reads and page swaps are twice as effective.

Worse, consider the quandary that the poor optimizer finds itself in in the first case. Think of the big gains it could make if it could extract out parallel_samples[0] and parallel_samples[1] into registers and simply increment them each time! But unfortunately, unless the compiler has some way to know that no one else is touching parallel_samples[0] and parallel_samples[1], this is an incorrect optimization - it has to perform the dereference each time. The issue is called aliasing - there’s a compiler directive to point out that memory is unaliased, but I’ll bet most compilers can’t do the double optimization so well anyway.

Now, consider how much happier the optimizer is in the second case. The compiler can easily extract the common term interleaved_samples[s] into a single register and increment it, and extract outSamples into another register and increment that.
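
As a sketch of what that ends up looking like (hand-hoisted here purely for illustration - this is roughly what a decent optimizer can do on its own, assuming outSamples is an int buffer as the casts above suggest):[code]
const SampleType* in = &interleaved_samples[0][0];  // one running pointer for the whole interleaved block
int* out = outSamples;                              // and one for the output
for (int64 s = 0; s < SAMPLE_COUNT; ++s) {
    *out++ = int((in[0] + long(in[1])) / 2);
    in += 2;                                        // advance one stereo frame
}[/code]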

And once it’s done that, is this not an obvious candidate for optimizing for extended instruction sets? Again, you’re never sure if you’re going to benefit from these optimizations, but practice shows that if you compile for these you see a definite performance increase on quite a lot of machines. I’d think if any possible chunk of code could be optimized to be run in parallel, this code could.

In the parallel case, the aliasing issue probably prevents any such optimization from occurring - and if it did, it would require twice as many registers, in this case where we’ve hard-coded stereo (the numbers 0 and 1).

If there were more than two channels, or if the number of channels were not hard-coded, then this optimization and the register optimizations are of course right out for the parallel case.

There’s a final point, and that’s that the heap is not necessarily involved at all in the construction of an interleaved sample buffer. This gives you the following lovely bonus - you can actually specify a sample list as a compile-time constant, which means that it gets loaded into memory with the rest of your program and then plays instantly thereafter with no extra memory consumption at all; you just copy data right from static memory to the output stream. This is very handy for key sounds like error beeps that you want to always work instantly. (You’d have to write a little code to output your sample as a big C array definition, but people do that all the time…)
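
For what it’s worth, here’s a sketch of what that might look like (the sound data and names are made up; a real table would be spat out by the little generator tool described above):[code]
// Hypothetical example: a short interleaved stereo clip baked into the program image.
static const short kErrorBeep[][2] = {
    { 0, 0 }, { 4096, 4096 }, { 8192, 8192 }, { 4096, 4096 }, { 0, 0 }
    // ... thousands more frames, emitted by a code generator ...
};
static const int kErrorBeepFrames = sizeof(kErrorBeep) / sizeof(kErrorBeep[0]);
// Playback just copies straight from static memory to the output stream - no heap needed.[/code]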

You know, most modern CPUs have multiple cache lines (and also convert most instructions into micro-code and reorder them to avoid memory latencies).

That said, there is no execution time difference between this:

a = [... from malloc(size) ]
b = [... from malloc(size) ]
c = malloc(size);
for (int i = 0; i < size; i++)
    c[i] = (a[i] + b[i]) / 2;  // This is the same as your parallel example, once the initial a & b pointers are computed

And this:

a = malloc(size * 2);
c = malloc(size);
for (int i = 0; i < size; i++)
    c[i] = (a[i*2] + a[i*2+1]) / 2; // This is the same as your interleaved example, once unfolded by the compiler

It’s very likely the compiler will emit prefetch instructions for the loop, but even if it doesn’t do so, the CPU will hit a bottleneck at the first iteration, load a bunch of values from a, b, and c into its caches, and from then on the performance will be the same.

Please notice that the last code is a bit harder to maintain because, as you said, it’s a distinct type (a contiguous 2D array, not an array of pointers to 1D arrays). So when you have 3 channels, or more, you have to write similar code for handling that particular type.

CPUs do have multiple cache lines - but only a fairly small number. Remember, you’re not the only cache user… if you’re writing a DSP plug-in, there might easily be dozens of other computations running at the same time, perhaps hundreds.

The difference in cost between “one cache line” and “one cache line per channel” becomes quite significant then, even if you’re only working in stereo.

I believe that your example ignores aliasing.

In order to get at the memory for a parallel track, you need to do TWO memory dereferences - one to look up the start of that track, and then one to get the sample within that track. But unfortunately, the compiler cannot cache that first memory dereference - because that pointer might have changed; not changed by another thread, which would be OK (it’s fine if different threads have different views of memory - they get synchronized when you hit a mutex), but because the compiler doesn’t know that the output track doesn’t point to the same memory location as the list of track pointers.

Now, a person of course looks at that code and says, “Well, of course the programmer wouldn’t put his sample buffer in the same place as his list of sample pointers” but the compiler has no way to know this without a compiler directive informing it that there is no aliasing going on. Specifically, in

for (int64 s = 0; s < SAMPLE_COUNT; ++s) outSamples[s] = int((parallel_samples[0][s] + long(parallel_samples[1][s])) / 2);

the compiler has no way to know that at runtime, one of these outSamples[s] assignments isn’t going to overwrite parallel_samples[0] or parallel_samples[1]. You’d be an idiot to write code like that here, sure, but in some cases you write perfectly correct code which does use aliasing, and the language allows it, so the compiler cannot rule it out.

Now, you can instruct the compiler that aliasing cannot occur, using the restrict type qualifier - and you can see this quite a lot of the time on clib functions, for example: void * memcpy(void *restrict s1, const void *restrict s2, size_t n); - but the restrict qualifier is not respected by some (many, most, any?) modern compilers - in the example of memcpy() above, it’s more of a warning to programmers that s1 and s2 can’t overlap, and at least on my system, memcpy() is implemented with hand-written assembly that ignores the issue of aliasing.
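
For reference, here’s a sketch of how that loop could be written with the qualifier (restrict is C99; in C++ most compilers spell it __restrict or __restrict__ - and whether the optimizer actually exploits it is another matter, as noted). The helper and its parameter names are hypothetical, not from the original code:[code]
// The __restrict qualifiers promise the compiler that out, left and right
// never overlap, so it may hoist the pointers and vectorize the loop.
void stereo_to_mono(int* __restrict out,
                    const short* __restrict left,
                    const short* __restrict right,
                    long long count)
{
    for (long long s = 0; s < count; ++s)
        out[s] = int((left[s] + long(right[s])) / 2);
}[/code]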

For this very reason, when I process Juce-style audio arrays, I always make a point of doing them first by channel, then by sample; and I extract out the pointer to the base of the sample as a local variable. Now there’s no possible aliasing problem and the optimizer can almost certainly extract that local variable into a register.
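
In code, the pattern is something like this (a minimal sketch, not tied to any particular Juce version; channelPointers, numChannels and numSamples are stand-ins for whatever your buffer hands you):[code]
for (int c = 0; c < numChannels; ++c) {
    float* const data = channelPointers[c];  // base pointer hoisted into a local, once per channel
    for (int s = 0; s < numSamples; ++s)
        data[s] *= 0.5f;                     // whatever per-sample processing you need
}[/code]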

I’m not quite sure I’m getting you here. As far as the programmer is concerned, they’re writing exactly the same code to access an array:[code]for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    for (int c = 0; c < CHANNELS; ++c)
        process(parallel_samples[c][s]);

for (int64 s = 0; s < SAMPLE_COUNT; ++s)
    for (int c = 0; c < CHANNELS; ++c)
        process(interleaved_samples[s][c]);[/code]
Under the covers, there are two different operations going on here - the first one has two memory dereferences, the second has one - but the application programmer doesn’t need to know about it.

Yes, you can’t just pass a pointer to a single track’s raw data - but I at least don’t seem to have that in my code at all, because I always want to have the length of the track if not other data, so I end up passing an entire AudioSampleBuffer every time.

I didn’t start to write this article with the intention of bashing Juce’s, er, Jules’ excellent system that is providing me with so many programming facilities! Instead, I was looking through the mpg123 package and trying to figure out why they had the memory arranged in a way I didn’t expect… then I did some analysis and concluded that their method of arranging data in memory was primarily for convenience in sending and receiving streamed data, and secondarily because of sequential and spatial locality.

It’s pretty rare in your coding life that you’ll really be concerned with details like this, because it’s generally pretty rare that your application is CPU-bound - nearly all real-world applications these days are in fact memory-bound. But DSP is one of those areas where it matters.

When the big guys write code for maximum speed, they do various analyses - for locality, and, even easier, just counting memory hotspots: how many areas in memory are being heavily used at the same time. Fewer hotspots is better - you get fewer page swaps and better cache performance… No matter what, you can’t get around the fact that you have at least one more hotspot per channel in the parallel format - and you might do a lot worse than that if the optimizer can’t get past that aliasing issue.

This was more to satisfy my curiosity but eventually I’ll experiment with some benchmarks… probably this summer.

A little experiment to check how well the cache works. The code below is by no means a formal benchmark but the results might be illuminating.

Paste the code below into a file called test.cpp and then type at the command line:

g++ test.cpp -o test && ./test

g++ -O2 test.cpp -o test && ./test

[or use your favorite generic C++ compiler…]

The first one shows the results of sequential locality for an unoptimized build; the second for an optimized build. Try it, and I’ll post my results later (I don’t want to spoil the surprise…)

[Sorry for not including this as an attachment but I got a series of errors: “the extension cpp is not allowed”, “the extension txt is not allowed”]

[code]#include <time.h>
#include <stdio.h>
#include <stdlib.h>

typedef long long int int64;

static const int64 PRIME1 = 51234577L;
static const int64 PRIME2 = 20000003L;

int64 test(bool linear) {
    char* buffer = (char*) malloc(PRIME1);
    int64 time = -int64(clock());

    if (linear) {
        for (int64 i = 0; i < PRIME1; ++i) {
            // Same arithmetic as below, but the write walks the buffer sequentially.
            int64 nonlinear = (i * PRIME2) % PRIME1;
            buffer[i] = char(nonlinear);
        }
    } else {
        for (int64 i = 0; i < PRIME1; ++i) {
            // (i * PRIME2) % PRIME1 visits every index exactly once, in a scattered order.
            int64 nonlinear = (i * PRIME2) % PRIME1;
            buffer[nonlinear] = char(nonlinear);
        }
    }

    time += clock();
    free(buffer);

    return time;
}

int main() {
    printf("nonlinear: %lld\n", test(false));
    printf("linear: %lld\n", test(true));
}[/code]

restrict is one thing, what the CPU does is another thing.
Even with the worst compiler with no optimisation at all, dereferencing a pointer that’s already in the cache costs virtually nothing on the CPU (since the micro-code will move the previous/next instructions around while the cache stalls). Since the address of the channel’s sample array will likely be close in memory, the first access will be slow (a good compiler can prefetch it during the preceding code), but any further access in the loop will be almost instantaneous.

Under x86, the cache is PIPT (physically indexed, physically tagged), so whatever address you use (it can come from anywhere), if the data is in cache for any thread, it’s “magically” available to the current processing unit.
Under ARM, the cache is VIPT (virtually indexed, physically tagged), so your point might be true for that processor.

Cache lines nowadays are relatively big, and I really doubt that you’re processing more than 4k samples at a time. Even with float, and with 16 channels, that still fits in the L1 cache of any modern CPU.
I would worry more about the limited register count available on x86, or the limited number of ALUs, than about the memory.
In any recent x86 CPU, you don’t have more than 4 ALUs, and 8 usable registers (12 on x64).
In the trivial example I sent earlier, memory would be the bottleneck, but as soon as you add/mul/clamp & test, the memory limit disappears very fast and the bottleneck becomes processing power.

In any case, profiling will judge who’s right. The parallel pattern is, in theory, slower, but since it’s used everywhere, I strongly doubt the interleaved one will turn out way faster in practice.

For example, in my app I’m using files where I have 3 or 5 channels. How can I write a function handling them?
The type of the buffer is, for example, “float [sampleCount][channelCount]”. Whenever the sampleCount or the channel count changes (depending on the layout you prefer), you have a new type, which you can’t convert to/from without a function.
I’m not going to overload my function for all possible cases (sampleCount or channelCount).

That’s the format most audio decoders/encoders use (like Lame / OggVorbis). Using it means less code on your side to convert to/from.
Again, as soon as you don’t fit the exact same configuration as theirs, you have to extract the data, and you’re screwed.
Please notice that this format is often used too:

SampleType array[channels * sampleCount];

But it has the intrinsic issue that you can’t process it easily (since the samples are interleaved).
I’m almost sure mpg123 extracts the data internally to process it, since the audio channels are processed almost independently in MP3 above 128 kbit/s.
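
A sketch of that extraction step (names hypothetical) - de-interleaving the flat buffer above into per-channel arrays so each channel can be processed on its own:[code]
void deinterleave(const SampleType* in, SampleType** out,
                  long sampleCount, int channels)
{
    for (long s = 0; s < sampleCount; ++s)
        for (int c = 0; c < channels; ++c)
            out[c][s] = in[s * channels + c];  // sample s of channel c in the flat layout
}[/code]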

To me, all real-world applications processing audio are CPU-bound. Look at the CPU graph of a running audio processing application - the CPU peaks at 100% immediately.
You’ll have difficulties saturating your memory bus with audio samples!
I’m doing video work, and I do reach the memory bandwidth limit on video processing; in that case the CPU doesn’t peak, the I/O doesn’t peak, but the application can’t go faster. Immediately, all other applications start to slow down.
A memory bus on a recent computer can handle 2 GB/s with no issue; I’m processing 5 MP pictures at 25 fps on a Core 2 Duo.

Just posted the results of my little benchmark (which doesn’t negate at all the interesting follow-up in this thread, which I’ll have to get to tomorrow as I’m out of time…)

(Note that this isn’t testing interleaved vs parallel but just testing the gross effects of linearity and optimization on a simple case…)

I have a hard time understanding your test. What I was speaking about was this:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

typedef long long int int64;

static const int64 PRIME1 = 8192L;

int64 testP()
{
  char * a = new char[PRIME1];
  char * b = new char[PRIME1];
  char * c = new char[PRIME1];
  char * arr[2] = {a, b};
  int64 time = -int64(clock());
  for (int i = 0; i < 100000; i++)
  {
    for (int j = 0; j < PRIME1; j++)
    {
       c[j] = arr[0][j] + arr[1][j];
       arr[0][j] = c[j]; // Need to modify the source to prevent the compiler from optimizing the loop
    }
  }
  time += clock();
  return time;
}
int64 testI()
{
  char* a = new char[PRIME1*2];
  char *c = new char[PRIME1];
  int64 time = -int64(clock());
  for (int i = 0; i < 100000; i++)
  {
    for (int j = 0; j < PRIME1; j++)
    {
        c[j] = a[j*2] + a[j*2+1];
        a[j*2] = c[j]; // Need to modify the source to prevent the compiler from optimizing the loop
    }
  }
  time += clock();
  return time;
}

int main() {
  printf("nonlinear: %lld\n", testP());
  printf("linear: %lld\n", testI());
  return 0;
}

Both tests run in the same time for me:
g++ -o t t.cpp && ./t
nonlinear: 5980000
linear: 6080000

I’m sorry, I wasn’t clear! My first test has nothing to do with the parallel vs. interleaved controversy. I wanted to write a benchmark that simply tested how important sequential locality was in the simplest possible test.

Did you try my benchmark? I think you’ll be surprised…

Now, I think your test is too easy. Let’s start with this comment…
// Need to modify the source to prevent the compiler from optimizing the loop

Surely the idea is exactly to see if the compiler optimizes the loop? Isn’t that my argument?

I’d go further and say that the case where we read from one set of samples and write to another accounts for 95% or more of the DSP code ever written. I can’t personally remember ever writing code where I modified both my input and my output streams at the same time.

Now, let’s go back and look at that aliasing issue. You have char * a = new char[PRIME1]; char * b = new char[PRIME1]; char * c = new char[PRIME1]; char * arr[2] = {a, b}; but this means that the compiler now knows that there is no aliasing issue with arr[0] and arr[1], because arr is a local variable (and probably allocated on the stack - but regardless, the compiler knows that you can never get an alias to it).

So this might make the parallel strategy look better than it is.

BUT, there’s also the issue that you’re allocating two separate blocks in memory for the channels, whereas Juce goes one better by allocating all the samples with one malloc (see the method allocateData()) and then handing out pointers to segments of that block - so that’s perhaps making the non-interleaved look worse.
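
Roughly, that scheme looks like this (my own sketch, not Juce’s actual allocateData() code; numChannels and numSamples are placeholders):[code]
// One allocation for all channels, then per-channel pointers carved out of it.
float* block = (float*) malloc(sizeof(float) * numChannels * numSamples);
float* channels[16];                               // room for up to 16 channels in this sketch
for (int c = 0; c < numChannels; ++c)
    channels[c] = block + (size_t) c * numSamples; // channel c starts numSamples floats further in
[/code]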

I wrote a new benchmark to take these two factors into account and put it here: http://ax.to/benchmark - and you can see both benchmarks including the previous one here: http://ax.to/benchmarks with a Makefile.

Remove that part if you want; it doesn’t change the results in any way. I put it there because a smart compiler could just remove all of the remaining iterations, since they don’t do anything different from the first one.
GCC doesn’t perform this optimization anyway.

When I wrote this test I thought about the aliasing, and tried this: “extern char * getA();” “extern char * getB();”, moving the allocation of a and b into a different file. The results are almost the same (interleaved better by less than 5%).

I’ve tried them and got the results you’re expecting - that is, random access to memory each loop is slower than sequential access.
What I don’t understand is: where do you find this pattern in real code?
What I was saying is that the first access to a new memory stride is slow, but any later access is as fast as if you had interleaved. Your test shows the influence of that penalty (which is not that big), since you’re dealing with a different memory address each iteration.

There are two separate benchmark-like tests.

One of them tests randomly accessing a large block of memory versus accessing it sequentially - I was just curious, but I was pretty surprised that the sequential code was TEN times as fast as the random-access code. I went out of my way to make sure that all the calculations were the same and that the compiler couldn’t optimize any of them out…

The idea was not that real code goes out of its way to randomly wander through memory - it’s to show how much of an advantage your cache really gives you IF you can arrange to visit the data sequentially. I would however point out that code that has a large number of small objects will quite likely exhibit similar characteristics…

The second benchmark compares interleaved vs parallel, duplicating as nearly as possible the data structures used by Juce and mpg123. The parallel code takes between 25% and 100% more clock cycles to complete on my machine, depending on various settings.

Putting your pointers behind a function absolutely doesn’t prevent them from being optimized out, or the compiler from figuring out that they can’t be aliased. As near as I can see, the only way to guarantee that is to put those pointers into a spot where someone else could change them - i.e., put them into an array.

This is wrong. With the functions declared as extern in t.cpp and implemented in v.cpp (i.e. not visible to the compiler when compiling t.cpp), the compiler can’t figure out the memory layout and must treat the pointers as if they could be aliased.

After all, v.cpp could do “char * getA() { return getB(); }”

Thanks for starting this thread! JUCE is great as an easy-to-use framework, but I’ve often wondered what sort of optimizations we may be losing out on by all the Java-like abstractions (and without the possibility to JIT). Just to point it out since I haven’t seen it yet, there is also a potential big gain to be had by making sure your blocks are properly aligned. For instance, in standard x86, it’s possible to index memory at the byte level, but when copying to the SIMD registers, the data needs to be word-aligned (for whatever particular definition of word… yes, this has been strangely corrupted by Intel over the years). If the words you’re indexing (32-bit floats, for example) are split across two aligned sections, the compilers will generate additional copy instructions in order to seamlessly move them into the right place. Paying attention to these little tricks can often result in order-of-magnitude speedups! It would be great to have JUCE take care of as much of this as possible (and I’m sure it already does a fair bit of it)… perhaps this thread could be the start of a ‘recommended practices’ for this area?
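
As one concrete (and hedged) example of what “properly aligned” can mean in practice: on POSIX systems you can ask for 16-byte alignment explicitly, which is what the SSE registers want; other platforms have their own equivalents (e.g. _aligned_malloc on MSVC). The helper name here is made up:[code]
#include <stdlib.h>

// Returns a 16-byte-aligned buffer of 'count' floats, or NULL on failure.
// Memory obtained this way is released with the normal free().
float* allocate_aligned_samples(size_t count)
{
    void* p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return (float*) p;
}[/code]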

JIT? Huh? It’s compiled C++. Your JIT won’t even get near that, would it?

Apart from the fact that juce’s abstractions all work ‘with’ C++, rather than attempting to ‘fix’ it, like Qt etc etc.

Bruce

Thank you so much for giving us an informative concept. I do admire the effort and am eager to know more.