Android performance issues

Good day colleagues,

This is my second week struggling with Android, and its performance is very confusing to me.

On a clean project, without any effects, in release mode, with all possible optimizations and even a large buffer of 1920 samples, the CPU core goes over 100% and Android hangs indefinitely. Below is pseudocode that crudely simulates some kind of complex activity.

And I got the following results:
Average CPU load:
Apple M1 - 25-27%
Apple A12X - 35-45%
A10 Fusion - 60.4-61%
Intel x64 i5 - 60-70%
Snapdragon 855 - more than 100%, even if I use only 512 cycles. Sound is stable only with 256 cycles.

Build settings:
-Ofast
-ffast-math
LTO
release

This looks wild to me :(. Is Android really 16 times slower? Why can Apple’s processors handle 4096, even 16384 cycles without any problem, while the Snapdragon 855 manages only 256? Or is this something to do with Google’s OBOE?

Thank you!

Pseudo code:

inline void processBlock(AudioBuffer<float> &buffer, int length) noexcept {

            float outputBuffer[2][length];
            for (int i = 0; i < length; ++i) {
                outputBuffer[0][i] = 0.0f;
                outputBuffer[1][i] = 0.0f;
            }

            // oversampling 4x
            for (int s = 0; s < 4; ++s) {

                // 16 notes/voices
                for (int v = 0; v < 16; ++v) {

                    // 4 oscillators
                    for (int o = 0; o < 4; ++o) {

                        // 16 unison for each osc
                        for (int u = 0; u < 16; ++u) {

                            // length = samples in this block (44100 or 48000 samples make up one second)
                            for (int i = 0; i < length; ++i) {

                                auto &pos = phaseAccumulator_[v][o][u];

                                // get amplitude
                                auto sound = waveSample_[(int) pos];

                                // fill buffer
                                outputBuffer[0][i] += sound; // channel 1
                                outputBuffer[1][i] += sound; // channel 2

                                // move phase
                                pos += 0.02321995464f * 440.0f; // hz
                                if (pos >= 1024.0f) {
                                    pos -= 1024.0f; // wrap around the 1024-sample table
                                }
                            }
                        }
                    }
                }
            }

            auto *bufferRef = buffer.getArrayOfWritePointers();
            for (int i = 0; i < length; ++i) {
                auto gain = 0.0001f;
                bufferRef[0][i] = outputBuffer[0][i] * gain;
                bufferRef[1][i] = outputBuffer[1][i] * gain;
            }

        }

private:
           float waveSample_[1024]{}; // you can keep it with zero or fill with sine, square...
           float phaseAccumulator_[16][4][16]{};

P.S. I deliberately did not use FloatVectorOperations here, because I want to test exactly the same basic code on different platforms.

Of course, this pseudocode looks terrible and is not optimized, but if I optimize it, the results will also improve on the Apple and Intel CPUs. On the M1, with float vector optimizations and some rearranging, I even got down to a 2-3% load. So my goal is not to optimize the code above, but to understand why it is so slow on Android.
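To put the workload in numbers, here is a minimal sketch of the arithmetic, using the loop counts from the pseudocode above. The function names are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Loop counts copied from the pseudocode above.
constexpr int kOversampling = 4;
constexpr int kVoices       = 16;
constexpr int kOscillators  = 4;
constexpr int kUnison       = 16;

// Inner-loop iterations needed to produce one output sample.
constexpr std::int64_t iterationsPerSample() {
    return std::int64_t{kOversampling} * kVoices * kOscillators * kUnison;
}

// Inner-loop iterations that must be sustained per second of audio.
constexpr std::int64_t iterationsPerSecond(int sampleRate) {
    return iterationsPerSample() * sampleRate;
}

// Time available for one inner-loop iteration at exactly 100% load, in ns.
constexpr double nanosecondsPerIteration(int sampleRate) {
    return 1.0e9 / static_cast<double>(iterationsPerSecond(sampleRate));
}

// Real-time deadline for one callback of bufferSize samples, in ms.
constexpr double callbackDeadlineMs(int bufferSize, int sampleRate) {
    return 1000.0 * bufferSize / sampleRate;
}
```

At 48000 Hz this gives 4096 iterations per sample, about 197 million iterations per second, roughly 5 ns per iteration at 100% load, and a 40 ms deadline for a 1920-sample buffer.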

@shannonburns That’s not what I’m asking. That’s why I mentioned that I am comparing the unoptimized pseudocode as-is across different CPUs. And of course this code is terrible, but it is a simple loop. Even if I vectorize it… the CPU load is still above 100%.

As I noted, in my tests the A12X can handle even 16384 “for” cycles. But with the Snapdragon 855, it’s only 255-300 cycles for me. So I’m trying to understand whether the Pixel 4 REALLY has such a slow CPU, or whether Google’s API is that bad.

It would be helpful if someone who has already worked a lot with Android audio could share their measurements. So far, I do not see any reason to continue developing my plugins for Android if it cannot cope with such a primitive task. It seems very suspicious to me that the difference is 64 times or more. On AMD, Intel and Apple ARM the performance is perfect for me.
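One way to gather comparable measurements is to time the same loop outside any audio API. The following is a minimal sketch under that assumption; measureOfflineLoad is a hypothetical helper, not part of JUCE or OBOE, and it runs the single-loop form of the oscillator code on the calling thread.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the thread: runs the oscillator loop with no
// audio API involved. Returns processing time divided by the duration of the
// audio produced (1.0 means exactly real time).
double measureOfflineLoad(int numVoices, int bufferSize, double sampleRate, int numBlocks) {
    assert(numVoices > 0 && bufferSize > 0 && sampleRate > 0.0 && numBlocks > 0);

    std::vector<float> wave(1024, 0.0f); // wavetable, left at zero as in the post
    std::vector<float> phase(static_cast<std::size_t>(numVoices), 0.0f);
    std::vector<float> left(static_cast<std::size_t>(bufferSize));
    std::vector<float> right(static_cast<std::size_t>(bufferSize));
    const float step = 0.02321995464f * 440.0f; // phase increment from the pseudocode
    volatile float sink = 0.0f;                 // keeps the optimizer from dropping the work

    const auto start = std::chrono::steady_clock::now();
    for (int b = 0; b < numBlocks; ++b) {
        std::fill(left.begin(), left.end(), 0.0f);
        std::fill(right.begin(), right.end(), 0.0f);
        for (std::size_t v = 0; v < phase.size(); ++v) {
            float& pos = phase[v];
            for (std::size_t i = 0; i < left.size(); ++i) {
                const float sound = wave[static_cast<std::size_t>(pos)];
                left[i] += sound;
                right[i] += sound;
                pos += step;
                if (pos >= 1024.0f) {
                    pos -= 1024.0f; // wrap around the table
                }
            }
        }
        sink = sink + left.front() + right.back();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double elapsedSeconds = std::chrono::duration<double>(stop - start).count();
    const double audioSeconds = static_cast<double>(numBlocks) * bufferSize / sampleRate;
    return elapsedSeconds / audioSeconds;
}
```

Calling, for example, measureOfflineLoad(1024, 96, 48000.0, 1000) on each device, built with the same flags, would give a figure to set against the in-app reading. If the offline figure were far below the in-app one, that would point away from raw CPU speed, although the sketch itself cannot say where the time goes instead.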

Thank you everyone!

Could it simply be the memory caches (L1 / L2), which seem smaller on the Snapdragon 855?

CPU / L1 / L2
Snapdragon 855 / 64 KB / 1.8 MB
A10 / 64 KB / 3 MB
Apple A12X / 128 KB / 8 MB
Apple M1 / 128 KB / 12 MB

Anyway, I have never used Android, so I cannot help more…

EDIT: on the Intel i5 the caches are smaller (64 KB / 256 KB), yet its result is not the worst.
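The cache question can be checked roughly by adding up the arrays the test loop touches. This is a minimal sketch with hypothetical function names, assuming a 4-byte float and using the array sizes from the pseudocode; it ignores the destination AudioBuffer and everything else on the stack.

```cpp
#include <cstddef>

static_assert(sizeof(float) == 4, "this estimate assumes 4-byte floats");

// Array sizes taken from the pseudocode.
constexpr std::size_t kWaveTableBytes = sizeof(float) * 1024;        // waveSample_
constexpr std::size_t kPhaseBytes     = sizeof(float) * 16 * 4 * 16; // phaseAccumulator_

// The temporary stereo buffer: 2 channels of `length` floats.
constexpr std::size_t outputBufferBytes(std::size_t length) {
    return sizeof(float) * 2 * length;
}

// Bytes touched by the inner loops for one block of `length` samples.
constexpr std::size_t workingSetBytes(std::size_t length) {
    return kWaveTableBytes + kPhaseBytes + outputBufferBytes(length);
}

// True if that working set is no larger than a cache of cacheBytes.
constexpr bool fitsInCache(std::size_t length, std::size_t cacheBytes) {
    return workingSetBytes(length) <= cacheBytes;
}
```

For a 1920-sample block this comes to 23552 bytes (23 KB), which would fit in 64 KB if the L1 figure quoted above is taken at face value. Treat it as a lower bound rather than a measurement.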

< code::dive conference 2014 - Scott Meyers: Cpu Caches and Why You Care - YouTube >

OT: If it was you, @vtarasuk, who flagged @shannonburns’s answer, please stop doing that.

I neither agree nor disagree with the statement. But the flagging mechanism is meant for answers that go against netiquette, such as abusive posts, hate speech, inappropriate advertisement or other unwanted behaviour.

Imagine being kicked out of your pub when you say something controversial, I guess you would consider that rude as well…

It was me who flagged the post. I flagged it as spam.

Ok, so apologies to @vtarasuk :)
The post looked legit, although admittedly with unfounded opinions. But then it’s good conversational style to reason about it.

Anyway, moving on

@daniel Yeah, it was not me. No problem :)

A few observations with the code below (I simplified it to one loop).

48000, 96 buffer
loop with 1024 voices - 63.7% CPU

48000, 192 buffer
loop with 1024 voices - 31.8% CPU

48000, 288 buffer
loop with 1024 voices - 21.9% CPU

And the sound is fine… but if I increase the number of voices, nothing can be heard. There is just a click at the start… and then silence. Meanwhile the CPU load shows 60-70%, but the Android OS hangs, so I need to reboot the device or wait until the app is closed.

48000, 96 buffer size:
loop with 2048 voices >= 100.0% CPU (no sound, android hangs)

48000, 192 buffer
loop with 2048 voices - 62-64% CPU (no sound, android hangs)

48000, 288 buffer
loop with 2048 voices - 41.38% CPU (no sound, android hangs)

48000, 1920 buffer
loop with 2048 voices - 6.2-10% CPU (still NO sound, android hangs)

44100, 1920 buffer
loop with 2048 voices - 9.44-12% CPU (sound fine)

44100, 384 buffer
loop with 2048 voices - 74-100% CPU (sound with some clicks)

44100, 96 buffer
loop with 2048 voices >= 100% CPU (surprisingly there is sound, but with some clicks at the start)

inline void processBlock(AudioBuffer<float> &buffer, int length) noexcept {

            float outputBuffer[2][length];
            for (int i = 0; i < length; ++i) {
                outputBuffer[0][i] = 0.0f;
                outputBuffer[1][i] = 0.0f;
            }

            // u - 1024 or 2048 or 4096
            for (int u = 0; u < 1024; ++u) {

                for (int i = 0; i < length; ++i) {

                    auto &pos = phaseAccumulator_[u];

                    // get amplitude
                    auto sound = waveSample_[(int) pos];

                    // fill buffer
                    outputBuffer[0][i] += sound; // channel 1
                    outputBuffer[1][i] += sound; // channel 2

                    // move phase
                    pos += 0.02321995464f * 440.0f; // hz
                    if (pos >= 1024.0f) {
                        pos -= 1024.0f; // wrap around the 1024-sample table
                    }
                }
            }

            auto *bufferRef = buffer.getArrayOfWritePointers();
            for (int i = 0; i < length; ++i) {
                auto gain = 0.0001f;
                bufferRef[0][i] = outputBuffer[0][i] * gain;
                bufferRef[1][i] = outputBuffer[1][i] * gain;
            }

        }

private:
           float waveSample_[1024]{};
           float phaseAccumulator_[4096]{};

I measured the load with getDeviceManager().getCpuUsage().

I do not know how accurate its readings are, but it is surprising that there is no sound even when the load is very small. And if I turn off OBOE, then there is sound, but it is very wheezy with a lot of clicks.

OBOE - enabled
OBOE stabilized callback - enabled (actually tried both ways)
NDK toolchain - clang (default)

P.S. 2: The strange thing is that on iOS or Mac, when I change the buffer size, the CPU load stays at the same level, i.e. there is no difference between a buffer size of 32 and 2048.
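The Android readings above can be cross-checked with a little arithmetic. This is a minimal sketch that assumes the figure from getCpuUsage() is the time spent in the callback divided by the duration of one buffer; that is an assumption about the measurement, not something confirmed in this thread, and the function names are hypothetical.

```cpp
// Duration of one buffer of bufferSize samples, in milliseconds.
constexpr double bufferDurationMs(int bufferSize, double sampleRate) {
    return 1000.0 * bufferSize / sampleRate;
}

// Time per callback implied by a reported load percentage, in milliseconds.
// Assumption: load = callback time / buffer duration.
constexpr double impliedCallbackMs(double loadPercent, int bufferSize, double sampleRate) {
    return loadPercent / 100.0 * bufferDurationMs(bufferSize, sampleRate);
}
```

With the 1024-voice readings at 48000 Hz, 63.7% at 96 samples, 31.8% at 192 and 21.9% at 288 all come out at roughly 1.3 ms per callback. If the loop really processed 96, 192 and 288 samples, that time would be expected to grow with the buffer, so it may be worth logging the length actually passed to processBlock. This is only one possible reading of the numbers.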

Any ideas what could be wrong? Or is it really a slow Google API/CPU?