I thought I'd mention that I rewrote AudioProcessorGraph a couple of weeks ago, mainly to speed up graph building but also to add a few other things I wanted. Unfortunately this is for a commercial product so I can't make it available, but you reminded me that I should write down a few of the things I found before I forget them completely.
Graph build performance
With the contrived test case I was using (tens of thousands of nodes and connections), the graph build took well in excess of a minute. I managed to get the same test case down to around 15ms, with a more realistic graph (for me anyway, a few hundred nodes/connections) building in well under 1ms. I could have gone even further but there was no point once I was below 1ms.
I see that a few people had been there before and fixed up some stuff, but it really needed a few fundamental changes. One bottleneck was the addition of ordered nodes at the start, which was O(n^2). This can be made O(num_nodes + num_connections) with a topological sort - I used Tarjan's algorithm but there are others.
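A minimal sketch of a linear-time topological sort, not the code from the rewrite: it uses a depth-first search with an explicit stack, which is one common way to get O(num_nodes + num_connections). The `Graph` struct and `topologicalOrder` function are hypothetical names for illustration and are not part of JUCE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal graph: nodes are indices 0..N-1 and inputs[n]
// lists the nodes that feed node n.
struct Graph
{
    std::vector<std::vector<std::size_t>> inputs;
};

// Returns the nodes ordered so that every node appears after all of its
// inputs. Returns an empty vector if the graph contains a cycle.
std::vector<std::size_t> topologicalOrder (const Graph& g)
{
    enum class Mark { unvisited, inProgress, done };

    struct Frame { std::size_t node; std::size_t nextInput; };

    const std::size_t numNodes = g.inputs.size();
    std::vector<Mark> marks (numNodes, Mark::unvisited);
    std::vector<std::size_t> order;
    order.reserve (numNodes);
    std::vector<Frame> stack;   // explicit stack, so deep graphs can't overflow the call stack

    for (std::size_t root = 0; root < numNodes; ++root)
    {
        if (marks[root] != Mark::unvisited)
            continue;

        marks[root] = Mark::inProgress;
        stack.push_back (Frame { root, 0 });

        while (! stack.empty())
        {
            const std::size_t node = stack.back().node;
            const std::size_t next = stack.back().nextInput;

            if (next < g.inputs[node].size())
            {
                ++stack.back().nextInput;
                const std::size_t source = g.inputs[node][next];
                assert (source < numNodes);

                if (marks[source] == Mark::inProgress)
                    return {};   // feedback loop: no valid ordering exists

                if (marks[source] == Mark::unvisited)
                {
                    marks[source] = Mark::inProgress;
                    stack.push_back (Frame { source, 0 });
                }
            }
            else
            {
                // All inputs have been emitted, so this node can be emitted too.
                marks[node] = Mark::done;
                order.push_back (node);
                stack.pop_back();
            }
        }
    }

    return order;
}
```

Each node is pushed once and each connection is examined once, which is where the linear bound comes from.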
The next main bottleneck was isBufferNeededLater() - probably worse than O(n^2). I solved this by making the render op calculator work out how many times a buffer will be needed in future at the point it allocates it - essentially, setting up a refcount. Every time the buffer is 'used' (i.e. fed into the input of a later node), the refcount is decremented. Once it reaches zero, that buffer is put back on the free list.
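The refcount idea can be sketched roughly like this. `BufferPool`, `allocate` and `markUsed` are hypothetical names, and the sketch assumes every allocated buffer is read at least once; an output that nothing reads would need separate handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical buffer index allocator for a render op calculator.
class BufferPool
{
public:
    // Hands out a buffer index that will be read numFutureUses times.
    std::size_t allocate (int numFutureUses)
    {
        assert (numFutureUses > 0);

        std::size_t index;

        if (! freeList.empty())
        {
            index = freeList.back();     // recycle a buffer nobody needs any more
            freeList.pop_back();
        }
        else
        {
            index = refCounts.size();    // no free buffer, so grow the pool
            refCounts.push_back (0);
        }

        refCounts[index] = numFutureUses;
        return index;
    }

    // Call once each time the buffer is fed into a later node's input.
    void markUsed (std::size_t index)
    {
        assert (index < refCounts.size() && refCounts[index] > 0);

        if (--refCounts[index] == 0)
            freeList.push_back (index);  // last reader is done: back on the free list
    }

    std::size_t numBuffers() const { return refCounts.size(); }

private:
    std::vector<int> refCounts;
    std::vector<std::size_t> freeList;
};
```

Both operations are constant time, so the "is this buffer needed later?" question no longer requires scanning the remaining nodes.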
I also found it very useful to keep the nodes/connections in a proper graph (ie. the nodes keep track of their own incoming and outgoing connections), since the current implementation just keeps them in 'global' linear lists. I originally just used binary chop/hashmaps on these global lists, but found it very useful to be able to just get the connections straight from the node (still kept the 'global' lists too though).
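As a rough illustration of keeping per-node connection lists alongside the global list, here is a small sketch. `GraphModel`, `Connection` and the accessor names are hypothetical, and removal of nodes or connections is left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical connection record: source and destination node plus channels.
struct Connection
{
    std::size_t sourceNode;
    int sourceChannel;
    std::size_t destNode;
    int destChannel;
};

class GraphModel
{
public:
    std::size_t addNode()
    {
        nodes.emplace_back();
        return nodes.size() - 1;
    }

    void addConnection (const Connection& c)
    {
        assert (c.sourceNode < nodes.size() && c.destNode < nodes.size());

        const std::size_t index = connections.size();
        connections.push_back (c);                       // the 'global' list is still kept
        nodes[c.sourceNode].outgoing.push_back (index);  // plus per-node lookups
        nodes[c.destNode].incoming.push_back (index);
    }

    // Indices into the global connection list; no searching needed.
    const std::vector<std::size_t>& incomingOf (std::size_t node) const { return nodes[node].incoming; }
    const std::vector<std::size_t>& outgoingOf (std::size_t node) const { return nodes[node].outgoing; }
    const Connection& connection (std::size_t index) const              { return connections[index]; }

private:
    struct Node
    {
        std::vector<std::size_t> incoming;
        std::vector<std::size_t> outgoing;
    };

    std::vector<Node> nodes;
    std::vector<Connection> connections;
};
```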
Run time performance
The original ordered node sort seemed to produce more of a breadth-first ordering. By this, I mean that if you have a summing point with multiple inputs (e.g. 32 mixing desk strips being mixed to 1 final output), it would tend to generate each input in turn and then only sum them at the end, requiring 32 intermediate buffers. I changed this so it would process 1, mix 1, process 1, mix 1 etc., so that it doesn't require nearly as many buffers. I suspect buffer cache hits might be better too, but that's half speculation.
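To show why interleaving helps, here is a toy sketch of the "process 1, mix 1" pattern outside of any graph machinery. `Strip` and `mixInterleaved` are hypothetical names; each strip is modelled as a callback that renders into a buffer. However many strips there are, only one scratch buffer and one mix buffer are live.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Buffer = std::vector<float>;
using Strip  = std::function<void (Buffer&)>;   // renders one strip into the given buffer

// Renders each strip and immediately accumulates it, so the number of
// buffers in use stays at two regardless of the number of strips.
Buffer mixInterleaved (const std::vector<Strip>& strips, std::size_t numSamples)
{
    Buffer mix (numSamples, 0.0f);
    Buffer scratch (numSamples, 0.0f);

    for (const auto& renderStrip : strips)
    {
        assert (renderStrip != nullptr);
        renderStrip (scratch);                       // process 1

        for (std::size_t i = 0; i < numSamples; ++i)
            mix[i] += scratch[i];                    // mix 1
    }

    return mix;
}
```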
A second problem was in a similar area, but to do with latency compensation. A common scenario I had was 32 channels to sum, only one or two of which had effects with non-zero latency. The original implementation would add 30 delays to the 30 or so other channels which didn't have any latency. I changed this so that it would mix these 30 together, then apply one delay to that submix. I made this more general so that it would cascade summing/delay points. For example, 5 inputs with latencies of 0ms, 0ms, 5ms, 5ms and 25ms are now handled like this:
Sum the 0ms inputs
Apply 5ms delay
Sum with the 5ms inputs
Apply 20ms delay
Sum with the 25ms input
Feed to next node's input
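The steps above can be planned with a sort and a single pass. This is only a sketch under assumptions: `CascadeStep` and `planCascade` are hypothetical names, and latencies are plain integers in whatever unit you like (ms here, samples in practice). Inputs are grouped by latency in ascending order, and the running submix is delayed by the difference between consecutive groups before each sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One stage of the cascade: delay the running submix, then sum these inputs in.
struct CascadeStep
{
    int delayBeforeSum;                 // 0 for the first stage
    std::vector<std::size_t> inputs;    // indices of the inputs summed at this stage
};

std::vector<CascadeStep> planCascade (const std::vector<int>& latencies)
{
    // Visit the inputs in order of increasing latency.
    std::vector<std::size_t> order (latencies.size());
    std::iota (order.begin(), order.end(), std::size_t { 0 });
    std::stable_sort (order.begin(), order.end(),
                      [&latencies] (std::size_t a, std::size_t b) { return latencies[a] < latencies[b]; });

    std::vector<CascadeStep> steps;
    int previousLatency = 0;

    for (const std::size_t input : order)
    {
        const int latency = latencies[input];

        if (steps.empty() || latency != previousLatency)
        {
            // New latency group: the submix so far only needs the difference.
            const int delay = steps.empty() ? 0 : latency - previousLatency;
            steps.push_back (CascadeStep { delay, {} });
            previousLatency = latency;
        }

        steps.back().inputs.push_back (input);
    }

    return steps;
}
```

For latencies of 0, 0, 5, 5 and 25 this gives three stages with delays of 0, 5 and 20, so every input ends up delayed by the largest latency minus its own, using two delay lines instead of four.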
Improvements
Multiple MIDI connections. My version of the AudioProcess handles multiple MIDI buffers, which I find useful from time to time. This actually made the render op calculation code simpler, because they could be treated almost identically to the audio buffers.
MIDI connection break. I made my MIDI connections keep track of the MIDI keyboard state. When a connection is removed, it 'injects' the necessary note off messages to the downstream nodes to automatically prevent hanging notes.
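A rough sketch of the per-connection key tracking, working on raw MIDI bytes. `MidiMessage` here is a hypothetical three-byte struct rather than the JUCE class, `ConnectionKeyState` is an illustrative name, and sustain pedal state is ignored for brevity.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical raw channel message: status byte plus two data bytes.
struct MidiMessage
{
    std::uint8_t status, data1, data2;
};

// Tracks which notes are currently held on one MIDI connection.
class ConnectionKeyState
{
public:
    // Call for every message that passes through the connection.
    void process (const MidiMessage& m)
    {
        const int type            = m.status & 0xF0;
        const std::size_t channel = m.status & 0x0F;
        const std::size_t note    = m.data1 & 0x7F;

        if (type == 0x90 && m.data2 > 0)
            held[channel].set (note);
        else if (type == 0x80 || type == 0x90)   // note-on with velocity 0 counts as note-off
            held[channel].reset (note);
    }

    // Call when the connection is removed: the returned messages are what
    // would be injected into the downstream nodes.
    std::vector<MidiMessage> noteOffsForHeldKeys() const
    {
        std::vector<MidiMessage> noteOffs;

        for (std::size_t channel = 0; channel < held.size(); ++channel)
            for (std::size_t note = 0; note < 128; ++note)
                if (held[channel].test (note))
                    noteOffs.push_back (MidiMessage { static_cast<std::uint8_t> (0x80 | channel),
                                                      static_cast<std::uint8_t> (note),
                                                      0 });
        return noteOffs;
    }

private:
    std::array<std::bitset<128>, 16> held {};
};
```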
Latency delay re-allocation. I didn't like the way that every time the graph was rebuilt, it re-allocated and cleared out all the latency delay compensation buffers, resulting in audio nastiness for even trivial changes. I modified the delay buffer allocation so that each buffer has a unique fingerprint based upon the node, input channel index etc. When it rebuilds the graph, it tries to look up a delay buffer with the same fingerprint and continues using it if it can.
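One way the fingerprint lookup could work, as a sketch only. `DelayFingerprint`, `DelayLine` and `DelayLineCache` are hypothetical names, and including the delay length in the fingerprint is an assumption made here (a line of a different length can't simply be reused); the post only says the fingerprint is based on the node, input channel index etc.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical identity of a delay line that should survive graph rebuilds.
struct DelayFingerprint
{
    int nodeId, inputChannel, delaySamples;

    bool operator< (const DelayFingerprint& other) const
    {
        return std::tie (nodeId, inputChannel, delaySamples)
             < std::tie (other.nodeId, other.inputChannel, other.delaySamples);
    }
};

struct DelayLine
{
    std::vector<float> samples;
    std::size_t writePos = 0;
};

class DelayLineCache
{
public:
    // Call at the start of a rebuild: lines from the last build become reuse candidates.
    void beginRebuild()
    {
        previous = std::move (current);
        current.clear();
    }

    // Returns the existing line for this fingerprint if there is one
    // (contents intact), otherwise a freshly cleared line.
    std::shared_ptr<DelayLine> get (const DelayFingerprint& fp)
    {
        assert (fp.delaySamples >= 0);
        auto& slot = current[fp];

        if (slot == nullptr)
        {
            const auto old = previous.find (fp);

            if (old != previous.end())
            {
                slot = old->second;
            }
            else
            {
                slot = std::make_shared<DelayLine>();
                slot->samples.assign (static_cast<std::size_t> (fp.delaySamples), 0.0f);
            }
        }

        return slot;
    }

    // Call at the end of a rebuild: lines nobody asked for are released.
    void endRebuild() { previous.clear(); }

private:
    std::map<DelayFingerprint, std::shared_ptr<DelayLine>> current, previous;
};
```

Using shared ownership means a line that is still referenced by the old render sequence stays alive until that sequence is dropped; how the swap is synchronised with the audio thread is outside the scope of this sketch.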
There was loads more, but this is what I remember at the moment. I'd recommend that the library version gets a full rewrite at some point - I was excited to find it initially, but that kind of turned to disappointment when I realised how slow the graph rebuild was on a phone, and how much logic I'd have to untangle to understand/rewrite it!