Heap memory recycling allocator


#1

In multi-threaded applications, all threads use the same critical section for locking the heap memory, which can be a bottleneck and a cause of spikes/clicks/pops/buffer-repeats during audio playback.
This allocator recycles memory blocks deallocated by “delete” in its own per-thread cache.
It overloads the global new and delete operators (which is IMHO resolved by the linker, as part of the C++ standard), so no header file is required.
This is a proof of concept; the caches won’t be cleaned up at the end, so it WILL LEAK AS HELL (but if an application is about to close anyway, this should make no difference).
Of course, while the allocator is enabled you won’t find your real memory leaks, but since it’s a wrapper around all other new/deletes it could be used to create your own leak detector.

It would be cool to have something like an “OwningThreadLocalPointer”, which calls a destructor when a thread gets closed.

Proof of concept, only tested with MS VS2008

To enable it please define USE_CACHINGALLOCATER

Every memory block is classified into a size category.
Every size category/thread uses its own cache.
Size categories:
0 = 0…2 bytes
1 = 3…4 bytes
2 = 5…8 bytes
3 = 9…16 bytes
4 = 17…32 bytes
5 = 33…64 bytes
6 = 65…128 bytes
7 = 129…256 bytes
8 = 257…512 bytes
9 = 513…1024 bytes, and so on…
-1 = > 2<<CK_PREALLOCATOR_RECYCLE_MEMORYBLOCKS_UP_TO_SIZE_CATEGORY

Bigger memory blocks are simply allocated with malloc/free.

File:
https://github.com/jchkn/ckjucetools/blob/master/CachingAllocaterMultiThreaded/CachingAllocaterMultiThreaded.cpp


#2

Interesting and fun to play with, but in my humble experience, I’ve found that

a) Whatever great idea you think you’ve had about improving malloc, someone else has had it before, and the default implementation probably already does it better than anything you could ever write.

b) When you actually hit a performance hot-spot where malloc really is causing trouble, then blaming malloc is the wrong reaction. Instead you’ll get much better results by blaming the code that calls malloc too often, and improving your algorithms at a higher and more localised level.


#3

Thanks for your opinion.

a) The problem isn’t that malloc isn’t fast; the problem is that it shares the same lock across all threads! (At least here in VS2008.)
If thread A allocates 128 MB and thread B only 16 bytes, thread B has to wait.

b) The question is why we get performance hot-spots where malloc causes trouble: mostly because of the lock!
All these techniques we use to avoid malloc exist because of this single-lock attitude.
And sometimes the complexity gets way out of hand: it can be painful to implement a “malloc”-free design, and it won’t look more elegant.


#4

Yeah, I understood your post, but what I meant is that I’d be very surprised if many/most of the standard malloc implementations don’t already use thread-local tricks internally. It’s hardly a new idea!

I’d be interested to see your benchmarks, but I just think you’re being optimistic about how much improvement you can get!


#5

The Visual Studio runtimes (all versions) definitely use just a single global critical section, and not thread-local tricks.


#6

You should have a look at Google’s tcmalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html

dlmalloc http://g.oswego.edu/dl/html/malloc.html
and TLSF http://www.gii.upv.es/tlsf/ are also worth looking at


#7

Thanks. Did you use any of them successfully on different platforms (Win/Mac)?
I ask because I had a quick look at tcmalloc and found it a little exorbitant and confusing for what I need; it also seems to have only rudimentary Windows support.
I can’t think why my solution shouldn’t be as quick as the fastest allocators (assuming the thread-local cache is filled).


#8

Well, the interesting question is what happens when one thread allocates a piece of memory and another thread tries to delete it later?


#9

Nothing; it will be cached in the freeing thread’s own cache, and if that cache is full it will be released by the C++ runtime’s own free function.
Every block of memory stores its size category in its header, so when the thread cache releases it, it is completely independent.

If an application heavily frees memory blocks allocated by other threads, this may result in a little more use of the original C++ malloc/free functions. You can see this as a downside if you want.
If I have a little more time, I will add some more statistical information, so it shows the “practical” usage of the heap lock.
I will also write a test benchmark app which does all the “bad things” (like new() on the audio thread, performance intrusion through critical sections) and measures the audio drop-outs; we’ll see if the caching allocator makes any difference.
Today’s machines are so damn fast it will be hard to measure a difference. If you have already chosen a heap-lock-free design you shouldn’t see any benefit.


#10

[quote=“chkn”]Thanks. Did you use any of them successfully on different platforms (Win/Mac)?
I ask because I had a quick look at tcmalloc and found it a little exorbitant and confusing for what I need; it also seems to have only rudimentary Windows support.
I can’t think why my solution shouldn’t be as quick as the fastest allocators (assuming the thread-local cache is filled).[/quote]

I didn’t use tcmalloc or dlmalloc in real projects, but we’re using TLSF as a real-time memory allocator for Lua.