Real-Time Audio Classification/Feature Extraction: A talk for next year’s conference

Hi JUCE folk.

I wonder if people would be kind enough to chime in on a thought I’m having.

I’m currently in the last stages of my MSc and have been working on my final project: a JUCE plugin that performs real-time feature extraction and audio classification on short-term buffers.

I’ve written a plugin which allows you to “beatbox” into a microphone. It classifies each sound in real time as a class/label (previously trained by the user) and then fires off the corresponding drum synth/MIDI note. So if you “beatbox” a kick sound into the mic, a kick drum synth will be triggered; the same goes for snare and hi-hat (only three classes for the moment). This pretty much happens without any audible latency, which is the goal of my project.

This has got me thinking. I watched Sean’s talk on feature extraction (awesome, by the way), but the aims of that code are different to mine. As far as I understand it, a ring buffer is used to feed audio data to the feature extraction routines, which are called from the GUI/message thread. The actual extraction algorithms look like they use std::vectors etc., which would mean memory allocations on the audio thread if called from the process block; I guess they’re deliberately not focused on being non-blocking. LibXtract also performs memory allocations in some of its functions, such as its DCT, so it would need some modifications to be used in my project.
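
To make the "no allocations in the process block" point concrete, here is a minimal sketch of the kind of routine I mean. It is not code from my plugin, Gist or LibXtract; the names `FrameFeatures` and `extractFeatures` are made up for illustration, and RMS and zero-crossing rate are just two cheap features picked as examples. The point is only that it works on a caller-owned buffer and never touches the heap.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical names for illustration only (not the Gist or LibXtract API).
struct FrameFeatures
{
    float rms = 0.0f;
    float zeroCrossingRate = 0.0f;
};

// Computes two time-domain features over a buffer owned by the caller.
// No heap allocation, locks or system calls, so it can run inside processBlock.
inline FrameFeatures extractFeatures (const float* samples, std::size_t numSamples) noexcept
{
    FrameFeatures f;

    if (samples == nullptr || numSamples == 0)
        return f;

    float sumSquares = 0.0f;
    std::size_t crossings = 0;

    for (std::size_t i = 0; i < numSamples; ++i)
    {
        sumSquares += samples[i] * samples[i];

        // Count a crossing whenever the sign differs from the previous sample.
        if (i > 0 && (samples[i] >= 0.0f) != (samples[i - 1] >= 0.0f))
            ++crossings;
    }

    f.rms = std::sqrt (sumSquares / static_cast<float> (numSamples));
    f.zeroCrossingRate = static_cast<float> (crossings) / static_cast<float> (numSamples);
    return f;
}
```

Inside a plugin this would be called with the channel's read pointer and the block size. Spectral features need an FFT on top, with all of its working memory set up before the audio starts.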

I’m actually using a modified version of the Gist library in my feature extraction stage, and I’ve recently been chatting with its creator about possibly altering its implementation slightly, along the lines of my fork: basically removing the use of std::vectors as containers and making it more strictly “real-time friendly”. I initially used Gist because there was only so much I could do in the time allotted to me for my project, and making changes to Gist was far easier than implementing all the excellent algorithms from scratch!

My project has focused on extracting features and classifying them on the audio thread in a non-blocking manner, which minimises the latency the user hears when a musical event is triggered in response to the classification.
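
As a rough illustration of per-buffer classification without allocations, here is a sketch using a nearest-centroid classifier. This is a stand-in chosen because it is short, not the classifier my plugin uses, and all the names (`FeatureVector`, `DrumClass`, `NearestCentroidClassifier`) and the feature count are assumptions for the example. Training would fill in the centroids off the audio thread; `classify` only reads fixed-size arrays.

```cpp
#include <array>
#include <cstddef>
#include <limits>

// All names and sizes here are hypothetical, chosen for the example.
constexpr std::size_t numFeatures = 2;   // assumed feature count
constexpr std::size_t numClasses  = 3;   // kick, snare, hi-hat

using FeatureVector = std::array<float, numFeatures>;

enum class DrumClass { kick = 0, snare = 1, hiHat = 2 };

struct NearestCentroidClassifier
{
    // One centroid per class, written during training (off the audio thread).
    std::array<FeatureVector, numClasses> centroids {};

    // Returns the class whose centroid is closest in squared Euclidean distance.
    // Fixed-size storage only, so nothing is allocated here.
    DrumClass classify (const FeatureVector& x) const noexcept
    {
        std::size_t best = 0;
        float bestDistance = std::numeric_limits<float>::max();

        for (std::size_t c = 0; c < numClasses; ++c)
        {
            float distance = 0.0f;

            for (std::size_t i = 0; i < numFeatures; ++i)
            {
                const float diff = x[i] - centroids[c][i];
                distance += diff * diff;
            }

            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = c;
            }
        }

        return static_cast<DrumClass> (best);
    }
};
```

In the process block the returned class would then be mapped to a MIDI note or drum synth trigger. Updating the centroids while the audio thread is reading them needs its own thread-safe hand-over, which this sketch leaves out.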

…It all seems to work. I’m reasonably sure things are “real-time” correct and intend to make certain of this by the time I attempt releasing the code publicly. (At the moment there’s the odd ugly bit of code rushed for the sake of academic deadlines, 6 months only!)

I’m tied up until the new year (2017) finishing off the plugin and writing my final MSc report (it’s been a beast alongside full-time work), but the code that deals with extraction and classification in real time is in its own lib/module.

I’m wondering whether the JUCE community thinks a module/library like this would be of use, and whether a talk about its development/capabilities would be of interest at an ADC?

I intend to spend time after the new year cleaning up the version 1.0 code base and keeping its dependencies to a minimum. For example, the code currently uses Armadillo (libarmadillo) for matrix representation in the classification stage. Armadillo uses expression templates (like many linear algebra libs) and thus avoids temporaries. This has meant it is basically possible to avoid any memory allocations when doing matrix operations on a real-time callback thread, provided the memory for the matrix objects is carefully pre-allocated.

I’m thinking of swapping from Armadillo to Eigen, though, as Eigen is deliberately designed to be header-only, whereas Armadillo blurs the lines between which functions do and do not require the BLAS and LAPACK libs etc.

Eigen might mean one less setup/dependency issue for users wanting to use this code (currently dubbed the AudioClassify library). I just need to try a version using Eigen and step through it to make sure no mallocs get called; that depends on how nifty the expression template stuff is. So far Armadillo is pretty damn impressive. I’m keen to make this library JUCE-based, so I have a few updates to make, such as using the JUCE FFT rather than Kiss FFT (which is there because of Gist) etc.

I’m aware of Rapidmix and similar libraries which are experimenting with interactive machine learning, but I think they are more concerned with human gestures for parameter control etc. I don’t think there is specific code in the Rapidmix stuff for real-time classification of audio buffers yet (i.e. getting a classification response per buffer in the processBlock), but I may well be wrong and reinventing a better-built wheel. I’m wondering if my idea could be of use to any of these circles and the JUCE community in general. I know ROLI has a vested interest in Rapidmix etc.

If anyone could chime in / give me some ideas on this, I’d be hugely grateful. Apologies for the giant post.

As always, thanks a lot, guys. And HUGE thanks to JUCE for the ADC talks going up so fast. Christmas came early.



P.S. Still looking to break into the C++ world job-wise, so I’m hoping this whole project and a possible talk might help give me a bit of credibility later next year :wink: JavaScript just isn’t my love in life.


HA. Damn.

Looks like I ought to attend even if I might only just qualify for “intermediate” status…

Is anyone able to shed any light on this new JUCE module at all?

Will the module handle classifying audio buffers/frames in real time?

i.e. extracting features on buffers and classifying a buffer as “snare”, “dog bark” etc.

Or is the focus more on gesture recognition for parameter control etc.?

Cheers JUCE team.