There are very capable open source ML models out there that can be useful for audio projects. The footprint is reasonably small for offline/standalone apps, even mobile. While searching for a C++ binding for OpenAI Whisper, I came across this one:
Unless I miss something, it should be easy to make a JUCE module for speech recognition based on this. The tensors/ML C++ library under the hood also looks very promising:
Together with a reasonably small open source LLM (also based on ggml), 100% voice-controlled audio apps come within reach of the JUCE community.
What I'm particularly interested in is streaming. Audio input is parsed continuously and a Juce memory stream churns out recognized tokens. There needs to be an audio chunk size of several seconds, because words can't be disambiguated in isolation. Keeping a context window of 10,000 tokens or so should do the trick.
You'd need to run Whisper in a background thread. Use the audio thread to fill up a FIFO with 30 seconds of audio, copy it into the Whisper input buffer, run Whisper in the background thread to do inference. Put the output text in a FIFO and read this from the UI thread. This only works well if Whisper runs faster than real-time, which it should do.
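The threading layout described above could be sketched roughly as follows. This is a minimal illustration in plain standard C++, not a tested JUCE integration: `SampleFifo`, `processChunkIfReady`, and the `transcribe` callback are hypothetical names, and `transcribe` merely stands in for whatever call actually runs Whisper. In a real JUCE project, `juce::AbstractFifo` could take the place of the hand-rolled ring buffer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical single-producer/single-consumer ring buffer of float samples.
// The audio thread only calls push(); the inference thread only calls pop().
class SampleFifo
{
public:
    explicit SampleFifo (std::size_t capacity)
        : buffer (capacity + 1) // one slot stays empty to tell "full" from "empty"
    {
        assert (capacity > 0);
    }

    // Audio thread: copies up to n samples in, returns how many were written.
    // No locks and no allocation, so it is safe to call from the audio callback.
    std::size_t push (const float* src, std::size_t n)
    {
        const auto size = buffer.size();
        const auto w = writePos.load (std::memory_order_relaxed);
        const auto r = readPos.load (std::memory_order_acquire);
        const auto freeSpace = (r + size - w - 1) % size;
        const auto count = n < freeSpace ? n : freeSpace;

        for (std::size_t i = 0; i < count; ++i)
            buffer[(w + i) % size] = src[i];

        writePos.store ((w + count) % size, std::memory_order_release);
        return count;
    }

    // Number of samples currently waiting to be read.
    std::size_t available() const
    {
        const auto size = buffer.size();
        const auto w = writePos.load (std::memory_order_acquire);
        const auto r = readPos.load (std::memory_order_relaxed);
        return (w + size - r) % size;
    }

    // Inference thread: copies up to n samples out, returns how many were read.
    std::size_t pop (float* dst, std::size_t n)
    {
        const auto size = buffer.size();
        const auto r = readPos.load (std::memory_order_relaxed);
        const auto w = writePos.load (std::memory_order_acquire);
        const auto avail = (w + size - r) % size;
        const auto count = n < avail ? n : avail;

        for (std::size_t i = 0; i < count; ++i)
            dst[i] = buffer[(r + i) % size];

        readPos.store ((r + count) % size, std::memory_order_release);
        return count;
    }

private:
    std::vector<float> buffer;
    std::atomic<std::size_t> writePos { 0 };
    std::atomic<std::size_t> readPos { 0 };
};

// Inference thread: once a full chunk is buffered, pop it and pass it to
// `transcribe` (a stand-in for the Whisper call). Returns true if inference ran.
// `textOut` is a plain vector here; in an app it would be a thread-safe queue
// that the UI thread drains.
bool processChunkIfReady (SampleFifo& fifo,
                          std::size_t chunkSamples,
                          const std::function<std::string (const std::vector<float>&)>& transcribe,
                          std::vector<std::string>& textOut)
{
    if (fifo.available() < chunkSamples)
        return false;

    std::vector<float> chunk (chunkSamples);
    fifo.pop (chunk.data(), chunkSamples);
    textOut.push_back (transcribe (chunk));
    return true;
}
```

The background thread would call `processChunkIfReady` in a loop (sleeping briefly when it returns false), while the audio callback only ever calls `push`. If inference falls behind real-time, `push` starts returning fewer samples than requested, which is the point at which audio gets dropped.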
I've used whisper.cpp, it's pretty easy to get it up and running. My idea was to automatically detect words like um, uh, and automatically remove them. However, those tokens automatically get removed. Anybody have an idea how to keep them in?
Thanks. Overlapping 30s chunks this way sounds like a great plan. You can do so much more with Juce and C++ than with Python (not to say it wouldn't be possible, just not this straightforward and robust).
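For what it's worth, the overlapping-chunk idea might look something like the sketch below. The names `ChunkRange` and `overlappingChunks` are made up for illustration, and the amount of overlap is an assumption, since nothing in this thread fixes a value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one chunk: where it starts and how long it is,
// both measured in samples.
struct ChunkRange
{
    std::size_t start;
    std::size_t length;
};

// Splits a recording of totalSamples into chunks of chunkSamples, where each
// chunk shares its first overlapSamples with the end of the previous one.
// The final chunk may be shorter than chunkSamples.
std::vector<ChunkRange> overlappingChunks (std::size_t totalSamples,
                                           std::size_t chunkSamples,
                                           std::size_t overlapSamples)
{
    assert (chunkSamples > 0);
    assert (overlapSamples < chunkSamples); // otherwise the window never advances

    const std::size_t hop = chunkSamples - overlapSamples;
    std::vector<ChunkRange> chunks;

    for (std::size_t start = 0; start < totalSamples; start += hop)
    {
        const std::size_t remaining = totalSamples - start;
        const std::size_t length = remaining < chunkSamples ? remaining : chunkSamples;
        chunks.push_back ({ start, length });

        if (start + length >= totalSamples)
            break; // this chunk already reaches the end of the audio
    }

    return chunks;
}
```

Whisper expects 16 kHz mono input, so a 30 s chunk is 480,000 samples; with an assumed 5 s overlap (80,000 samples) the window advances by 400,000 samples each step. The text from the overlapping region then has to be de-duplicated when the transcripts are stitched together, which this sketch does not attempt.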
The LLM that comes behind the voice recognition is the harder part. Training hundreds of functions and parameters for your app is a daunting task. Not sure if that is even possible with current open source (offline) models.
I'm interested in this area. Regarding the "30-second chunks of audio", I think this is where we should be looking for "real-time" implementation of whisper:
Hi, RolandMR, were you able to use whisper.cpp along with Juce? I want to create an app similar to yours, a standalone app (or maybe a plug-in) that helps to remove irrelevant words from recorded audio (specifically for podcasts) for a specific client. Is it too complex?
This is a VST3/ARA plugin built using JUCE 8, and it uses a WebView for its user interface. Right now it mainly targets REAPER, but I have been able to run it in Cubase as well.