OpenAI Whisper C++ and JUCE

There are very capable open source ML models out there that can be useful for audio projects, and the footprint is reasonably small for offline/standalone apps, even mobile. While searching for a C++ binding for OpenAI Whisper, I came across whisper.cpp.

Unless I'm missing something, it should be easy to make a JUCE module for speech recognition based on this. The tensor/ML C++ library under the hood, ggml, also looks very promising.

Together with a reasonably small open source LLM (also based on ggml), 100% voice-controlled audio apps come within reach of the JUCE community.

Has anyone tried this already?

What I’m particularly interested in is streaming: audio input is parsed continuously and a JUCE memory stream churns out recognized tokens. The audio chunk size needs to be several seconds, because words can’t be disambiguated in isolation. Keeping a context window of 10,000 tokens or so should do the trick.
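To make the idea concrete, here is a minimal sketch of what such a module's interface might look like. The class and method names (SpeechRecognizer, TokenListener, pushAudio) are purely illustrative, not an existing JUCE or whisper.cpp API:

```cpp
// Purely illustrative interface for a streaming recognizer module.
#include <juce_core/juce_core.h>

struct TokenListener
{
    virtual ~TokenListener() = default;

    // Called whenever the recognizer has produced another word/token.
    virtual void tokenRecognized (const juce::String& token,
                                  double startSeconds, double endSeconds) = 0;
};

struct SpeechRecognizer
{
    virtual ~SpeechRecognizer() = default;

    // Audio thread: append mono samples (Whisper expects 16 kHz input).
    virtual void pushAudio (const float* samples, int numSamples) = 0;

    virtual void addListener (TokenListener* listener) = 0;
};
```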

Whisper works on 30-second chunks of audio. To do this in a real-time context you’d end up with quite a bit of latency.

In "streaming mode", the technique that’s typically used is to have a sliding window of partially overlapping 30-second chunks, and then you filter out the words from the overlapping parts. See also: Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers
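As a rough illustration of the chunking part (not the word filtering, which is the harder bit), here is a sketch that slices a mono recording into overlapping windows. The 30 s / 5 s figures and the function name are assumptions made for the example, not values mandated by Whisper:

```cpp
// Slice a long mono recording into 30 s windows that overlap by 5 s so that
// words at chunk boundaries are not cut in half.
#include <algorithm>
#include <cstddef>
#include <vector>

struct AudioWindow { const float* data; int numSamples; double startSeconds; };

std::vector<AudioWindow> makeOverlappingWindows (const std::vector<float>& audio,
                                                 int sampleRate     = 16000,
                                                 double windowSecs  = 30.0,
                                                 double overlapSecs = 5.0)
{
    const auto windowLen = (size_t) (windowSecs * sampleRate);
    const auto hopLen    = (size_t) ((windowSecs - overlapSecs) * sampleRate);

    std::vector<AudioWindow> windows;

    for (size_t start = 0; start < audio.size(); start += hopLen)
    {
        const auto len = std::min (windowLen, audio.size() - start);
        windows.push_back ({ audio.data() + start, (int) len, (double) start / sampleRate });

        if (start + windowLen >= audio.size())   // last (possibly shorter) window emitted
            break;
    }

    return windows;
}
```

Each window is then transcribed independently, and the transcriptions of the overlap regions are reconciled afterwards (e.g. by matching words or timestamps), which is the part the linked article describes.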

You’d need to run Whisper in a background thread: use the audio thread to fill a FIFO with 30 seconds of audio, copy it into the Whisper input buffer, and run inference on the background thread. Put the output text in a FIFO and read it from the UI thread. This only works well if Whisper runs faster than real-time, which it should.
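A minimal sketch of that layout, assuming whisper.cpp is on the include path and the input is already mono 16 kHz. The whisper_* calls follow whisper.h in recent versions of whisper.cpp, so check against the version you build; real code would also forward the text into a second FIFO for the UI rather than just logging it:

```cpp
#include <algorithm>
#include <vector>
#include <juce_audio_basics/juce_audio_basics.h>
#include <juce_core/juce_core.h>
#include "whisper.h"

class WhisperWorker : public juce::Thread
{
public:
    explicit WhisperWorker (const char* modelPath)
        : juce::Thread ("whisper"), fifo (bufferSize), buffer (1, bufferSize)
    {
        ctx = whisper_init_from_file_with_params (modelPath, whisper_context_default_params());
        startThread();
    }

    ~WhisperWorker() override
    {
        stopThread (2000);
        if (ctx != nullptr)
            whisper_free (ctx);
    }

    // Audio thread: push mono samples already resampled to 16 kHz.
    void pushAudio (const float* samples, int numSamples)
    {
        int start1, size1, start2, size2;
        fifo.prepareToWrite (numSamples, start1, size1, start2, size2);

        if (size1 > 0) buffer.copyFrom (0, start1, samples, size1);
        if (size2 > 0) buffer.copyFrom (0, start2, samples + size1, size2);

        fifo.finishedWrite (size1 + size2);
    }

private:
    void run() override
    {
        std::vector<float> chunk (chunkSize);

        while (! threadShouldExit())
        {
            if (fifo.getNumReady() < chunkSize) { wait (100); continue; }

            // Copy 30 seconds of audio out of the FIFO into a contiguous buffer.
            int start1, size1, start2, size2;
            fifo.prepareToRead (chunkSize, start1, size1, start2, size2);

            if (size1 > 0) std::copy_n (buffer.getReadPointer (0, start1), size1, chunk.data());
            if (size2 > 0) std::copy_n (buffer.getReadPointer (0, start2), size2, chunk.data() + size1);

            fifo.finishedRead (size1 + size2);

            // Run inference on this background thread.
            auto params = whisper_full_default_params (WHISPER_SAMPLING_GREEDY);

            if (whisper_full (ctx, params, chunk.data(), chunkSize) == 0)
                for (int i = 0; i < whisper_full_n_segments (ctx); ++i)
                    DBG (whisper_full_get_segment_text (ctx, i)); // forward to a text FIFO in a real app
        }
    }

    static constexpr int sampleRate = 16000;           // Whisper expects 16 kHz mono
    static constexpr int chunkSize  = 30 * sampleRate; // 30 seconds of audio
    static constexpr int bufferSize = 2 * chunkSize;

    juce::AbstractFifo fifo;
    juce::AudioBuffer<float> buffer;
    whisper_context* ctx = nullptr;
};
```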

You might want to take a look at ARA to avoid latency problems.

I’ve used whisper.cpp; it’s pretty easy to get up and running. My idea was to automatically detect filler words like "um" and "uh" and remove them, but those tokens get dropped from the transcription automatically. Anybody have an idea how to keep them in?

Thanks. Overlapping 30-second chunks this way sounds like a great plan. You can do so much more with JUCE and C++ than with Python (not to say it wouldn’t be possible, just not this straightforward and robust).

The LLM that sits behind the voice recognition is the harder part. Training it on hundreds of functions and parameters for your app is a daunting task. Not sure that’s even possible with current open source (offline) models.

If the recognized tokens come with timestamps, you can map them back to the source audio and identify the gaps between them as fillers to cut out.
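A rough sketch of that idea, assuming the audio has already been run through whisper_full(). Segment timestamps come from whisper_full_get_segment_t0/t1, which report 10 ms units in current whisper.cpp (worth double-checking in whisper.h), and the 0.5 s threshold is an arbitrary choice for the example:

```cpp
// Flag the silences between recognized segments as candidate cut regions.
#include <vector>
#include "whisper.h"

struct CutRegion { double startSeconds, endSeconds; };

std::vector<CutRegion> findGaps (whisper_context* ctx, double minGapSeconds = 0.5)
{
    std::vector<CutRegion> gaps;
    const int numSegments = whisper_full_n_segments (ctx);

    for (int i = 1; i < numSegments; ++i)
    {
        const double previousEnd  = whisper_full_get_segment_t1 (ctx, i - 1) * 0.01; // 10 ms -> s
        const double currentStart = whisper_full_get_segment_t0 (ctx, i)     * 0.01;

        if (currentStart - previousEnd >= minGapSeconds)
            gaps.push_back ({ previousEnd, currentStart });
    }

    return gaps;
}
```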

I’m interested in this area. Regarding the '30-second chunks of audio', I think this is where we should be looking for a 'real-time' implementation of Whisper:

whisper.cpp/examples/stream (real-time audio input example)

Hi RolandMR, were you able to use whisper.cpp along with JUCE? I want to create an app similar to yours for a specific client: a standalone app (or maybe a plug-in) that helps remove irrelevant words from recorded audio, specifically for podcasts. Is it very complex?

Hi, I’d like to share an open source project I’ve been working on which integrates with whisper.cpp: Introducing ReaSpeech Lite - Tech Audio

This is a VST3/ARA plugin built using JUCE 8, and it uses a WebView for its user interface. Right now it mainly targets REAPER, but I have been able to run it in Cubase as well.

Please let me know if you have any questions!
