Real-time AI audio generation VST3 — Architecture feedback welcome

Hey,

I’ve built a VST3 plugin that integrates Stable Audio Open for real-time AI music generation and would love some feedback on the architecture.

What it does:

  • User inputs text prompts

  • An LLM turns each prompt into optimized generation parameters

  • Stable Audio Open generates audio (~10s latency)

  • VST handles playback, MIDI triggering (C3-B3), and tempo sync

  • 8-track sampler with page switching (A/B/C/D per track)

Technical stack:

  • Frontend: JUCE framework (VST3)

  • Backend: Python server (FastAPI) handling AI inference

  • Audio processing: Real-time time-stretching to match host tempo

  • Communication: REST API between VST and inference server
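
Concretely, the VST → server exchange is just an HTTP POST from a worker thread. A minimal sketch of the JUCE side (the endpoint path, port, and JSON field names here are placeholders, not the actual API):

```cpp
// Blocking HTTP POST to the local inference server. This only ever runs on a
// background thread (e.g. a juce::ThreadPool job), never in processBlock().
// "/generate", the port, and the field names are placeholders.
static juce::var requestGeneration (const juce::String& prompt, double hostBpm)
{
    juce::DynamicObject::Ptr payload = new juce::DynamicObject();
    payload->setProperty ("prompt", prompt);
    payload->setProperty ("bpm", hostBpm);

    const auto body = juce::JSON::toString (juce::var (payload.get()));
    const auto url  = juce::URL ("http://127.0.0.1:8000/generate").withPOSTData (body);

    auto stream = url.createInputStream (
        juce::URL::InputStreamOptions (juce::URL::ParameterHandling::inPostData)
            .withExtraHeaders ("Content-Type: application/json")
            .withConnectionTimeoutMs (15000));

    if (stream == nullptr)
        return {};   // server unreachable or timed out

    // The server answers with JSON, e.g. a path/URL to the rendered audio
    return juce::JSON::parse (stream->readEntireStreamAsString());
}
```

In the plugin this call sits behind the asynchronous generation queue described under "Current approach" below, and the result goes into the local sample cache.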

Architecture challenges I’m tackling:

  1. Latency management: Generation takes ~10s — how to handle this UX-wise?

  2. Audio buffer handling: Getting generated samples into the real-time playback path without blocking the audio thread

  3. Tempo sync: Stretching AI-generated audio to match host BPM without artifacts

  4. MIDI integration: Mapping C3-B3 to trigger samples reliably
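
On point 4, the mapping itself is straightforward; the sketch below shows the shape of the processBlock() handling, assuming the C3 = MIDI note 48 convention (some hosts label note 60 as C3, so the offset stays configurable). The class and triggerSlot() names are placeholders:

```cpp
// Map note-ons in the C3-B3 range onto sampler slots, sample-accurately.
void AiDjProcessor::handleMidi (const juce::MidiBuffer& midiMessages)
{
    constexpr int firstNote = 48;   // C3 (note-naming convention assumed above)
    constexpr int lastNote  = 59;   // B3

    for (const auto metadata : midiMessages)
    {
        const auto msg = metadata.getMessage();

        if (! msg.isNoteOn())
            continue;

        const int note = msg.getNoteNumber();

        if (note >= firstNote && note <= lastNote)
            triggerSlot (note - firstNote,           // slot index 0..11
                         metadata.samplePosition);   // offset within the current block
    }
}
```

Reliability then comes down to challenge 2: the triggered sample has to be fully resident before the note-on arrives, so playback never waits on the generation path.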

Current approach:

  • Asynchronous generation queue

  • Background threads for API calls

  • Local caching of generated samples

  • Simple time-domain stretching (looking into phase vocoder)
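
By "simple time-domain stretching" I mean something in the spirit of plain overlap-add (OLA); a simplified offline/mono sketch, not the actual plugin code. The missing waveform/phase alignment is exactly where the artifacts come from, hence the phase-vocoder investigation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Plain OLA time-stretch: pitch-preserving, but overlapping grains are not
// aligned, so they can interfere audibly. stretch > 1.0 lengthens the audio.
std::vector<float> olaStretch (const std::vector<float>& in, double stretch, int frameSize = 2048)
{
    const double pi = 3.14159265358979323846;
    const int synthesisHop = frameSize / 2;                                           // fixed output hop
    const int analysisHop  = std::max (1, (int) std::round (synthesisHop / stretch)); // input read hop

    std::vector<float> out ((size_t) std::ceil (in.size() * stretch) + (size_t) frameSize, 0.0f);
    std::vector<float> norm (out.size(), 1.0e-9f);      // summed window, for normalisation

    std::vector<float> window (frameSize);              // Hann window applied to every grain
    for (int i = 0; i < frameSize; ++i)
        window[i] = 0.5f * (1.0f - (float) std::cos (2.0 * pi * i / (frameSize - 1)));

    for (int frame = 0; ; ++frame)
    {
        const int inPos  = frame * analysisHop;         // where we read in the source
        const int outPos = frame * synthesisHop;        // where we write in the output

        if (inPos + frameSize >= (int) in.size() || outPos + frameSize >= (int) out.size())
            break;

        for (int i = 0; i < frameSize; ++i)             // overlap-add the windowed grain
        {
            out[outPos + i]  += window[i] * in[inPos + i];
            norm[outPos + i] += window[i];
        }
    }

    for (size_t i = 0; i < out.size(); ++i)
        out[i] /= norm[i];                              // undo the overlap gain

    out.resize ((size_t) (in.size() * stretch));
    return out;
}
```

The stretch factor is just sourceBpm / hostBpm, so a 100 BPM loop in a 120 BPM session gets shortened by a factor of 100/120.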

Questions for the community:

  • Anyone working on similar plugin-server architectures?

  • Best practices for handling long-running async operations in JUCE?

  • Recommendations for high-quality time-stretching libraries?

GitHub: https://github.com/innermost47/ai-dj ("The sampler that dreams" - an AI-powered VST3 for real-time music generation)
License: AGPL v3.0 (open source)

All code is public. Happy to discuss implementation details or architecture decisions.

Thanks!