Lowest-latency real-time pitch detection

My user sings into the mic (an iPhone/MacBook mic, or a headset mic, I suppose). I need to extract the f0 of the note they are singing.

I need to keep it ultra-low latency.

What options are available?

I will attempt to maintain a summary by editing this initial post upon replies.

So far I’m aware of:

Related:

I think it depends on how low the lowest frequency you want to detect is, and how much noise you are prepared to tolerate in the pitch data…?

Not sure what the state of the art is though!

I think E2 is the lowest I’d expect anyone to sing.
Users will be singing into their mobile phones or a laptop mic (at worst).
I’m tempted to run with creating a Karplus-Strong ring resonator for each note, as I’ve implemented it before. But if I can find something that works cleanly out of the box, I’d much rather go with that.

The important thing is that when the user sings, they will get audio feedback. And if the latency is too high, it’s going to make the brain glitch. So I want to optimize on buffer size; I think that’s the bottleneck.

I think the theoretical lowest required time to detect pitch is two periods, so roughly 20 ms for 100 Hz and 2 ms for 1000 Hz. A low-latency pitch detection that comes close to this is part of my autotune package; unfortunately, it is not free software.

If you link to your product, this will improve this thread as a resource.

Google the Goertzel algorithm. It is also used for resonance detection in, for instance, car components. It has nice real-time properties, hence its use in embedded systems.
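A single Goertzel bin is only a few lines. Here’s a rough sketch (purely illustrative names, not from any particular library); you’d run one instance per frequency of interest:

    #include <cmath>
    #include <vector>

    // Measures the power at one target frequency over a block of samples.
    double goertzelPower (const std::vector<float>& block, double targetHz, double sampleRate)
    {
        const double pi    = 3.141592653589793;
        const double w     = 2.0 * pi * targetHz / sampleRate;
        const double coeff = 2.0 * std::cos (w);

        double s1 = 0.0, s2 = 0.0;                  // the two previous filter outputs
        for (float x : block)
        {
            const double s0 = x + coeff * s1 - s2;  // second-order resonator update
            s2 = s1;
            s1 = s0;
        }
        // Squared magnitude at the target frequency, without forming the complex output.
        return s1 * s1 + s2 * s2 - coeff * s1 * s2;
    }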

I’ve edited the OP to make my use-case more clear.

I’ve looked into various approaches:

  • FFT (it’s possible to get exact frequencies by finding locally peaking bins and using their rate of rotation – e.g. if your bin is @100Hz and you sing @101Hz, the bin makes one revolution every second; see the sketch after this list).
  • Goertzel filters (not sure how useful these are for harmonic tones)
  • Karplus-Strong ring resonator for each note.
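For the first point, here’s a rough sketch of the bin-rotation refinement (a naive single-bin DFT for clarity; in practice you’d reuse the FFT frames you already have, and all the names are illustrative):

    #include <cmath>
    #include <complex>

    // DFT of bin k over one frame of length N.
    std::complex<double> dftBin (const float* frame, int N, int k)
    {
        const double pi = 3.141592653589793;
        std::complex<double> sum = 0.0;
        for (int n = 0; n < N; ++n)
            sum += (double) frame[n] * std::polar (1.0, -2.0 * pi * k * n / N);
        return sum;
    }

    // Refine the frequency of bin k from its phase advance between two frames a
    // hop apart (x must hold at least N + hop samples).
    double refinedBinFrequency (const float* x, int N, int hop, int k, double sampleRate)
    {
        const double pi     = 3.141592653589793;
        const double phase1 = std::arg (dftBin (x,       N, k));
        const double phase2 = std::arg (dftBin (x + hop, N, k));

        // Phase advance a tone exactly at the bin centre would have produced.
        const double expected = 2.0 * pi * k * hop / N;

        // Wrap the deviation into [-pi, pi).
        double dev = (phase2 - phase1) - expected;
        dev -= 2.0 * pi * std::floor ((dev + pi) / (2.0 * pi));

        // Deviation in radians per hop -> frequency offset in Hz.
        return k * sampleRate / N + dev * sampleRate / (2.0 * pi * hop);
    }

With the example above: a tone 1 Hz above the bin centre accumulates one extra revolution (2π) per second, which is exactly what the deviation term picks up. The catch is that the hop has to be short enough that the deviation stays within ±π, otherwise it aliases.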

My favourite is the third. However, it’s tricky to implement exactly, as the number of samples in the ring would almost always be non-integer. It’s fudgable.
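For the record, here’s one literal reading of that idea as a sketch: a single resonator tuned to one note, with the fractional ring length “fudged” via linear interpolation. It’s closer to a tuned comb filter than textbook Karplus-Strong, and all the names are made up:

    #include <cmath>
    #include <vector>

    // One ring resonator tuned to a single note. Feed it the mic signal; a ring
    // whose length matches the sung period builds up energy, the others don't.
    struct RingResonator
    {
        std::vector<float> ring;
        double frac = 0.0;        // fractional part of the period, in samples
        int    writePos = 0;
        float  feedback = 0.98f;  // < 1 keeps the loop stable

        RingResonator (double noteHz, double sampleRate)
        {
            const double period = sampleRate / noteHz;   // non-integer in general
            const int    whole  = (int) std::floor (period);
            frac = period - whole;
            ring.assign (whole + 1, 0.0f);               // one extra slot for interpolation
        }

        // Push one input sample, return the resonator output.
        float process (float input)
        {
            const int size = (int) ring.size();
            const int oldA = writePos;                   // delay = whole + 1 samples
            const int oldB = (writePos + 1) % size;      // delay = whole samples

            // Linear interpolation between the two delays -> effective delay = whole + frac.
            const float delayed = (float) frac * ring[oldA] + (1.0f - (float) frac) * ring[oldB];

            const float out = input + feedback * delayed;
            ring[writePos] = out;
            writePos = (writePos + 1) % size;
            return out;
        }
    };

The detector would then be a bank of these (one per semitone from E2 upwards), all fed the same input, comparing short-term output energy and picking the loudest. I have no idea how robust that is against harmonics also exciting the resonators an octave or a twelfth below; that’s probably where the tuning work goes.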

1 Like

Hi,

My two cents.

  • FFT is generally a bad candidate for getting a precise f0, exactly because of what you describe: you’d need quite a lot of extra processing to refine the picked peak.
  • As far as I understand, Goertzel filters are more appropriate for detecting the presence, and measuring the amplitude, of specific sines you already suspect are there.
  • You have several “autocorrelation-related” methods like YIN, as implemented in the GitHub repo you mention in your OP. A couple of years ago, for pitch detection in a commercial plugin, I implemented the method described in this thesis (see chapter 4), which proposes some improvements over plain autocorrelation methods. I did not properly benchmark it, but I was quite satisfied overall with the quality of the detection on voice, and it’s relatively straightforward to understand and implement, which is a plus. There’s also a good state-of-the-art survey of pitch detection methods in the thesis (though it was published in 2008, so there might be newer fancy stuff). A bare-bones sketch of the plain YIN core follows below, for reference.
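This is only the textbook YIN idea (difference function, cumulative mean normalisation, fixed threshold), not the thesis variant, and it skips the parabolic-interpolation refinement; names are illustrative:

    #include <cmath>
    #include <vector>

    // Bare-bones YIN-style f0 estimate. Returns 0 if nothing periodic is found.
    double yinEstimateF0 (const std::vector<float>& x, double sampleRate,
                          double threshold = 0.15)
    {
        const int n      = (int) x.size();
        const int maxLag = n / 2;
        std::vector<double> d (maxLag, 0.0), dNorm (maxLag, 1.0);

        // Difference function d(tau) = sum_j (x[j] - x[j + tau])^2
        for (int tau = 1; tau < maxLag; ++tau)
            for (int j = 0; j < maxLag; ++j)
            {
                const double diff = (double) x[j] - x[j + tau];
                d[tau] += diff * diff;
            }

        // Cumulative mean normalised difference d'(tau)
        double runningSum = 0.0;
        for (int tau = 1; tau < maxLag; ++tau)
        {
            runningSum += d[tau];
            dNorm[tau] = runningSum > 0.0 ? d[tau] * tau / runningSum : 1.0;
        }

        // First dip below the threshold wins; walk down to its local minimum.
        for (int tau = 2; tau < maxLag; ++tau)
            if (dNorm[tau] < threshold)
            {
                while (tau + 1 < maxLag && dNorm[tau + 1] < dNorm[tau])
                    ++tau;
                return sampleRate / tau;
            }

        return 0.0;
    }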

Hope this helps.

The Pure Data object [helmholtz~] implements that last method, if I understand correctly. It’s indeed very good and Pure Data makes it very easy to test it out. The source code is available here: https://www.katjaas.nl/helmholtz/helmholtz.html

In case anyone is still looking, it is hard to beat the Kalman filter:

Thank you Dr Smith! I tried the AU project, but the f0 estimation looks strange to me. Playing an E2 guitar string should give me an f0 around 82 Hz, but here I get around 500 Hz. Is there something I’m missing that needs to be adjusted?

Thanks,
Luca

The code has bugs:

  1. The nUpdate variable defaults to 4. This influences the sample rate used for the pitch detection, but that is not factored in! So you need to add this compensation:

    void EKFPitch::prepare (float fs, int bs, int nUpdate)
    {
        sampleRate = fs / (float) nUpdate;
        // ... rest of the original prepare() body ...
    }

  2. The transient/silence detector is just too simple. It is not the fancy one described in the paper, just a simple power measurement over the last buffer.

  3. The initial pitch estimate is crude and also not as fancy as described in the paper.

I left out the initial pitch detector and fixed the nUpdate bug, and then it works correctly.
ekf.resetCovarianceMatrix() needs to be called every time the results go out of whack. You can use the silence detector for that, but maybe just triggering on bad f0 results is more efficient.

EDIT:
Although it looked promising at first, the whole thing is way too unstable and finicky.
The cool stuff from the paper is not implemented, and the tracking is very unreliable.

McLeod / MPM outperforms YIN a bit, especially if you need low latency and can’t do much post-processing. The paper is called “A smarter way to find pitch”, IIRC. I’ve had acceptable results incorporating it in a semi-zero-latency WSOLA pitch shifter.

Also, I haven’t heard of the Kalman approach, thanks for the tip!

McLeod isn’t incredibly different from YIN; they both use autocorrelation and mainly differ in their post-processing (“peak-picking”) techniques. Both of them can be accelerated by calculating the autocorrelation using an FFT.
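In case it’s useful to anyone, this is roughly what that FFT shortcut looks like (Wiener-Khinchin: the ACF is the inverse FFT of the power spectrum). The toy recursive FFT is only there so the sketch is self-contained; in a real project you’d use whatever FFT routine you already have:

    #include <cmath>
    #include <complex>
    #include <vector>

    using cplx = std::complex<double>;

    // Toy radix-2 FFT (size must be a power of two; inverse is not normalised here).
    void fft (std::vector<cplx>& a, bool inverse)
    {
        const std::size_t n = a.size();
        if (n <= 1) return;

        std::vector<cplx> even (n / 2), odd (n / 2);
        for (std::size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
        fft (even, inverse);
        fft (odd, inverse);

        const double pi   = 3.141592653589793;
        const double sign = inverse ? 1.0 : -1.0;
        for (std::size_t k = 0; k < n / 2; ++k)
        {
            const cplx t = std::polar (1.0, sign * 2.0 * pi * (double) k / (double) n) * odd[k];
            a[k]         = even[k] + t;
            a[k + n / 2] = even[k] - t;
        }
    }

    // ACF(lag) = IFFT(|FFT(x)|^2); zero-padding to >= 2N keeps the result linear
    // (non-circular) for all lags up to N - 1.
    std::vector<double> autocorrelate (const std::vector<float>& x)
    {
        std::size_t n = 1;
        while (n < 2 * x.size()) n <<= 1;

        std::vector<cplx> buf (n, 0.0);
        for (std::size_t i = 0; i < x.size(); ++i) buf[i] = x[i];

        fft (buf, false);
        for (auto& c : buf) c *= std::conj (c);          // power spectrum
        fft (buf, true);

        std::vector<double> acf (x.size());
        for (std::size_t lag = 0; lag < x.size(); ++lag)
            acf[lag] = buf[lag].real() / (double) n;     // normalise the inverse here
        return acf;
    }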

Wouldn’t moving the autocorrelation to an FFT increase the overall latency?

The minimum achievable latency depends on the lowest frequency you want to be able to detect. The buffer you’re analysing needs to contain at least one full period of that frequency. For example, 441 samples for 100Hz at a sampling rate of 44100Hz.
Because of that, it doesn’t really matter how you calculate the ACF. FFT is the efficient way to do it, and simple parabola fitting will provide decent resolution for the peaks in the ACF (which are the candidates for picking the “right” fundamental).
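The parabola fit itself is tiny; something like this (a sketch, assuming the peak lag is not the first or last element of the ACF):

    // Refine an integer ACF peak to a sub-sample lag by fitting a parabola
    // through the peak and its two neighbours. Then f0 = sampleRate / refinedLag.
    double refinePeakLag (const double* acf, int peakLag)
    {
        const double yl = acf[peakLag - 1];
        const double y0 = acf[peakLag];
        const double yr = acf[peakLag + 1];

        const double denom = yl - 2.0 * y0 + yr;
        if (denom == 0.0)
            return (double) peakLag;                     // flat top, nothing to refine

        const double offset = 0.5 * (yl - yr) / denom;   // vertex offset in [-0.5, 0.5]
        return peakLag + offset;
    }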

I haven’t tried any approaches that are not based around autocorrelation, but from my understanding it’s impossible to reduce latency below the wavelength of the lowest fundamental because the information is just not there. Listen to a 1024 sample snippet of audio and try to figure out the pitch by ear - given how tricky that is, autocorrelation seems a pretty decent method to build upon.

This is too pessimistic; the minimum latency for pitch detection is about twice the true period length of the note actually being sung, not the period of the lowest detectable frequency.

Isn’t figuring out the true period length the challenge here? You can make assumptions about how long that might be, in the sense that a vocal probably won’t go below 100Hz, and determine your analysis frame size that way.

I don’t see how “one period length” is pessimistic though. You correctly mentioned that in practice, you’ll typically need twice the true period length, which is worse.
So, as a theoretical lower bound for latency, “one period length” holds – wouldn’t you agree?

A single period is not very periodic; I am unable to detect periodicity from a single period.

Ha, yes, that’s correct and it obviously makes sense - two periods are required. Sorry for the confusion. Turns out my post was too optimistic then :wink: