Here’s the real-time PSOLA implementation I’m currently working on:
The idea here is that there’s one “PsolaAnalyzer” object that can communicate with an unlimited number of “PsolaShifter” objects, with the goal of lowering the CPU load, since analysis is the heavy part of the process. This repo is still a work in progress; it should give you some pitch-shifted sound, but there are definitely still some issues that I’m attempting to iron out. If anyone wants to cast an eye over this and help me identify & fix issues, that would be amazing…
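Roughly, the intended shape is something like this (a simplified skeleton for illustration — the method names are my shorthand for the design, not the exact interface in the repo):

```cpp
// Simplified skeleton of the analyzer/shifter split (not the actual repo API).
// The point: the expensive work happens once, and any number of shifters reuse it.
struct PsolaAnalyzer
{
    // Heavy: pitch detection + pitch-mark / grain extraction, once per block.
    void analyzeInput (const float* input, int numSamples) { /* ... */ }

    // Shared analysis state (pitch estimate, pitch marks, grains) lives here.
};

struct PsolaShifter
{
    explicit PsolaShifter (PsolaAnalyzer& a) : analyzer (a) {}

    // Cheap: OLA resynthesis at this shifter's own target pitch ratio,
    // reading the shared analysis data instead of redoing the detection.
    void renderShifted (float* output, int numSamples, float pitchRatio) { /* ... */ }

    PsolaAnalyzer& analyzer; // many shifters can share one analyzer
};
```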
With this kind of algorithm, the latency is generally determined by the pitch detection required for the PSOLA’s analysis phase. With a time-domain approach like ASDF or AMDF, a good rule of thumb is that the latency will be 2 * the maximum possible detectable period (i.e., twice the period of the lowest detectable frequency). And even if you use some sort of FFT trickery that can detect pitch from fewer samples, the resynthesis OLA process will still need analysis grains that are at least a period long, so I think it would be hard to get around requiring at least 1 or 2 input periods’ worth of latency (meaning, of course, that the plugin’s overall latency must be 1 or 2 maximum possible input periods’ worth of samples).
But even if you’re calibrating for a bass voice: say your lowest detectable input frequency is D2 (MIDI note 38), which is 73.42 Hz. At a sampling rate of 44.1 kHz, a single period of that frequency is ~601 samples. So if the latency is double that, we get 1202 samples = ~27.26 milliseconds. Not terrible, but definitely perceivable.
Obviously, with this paradigm the latency decreases as the lowest note of the input range gets higher. So for a soprano, if your lowest possible note is as high as C4 (MIDI note 60, 261.63 Hz), the max period would be ~169 samples, making the latency 338 samples = ~7.66 ms!
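To sanity-check the arithmetic, here’s the same calculation as a tiny standalone helper (nothing repo-specific, just the numbers from above):

```cpp
#include <cmath>
#include <cstdio>

// Latency = 2 * the period of the lowest detectable frequency, in samples.
int latencySamples (double lowestDetectableHz, double sampleRate)
{
    return (int) std::ceil (2.0 * sampleRate / lowestDetectableHz);
}

int main()
{
    const double sr = 44100.0;
    const int bass    = latencySamples (73.42, sr);  // D2 -> 1202 samples
    const int soprano = latencySamples (261.63, sr); // C4 -> 338 samples

    std::printf ("bass:    %d samples = %.2f ms\n", bass,    1000.0 * bass / sr);
    std::printf ("soprano: %d samples = %.2f ms\n", soprano, 1000.0 * soprano / sr);
    // prints ~27.26 ms and ~7.66 ms respectively
}
```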
Here’s an example of an ASDF-based pitch detector. It’s thoroughly tested and should be decently accurate; the latency is as described above, 2 * the max input period. Let me know if it misbehaves for you.
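For anyone curious, the heart of the ASDF approach is small enough to show inline. This is just a brute-force sketch of the idea (no parabolic interpolation, no voicing/confidence checks — not the tested detector itself), but it shows exactly why the buffer needs 2 * the max period’s worth of samples:

```cpp
#include <limits>

// Bare-bones ASDF period detector (sketch only). Expects the buffer to hold
// at least 2 * maxPeriod samples -- which is exactly the latency requirement
// discussed above. Returns the winning candidate period in samples.
int detectPeriodASDF (const float* x, int minPeriod, int maxPeriod)
{
    int bestPeriod = minPeriod;
    float bestAsdf = std::numeric_limits<float>::max();

    for (int tau = minPeriod; tau <= maxPeriod; ++tau)
    {
        // Average squared difference between the signal and itself delayed by tau
        float sum = 0.0f;
        for (int n = 0; n < maxPeriod; ++n) // reads up to x[2 * maxPeriod - 1]
        {
            const float diff = x[n] - x[n + tau];
            sum += diff * diff;
        }
        const float asdf = sum / (float) maxPeriod;

        // The best lag is the one where the signal most resembles itself
        if (asdf < bestAsdf)
        {
            bestAsdf = asdf;
            bestPeriod = tau;
        }
    }

    return bestPeriod; // detected frequency = sampleRate / bestPeriod
}
```

A real detector would refine the raw minimum with parabolic interpolation and add some confidence thresholding on top of this loop, but that’s the core of it.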
Low latency is always hard to get right, especially for live performance. There’s often a trade-off between latency and sound quality; for pitch shifting in particular, lower latency often means more distorted formants. The lowest-latency option would be a classic vocoder technique, which imprints the input signal’s formants onto a synthesized carrier signal that’s already at the desired pitch, instead of doing any actual pitch resynthesis. Those can be essentially one sample in, one sample out, with no granularity required (I think).
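For what it’s worth, the per-band core of that vocoder idea is tiny. One band looks roughly like this (a sketch with made-up coefficient values; a real channel vocoder runs a bank of bandpass-filtered bands through this in parallel):

```cpp
#include <cmath>

// One band of a classic channel vocoder, sketching the "one sample in, one
// sample out" idea: the input's amplitude envelope in each band is imprinted
// onto a carrier that's already at the target pitch. A real vocoder runs
// ~8-32 of these in parallel over a bandpass filter bank; the coefficient
// values here are illustrative, not tuned.
struct VocoderBand
{
    float envelope     = 0.0f;
    float attackCoeff  = 0.99f;  // smoothing while the envelope rises (assumed value)
    float releaseCoeff = 0.999f; // smoothing while it falls (assumed value)

    float processSample (float modulatorBand, float carrierBand)
    {
        const float level = std::abs (modulatorBand);
        const float coeff = level > envelope ? attackCoeff : releaseCoeff;
        envelope = coeff * envelope + (1.0f - coeff) * level; // one-pole follower

        return carrierBand * envelope; // imprint the voice's envelope on the carrier
    }
};
```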
I’ve found that on the pitch-shifting front, PSOLA seems to be a good middle ground between latency and quality. I know less about timestretching, but AFAIK the same basic principles apply.
Hope this helps.