Is there an existing algorithm that takes noisy midi (generated from audio) and tries to figure out what the intended notes are? If so, what is it called? Below is a visual representation of what I’d like the algorithm to do.
You can’t generalize this very well to suit all of your midi data input. You can see that there is a pattern here of some kind, and the best option would be to create an AI model that takes midi input like the image above and gives the filtered midi output you want. Midi notes are just integers, so you shouldn’t have any trouble feeding them to a model.
Thanks for the quick reply. I suspected AI would be mentioned. I’ve done some AI work (the Udacity AI Programming with Python Nanodegree and some Kaggle competitions), but AI is a rabbit hole I’d hoped not to go down in this project. Any techniques or resources you recommend?
Is the audio-to-midi translation algorithm under your control? Maybe it’s a good idea to try to extract better midi data before trying to fix the bad midi data?
Some possible basic methods:
- Filter out short notes (minimum note length threshold)
- Filter out notes outside a pre-defined scale (if known), or force them to the closest note in the scale
- Concatenate consecutive identical notes into a single note (some threshold could be defined, e.g. 1/64 time)
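The three passes above could be sketched roughly like this, assuming a hypothetical note representation of `(midi_pitch, start, length)` tuples in beats; the thresholds and the scale set are just placeholder values:

```python
MIN_LENGTH = 1 / 16        # drop notes shorter than this (in beats)
GAP_THRESHOLD = 1 / 64     # merge identical notes separated by less than this
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the allowed scale

def snap_to_scale(pitch, scale=C_MAJOR):
    """Force a midi pitch to the closest pitch class in the scale (ties go down)."""
    for offset in range(12):
        for candidate in (pitch - offset, pitch + offset):
            if candidate % 12 in scale:
                return candidate
    return pitch

def clean(notes):
    # 1. Filter out short notes.
    notes = [n for n in notes if n[2] >= MIN_LENGTH]
    # 2. Snap out-of-scale pitches to the nearest scale tone.
    notes = [(snap_to_scale(p), s, d) for (p, s, d) in notes]
    # 3. Concatenate consecutive identical notes into one.
    merged = []
    for p, s, d in sorted(notes, key=lambda n: n[1]):
        if merged and merged[-1][0] == p and s - (merged[-1][1] + merged[-1][2]) < GAP_THRESHOLD:
            lp, ls, ld = merged[-1]
            merged[-1] = (lp, ls, s + d - ls)  # extend previous note to cover this one
        else:
            merged.append((p, s, d))
    return merged
```

Snapping to the nearest scale tone (rather than deleting out-of-scale notes) preserves the melody when the pitch tracker lands a semitone off.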
You have two options:
- Hard-code conditions that decide whether a note is kept or removed, like what @JussiNeuralDSP recommended
- Improve the whole scenario by improving the algorithm you use to get the midi in the first place, which is guaranteed to give a more globally optimal solution, i.e. the one you should actually care about
If fixing the midi notes isn’t worth using AI for, why not try doing the whole thing as an AI model that takes audio and gives midi? That would give better results in my opinion. Your only challenge would be figuring out the correct form in which to feed the audio buffer to the model. No issue for the midi, as I said: plain integers. Needs work, but lots of potential.
Yes. It is midi generated from singing (humming). I can allow the user to preset keys/scales to snap to, which helps. But humans seem to “find” the note they want after a little searching. Vibrato and other inconsistencies are issues, too. It is quite complicated, I know, but I’m giving it a shot.
The data in isn’t “bad”, the problem is that I have to interpret what the singer meant to sing.
My current idea is to require that the user start each new note with a “sharp” consonant (Ba, Pa, Da). Then I will use voice recognition techniques to detect consonants in the audio data. The start of a consonant will be the start of a new note. Then I will use the most common frequency (as detected by pYIN) after the consonant to determine the midi note. End of a note is determined when volume stays below a certain threshold for a certain length of time, or a new consonant is detected.
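The “most common frequency after the consonant” step could look something like this. It is only a sketch: it assumes the consonant detector has already segmented the audio, and that the pitch tracker (e.g. pYIN) returns one frequency per frame, with `None` for unvoiced frames:

```python
import math
from collections import Counter

def hz_to_midi(freq):
    """Convert a frequency in Hz to the nearest midi note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq / 440.0))

def most_common_note(frame_freqs):
    """Pick the note for one segment (between two consonant onsets):
    the midi pitch the pitch tracker reported most often.
    Unvoiced frames arrive as None and are ignored."""
    notes = Counter(hz_to_midi(f) for f in frame_freqs if f is not None)
    return notes.most_common(1)[0][0] if notes else None
```

Taking the mode of the per-frame pitches (instead of the mean) should make the result robust to the “searching” at the start of a note and to vibrato, since those frames land on neighbouring pitches and get outvoted.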
I had hoped to not restrict the way the user sings, but this is my best idea at this point. Does this seem like a good solution, or should I look in another direction?