
I should start this off by saying I'm a hobbyist and by no means a student. I've been reading "The Audio Programming Book" and attempting to implement an STFT on the STM32-based Daisy platform.

In order to learn from a working sample, I've been using Mutable Instruments' Shy_fft for FFT processing. So far I have a plain FFT → IFFT pass functioning and am beginning to take steps towards an STFT.

I cannot seem to get the STFT working, and I'm struggling to tell whether I've even understood the fundamentals of the processing.

Here is the second implementation I tried; the first version was overwritten by the update.

The first version seemed to "work", but the output was "choppy", as if frames were being dropped. The second version didn't seem to output any audio at all.

EDIT: I should mention that the gist wasn't updated with a functioning audio output. I'll try to update it later and remove this edit.

Any tips/comments on where I’m going wrong/how I should be doing things are greatly appreciated.

EDIT 2: See the updated NaiveSTFT class here.

  • In the first link, in the process function, you check fft_buffer_index and ifft_buffer_index separately; aren't they supposed to be the same given the initialisation? Not that this will solve your problem, but without some information on how you use the class it won't be easy to find where the problem is. Could you provide a minimal working example? – ZaellixA Sep 17 '23 at 11:22
  • I believe that in the second link you provided, at line 102, the fft_.Direct() function should be outside the inner for-loop. If I understood the implementation correctly, it is supposed to act on the whole buffer (which is FFT_SIZE long and offset based on the "hop size" and frame, which seems correct), so it should be called once the windowing process is finished, i.e. outside the inner for-loop at line 91 (this ordering is sketched in the code after the comments). I haven't run the code, so I can't be sure this is an issue, but based on my understanding it looks like a problem to me. – ZaellixA Sep 17 '23 at 11:26
  • You may be right! I didn't think about processing outside of the loop, since I was counting through the frames. I'll give it a go later.

    Also, I'll provide the class in the context of its use as well. I should have thought to do that!

    – Daniel Lawler Sep 17 '23 at 11:31
  • This is the right track, @ZaellixA! I'm now getting audio out again. The issue now is that there seems to be some underlying signal that sounds like a low oscillation on the output, as well as the "choppy" audio quality.

    Here's the updated class and here's the actual implementation

    – Daniel Lawler Sep 17 '23 at 16:02
  • There are various things I'm missing here, but I'll start with those I think are the most important. You are multiplying the output of the IFFT by the window function; are you sure this is needed (implementation for-loop starting on line 124)? I believe you should apply an overlap-save/overlap-add technique here instead of windowing the data again. Windowing here introduces consecutive Hann windows in the time domain, which will most probably sound like very short successive sounds (since you seem to be using a very short frame size). – ZaellixA Sep 17 '23 at 19:22
  • Furthermore, in your Main.cpp file, IN_L and IN_R are not declared anywhere, so I suppose they are some kind of global variables. Nevertheless, since you do get some audio out, I don't believe this is part of the problem. Still, at least for troubleshooting, you should work with only one channel and with some basic signals whose output is known (so that you have a reference to compare against). – ZaellixA Sep 17 '23 at 19:24
  • I'm going to transition to testing a mono signal rather than summing both left and right; good point. As for the windowing on the analysis side, I thought this was required for the FFT/STFT because the frames naturally won't start and finish at zero, which creates a noise between frames, so you window the signal so that each frame begins and ends at 0. – Daniel Lawler Sep 18 '23 at 05:41
  • Windowing is mostly done to mitigate the spreading of energy into neighbouring frequency bins, but in the reconstruction/(re)synthesis phase I believe you should use an overlap-add algorithm (see the sketch after the comments). Consider that there is supposed to be an overlap in the time-domain signal, which I believe you are not handling in the (re)synthesis process (please let me know if I am wrong here). Conceptually, the last ("overlap percentage" $\cdot$ "FFT length") samples are common between many (at least two) frames, and this part should be overlapped in the time domain in the reconstruction too. I think you miss that. – ZaellixA Sep 18 '23 at 10:23
  • As always, thank you. I realized something while reviewing the code: in the for-loops at lines 124 and 135 I'm actually applying the overlap-add twice, incorrectly. There should be one overlap-add on the synthesis window ([pointer_hop_position + j]), but not on the output buffer. – Daniel Lawler Sep 18 '23 at 11:49
  • Did this solve the problem? I suggest you still test against a known solution to make sure your code works correctly. If not, testing this way will still make it easier to spot the problem, as you will most probably know where to look once you know the correct result (simple (co)sinewaves are usually a rather good start). If this was the problem, could you please write an answer to your question and accept it for future reference? – ZaellixA Sep 18 '23 at 13:42
    This solved some of the choppiness in the output, but not the overall signal quality. Regardless, thank you for the assistance. I will absolutely update this question whatever the answer turns out to be.

    I think the current issue is twofold. 1: the math behind the hop and frames does not appear to be functioning properly; if I drop the frame size down to 1 and make the buffer size and FFT size equal, the signal is almost perfect but still has the underlying digital noise. Which leads to 2: I cannot seem to figure out where that noise is coming from. I have updated the gists to reflect this.

    – Daniel Lawler Sep 18 '23 at 17:16
  • Please keep in mind that in order for the analysis and resynthesis process to be "exact" (down to numerical errors) without spectral processing of the signal, of course, you have to satisfy the COLA constraint. For more info see: https://ccrma.stanford.edu/~jos/sasp/COLA_Examples.html, https://ccrma.stanford.edu/~jos/sasp/STFT_COLA_Decomposition.html and https://gauss256.github.io/blog/cola.html. – ZaellixA Sep 18 '23 at 17:52
  • This may be related to your problem, since the Hann window (which you seem to be using, though I can't be sure because, like many people, you say "Hanning", which is not a window name; it is either Hann or Hamming) with 50% overlap does not satisfy the COLA criterion, which is almost guaranteed to degrade your audio quality (a numerical COLA check is sketched after the comments). Depending on what characteristics this "noise" you refer to has, it may also be related to the windowing process. – ZaellixA Sep 18 '23 at 17:53
  • @ZaellixA If I'm understanding what you're saying, it's that since I'm not doing anything to the signal in the frequency domain, I'm not satisfying the COLA criterion, which could lead to the distortion? It makes sense; I figured that doing nothing to the signal would be a good POC, but never figured that I'd need to do something to the signal, if that makes sense. – Daniel Lawler Sep 19 '23 at 04:04
  • @robertbristow-johnson thank you so much for the link! I’m going to read and attempt to implement! – Daniel Lawler Sep 19 '23 at 04:05
  • I was doing some testing and realized that my buffer sizes for the STFT may have been a little too large, especially with how I have been processing the data. Once I dropped them to 256 and 512 respectively, the audio "choppiness" disappeared entirely (possibly a cycle vs sample issue?). Once that was cleaned up, I was able to easily isolate the noise, which turned out to be the windowing. Which means @ZaellixA, you were absolutely correct.

    Can someone give this one a "final once over" to make sure that I'm not losing my mind, and can assume it's being done correctly?

    – Daniel Lawler Sep 19 '23 at 09:25
  • Please keep in mind that the "choppiness" of your audio may still be there but not be audible...!!! If the "choppiness" has been reduced in time it could very well be masked by other (spectral) components of the signal! Usually, decreasing the "buffer size" results in more CPU consumption and decreased robustness (increased "choppiness" and glitches), which is the opposite here. I am still not convinced your algorithm works correctly but I don't know what kind of processing you do. I suggest you drop the processing completely and make sure your STFT (analysis + synthesis) works correctly. – ZaellixA Sep 19 '23 at 10:37
  • I'll try to have a look at the code and if possible run it with a different FFT implementation (assuming they are both equivalent). If I spot any issues I'll let you know. – ZaellixA Sep 19 '23 at 10:39
  • @ZaellixA - when you say to make sure my STFT works, do you have any suggestions? I've removed the FFT functions and have passed it directly to the synthesis window (leaving hop sizes etc in) and the signal stays the same.

    I want to mention that I'm not doubting you - I just wanted to know if you had any suggestions on how to test. Good points on the buffer size / fft size making the chops inaudible due to size.

    I'm actually not doing any processing, as I wanted to get the signal conversion 1:1 before I try anything in the freq/spec realm, so the signal should be unaltered.

    – Daniel Lawler Sep 19 '23 at 10:54
  • This is good; I later realised you don't process the spectrum in any way, so disregard that point. My only suggestion on troubleshooting is to use known signals with known results and to try one thing at a time. You say you have validated your in-out process (without the FFT), which is a good step. What I would suggest is to run your STFT algorithm offline (in a non-real-time environment) to make sure that the algorithm per se is correct. If the offline results are not what you expect (this is to get the input signal at the output unaltered, down to numerical accuracy) then your (cont.) – ZaellixA Sep 19 '23 at 13:27
  • (cont.) algorithm must have an issue. Otherwise, the problem may be related to other parts of your program (such as the audio library or other OS "issues"). This will allow you to pinpoint issues in the algorithm outside the "strict" real-time environment (a minimal offline harness is sketched below). If the algorithm works correctly offline but not online, then the issue may be that "you are too slow" to keep up with the audio callback. In this case, you can run some timing tests on your algorithm and, based on your audio parameters (sample rate, buffer size, etc.), see if being too slow is the problem. – ZaellixA Sep 19 '23 at 13:31
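
To make the points above about per-frame ordering and overlap-add resynthesis concrete, here is a minimal sketch of one analysis/resynthesis frame. All names (fft_, window_, input_ring, output_accum, FFT_SIZE, HOP_SIZE, RING_SIZE) are placeholders rather than the gist's actual members, and fft_.Inverse() is assumed to mirror the fft_.Direct() call that the gist uses.

```cpp
#include <cstddef>

// Sketch only: the structure discussed in the comments, not the gist's code.
static const size_t FFT_SIZE  = 512;
static const size_t HOP_SIZE  = FFT_SIZE / 2;   // 50% overlap
static const size_t RING_SIZE = 4 * FFT_SIZE;

struct StftFrameSketch {
  float  window_[FFT_SIZE];        // analysis window (e.g. periodic Hann)
  float  input_ring[RING_SIZE];    // incoming samples
  float  output_accum[RING_SIZE];  // overlap-add accumulator; zero each sample after it is consumed
  size_t out_write_ = 0;

  template <typename Fft>
  void ProcessFrame(Fft& fft_, size_t frame_start) {
    float time_in[FFT_SIZE], freq[FFT_SIZE], time_out[FFT_SIZE];

    // 1. Window one full frame of input, sample by sample.
    for (size_t j = 0; j < FFT_SIZE; ++j)
      time_in[j] = input_ring[(frame_start + j) % RING_SIZE] * window_[j];

    // 2. One forward and one inverse transform per frame, *outside* the windowing loop.
    fft_.Direct(time_in, freq);
    // ... spectral processing would go here (none for a transparency test) ...
    fft_.Inverse(freq, time_out);  // assumed counterpart of Direct(); rescale here if it is unnormalised

    // 3. Overlap-add: accumulate (+=) into the output, then advance by the hop, not the frame length.
    for (size_t j = 0; j < FFT_SIZE; ++j)
      output_accum[(out_write_ + j) % RING_SIZE] += time_out[j];
    out_write_ = (out_write_ + HOP_SIZE) % RING_SIZE;
  }
};
```

The COLA condition mentioned in the comments can be checked numerically. The sketch below sums shifted copies of a periodic Hann window at an assumed frame size of 512 and hop of 256; if the summed window is flat, the window/hop pair reconstructs to a constant. If a synthesis window is applied as well, the product of the analysis and synthesis windows is what should be tested.

```cpp
// Offline COLA check (a standalone sketch, unrelated to the gist's code).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const int    N = 512, hop = 256;   // assumed frame and hop sizes (50% overlap)
  const double kPi = 3.14159265358979323846;

  std::vector<double> w(N), sum(N, 0.0);
  for (int n = 0; n < N; ++n)
    w[n] = 0.5 * (1.0 - std::cos(2.0 * kPi * n / N));  // periodic Hann analysis window
  // If a synthesis window is also used, test w[n] * w_synth[n] here instead.

  for (int shift = 0; shift < N; shift += hop)         // overlap-add the window with itself
    for (int n = 0; n < N; ++n)
      sum[(n + shift) % N] += w[n];

  const double mn = *std::min_element(sum.begin(), sum.end());
  const double mx = *std::max_element(sum.begin(), sum.end());
  std::printf("summed window ranges from %f to %f\n", mn, mx);  // flat => COLA holds at this hop
  return 0;
}
```

For the offline verification suggested above, a harness along these lines can be used. The NaiveSTFT name, its Init() and Process(in, out, size) interface, and the 1024-sample latency are assumptions standing in for the gist's actual class; the idea is simply to compare the round-trip output against a delayed copy of a known sinewave.

```cpp
// Offline round-trip harness (a sketch): a transparent STFT should return the input delayed by a
// fixed number of samples, down to numerical accuracy.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
// #include "naive_stft.h"   // the class under test, from the gist

int main() {
  const int    kBlock = 48, kBlocks = 2000;
  const double kSampleRate = 48000.0, kFreq = 440.0, kPi = 3.14159265358979323846;
  const int    kLatency = 1024;   // expected analysis+synthesis delay in samples (assumed)

  // NaiveSTFT stft;
  // stft.Init();

  std::vector<float> in(kBlock), out(kBlock, 0.0f), all_in, all_out;
  for (int b = 0; b < kBlocks; ++b) {
    for (int i = 0; i < kBlock; ++i)
      in[i] = static_cast<float>(std::sin(2.0 * kPi * kFreq * (b * kBlock + i) / kSampleRate));
    // stft.Process(in.data(), out.data(), kBlock);   // the call under test
    all_in.insert(all_in.end(), in.begin(), in.end());
    all_out.insert(all_out.end(), out.begin(), out.end());
  }

  double max_err = 0.0;
  for (size_t n = kLatency; n < all_out.size(); ++n)
    max_err = std::max(max_err, static_cast<double>(std::fabs(all_out[n] - all_in[n - kLatency])));
  std::printf("max |out[n] - in[n - latency]| = %g\n", max_err);  // ~0 means the round trip is transparent
  return 0;
}
```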

0 Answers