
I need to summarize a set of samples representing an audio signal in order to achieve a smooth zoom animation for a spectrogram in my app.

In the time domain (waveform) I achieve this by re-summarizing the samples each time the zoom changes. My current summarization goes like this: I know the audio's sampling frequency and I have a target summarized-sample count that depends on the zoom level, so that there are $$\text{hop} = \frac{\#\{\text{audio samples}\}}{\#\{\text{target samples}\}} = \frac{\#\{\text{audio samples}\}}{\text{screenWidthInPt}\cdot\text{zoom}}$$ samples of original audio per summarized sample. The summarized sample $i$ is then obtained by averaging the original samples over a window of $w$ samples, which I set to $w = 128$, so that:

$$\text{Summarized}[i] = \frac{1}{w}\sum\limits_{k=i\cdot\text{hop}}^{i\cdot\text{hop}+w-1}\text{samples}[k],\qquad \text{for all } i \text{ such that } i\cdot\text{hop}+w \le \#\{\text{audio samples}\}$$
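
For concreteness, here is a minimal Swift sketch of this summarization (names are illustrative, and `hop` is treated as an integer for simplicity, even though the formula above generally yields a fractional value):

```swift
/// One output value per `hop` input samples; each is the mean of the
/// `w`-sample window starting at i * hop. Illustrative sketch only.
func summarize(_ samples: [Float], hop: Int, windowWidth w: Int = 128) -> [Float] {
    guard hop > 0, w > 0 else { return [] }
    var out: [Float] = []
    var start = 0
    while start + w <= samples.count {   // i * hop + w <= #{audio samples}
        var sum: Float = 0
        for k in start ..< start + w { sum += samples[k] }
        out.append(sum / Float(w))       // average over the window
        start += hop
    }
    return out
}
```

With the formula above, calling `summarize(samples, hop: samples.count / targetCount)` produces roughly `targetCount` points.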

Now, I was wondering what operation this corresponds to in the frequency domain. If this were simply sampling without averaging over that window, it would be multiplication by a Dirac comb of period $T = \text{hop}$ in the time domain, and therefore convolution with a Dirac comb of period $\frac{1}{T} = \frac{1}{\text{hop}}$ in the frequency domain (less expensive than summarizing in the time domain and recomputing the DFT from scratch). But how does the averaging in the time domain affect the frequency domain? In other words, what does summarizing in the time domain correspond to in the frequency domain? (I'm a bit rusty, as it's been years since I took my signals theory class.)
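
For reference, the comb transform pair I'm relying on here (if I recall it correctly) is:

$$\sum_{n=-\infty}^{\infty}\delta(t-nT)\;\longleftrightarrow\;\frac{1}{T}\sum_{k=-\infty}^{\infty}\delta\!\left(f-\frac{k}{T}\right)$$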

Baffo rasta
  • I've never heard the word "summarize" before. It appears you're performing a kind of "moving average" of blocks of time samples (where a block comprises the samples within a time "window"). To answer your question we would need to know if the blocks of samples overlap in the time domain. – Richard Lyons Mar 12 '23 at 14:13
  • @RichardLyons: I should have specified that the windowing functions I apply in the time domain definitely overlap and are rectangular (so no smoothing is performed) – Baffo rasta Mar 12 '23 at 15:00
  • Hop is STFT convolution stride or subsampling, which is folding in frequency. It's not entirely clear to me if that's what's happening here though. – OverLordGoldDragon Mar 12 '23 at 15:00
  • If "summarizing" means "making it look good on a graph" than your approach seems really odd to me. Taking the mean (or sum) removes all the relevant detail, I would expect you want to keep the peaks and dips. But I'm not exactly sure what the goal iof "summarizing" is. – Hilmar Mar 12 '23 at 16:03
  • @Hilmar: The screen has a limited number of points I can use to display a plot line. Say the user zooms out: that of course means they want to see fewer details of the waveform, and that's why I'm downsampling. The goal in this case is to let the user decide what level of detail they want to see, with a smooth zoom animation (pinch gesture on iOS).

    EDIT: Of course I'm keeping the original audio samples in memory and the downsampling isn't done in place, but on a different buffer.

    – Baffo rasta Mar 12 '23 at 16:07
  • @Bafforasta I understand that. In that case, your algorithm is really not great. Imagine some samples like [-1 +1 -1 +1 -1 +1 ...] and you have one pixel to display them. Taking the mean would make it 0. It's better to keep the max and min instead (use two pixels over twice the buffer length; see the sketch after this thread) – Hilmar Mar 13 '23 at 03:27
  • @Hilmar: Averaging over nearby samples should be somewhat equivalent to an average blur filter for an image (right?). Say I'd follow your advice, then what should I do with the min and max for each window? How do I get a waveform out of it that doesn't look like a collection of spikes? Sorry for my ignorance :P – Baffo rasta Mar 13 '23 at 08:21
  • @Bafforasta: It's NOT the same as a blur filter. The main difference here is that audio is bipolar and the mean value is zero (which is NOT what you want). Images are unipolar and the mean is the average luminosity over the area (which IS what you want). You don't get spikes, you get the envelope. The details are too long for a comment, so maybe you should ask a separate question about that. – Hilmar Mar 13 '23 at 21:22
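
For what it's worth, a minimal Swift sketch of the min/max envelope reduction Hilmar describes (the names and the `bucketSize` parameter are illustrative assumptions, not Hilmar's code):

```swift
/// Min/max reduction for waveform display: for each bucket of input
/// samples, keep the extremes and draw a vertical segment between them.
struct Extremes { let lo: Float; let hi: Float }

func minMaxSummary(_ samples: [Float], bucketSize: Int) -> [Extremes] {
    guard bucketSize > 0 else { return [] }
    return stride(from: 0, to: samples.count, by: bucketSize).map { start in
        let bucket = samples[start ..< min(start + bucketSize, samples.count)]
        return Extremes(lo: bucket.min() ?? 0, hi: bucket.max() ?? 0)
    }
}
```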

1 Answer


In either domain, a good (visually appealing with minimal information loss) way to smoothly zoom into a signal is to use a sinc-like (windowed sinc of some width) interpolation kernel for the downsampling. A sinc interpolation kernel in either domain acts as a smoother, summarizing local information. In the time domain, a sinc interpolator of the proper width acts as a low-pass filter suitable for anti-aliasing.
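
For illustration, a minimal Swift sketch of such a windowed-sinc downsampler (the Blackman window and the 32-tap half-width are arbitrary choices for the sketch, not part of this answer's prescription):

```swift
import Foundation

/// Anti-aliased downsampling with a Blackman-windowed sinc kernel:
/// each output is the input low-pass filtered at the post-decimation
/// Nyquist rate, evaluated every `hop` input samples.
func sincDownsample(_ x: [Float], hop: Double, halfWidth: Int = 32) -> [Float] {
    guard hop > 0, !x.isEmpty else { return [] }
    let cutoff = min(1.0, 1.0 / hop)   // normalized cutoff (1.0 = input Nyquist)
    var out: [Float] = []
    var center = 0.0
    while center < Double(x.count) {
        var acc = 0.0, norm = 0.0
        let lo = max(0, Int(center) - halfWidth)
        let hi = min(x.count - 1, Int(center) + halfWidth)
        for n in lo...hi {
            let t = (Double(n) - center) * cutoff
            let sinc = (t == 0) ? 1.0 : sin(Double.pi * t) / (Double.pi * t)
            let u = (Double(n) - center) / Double(halfWidth + 1)   // |u| < 1
            let blackman = 0.42 + 0.5 * cos(Double.pi * u) + 0.08 * cos(2 * Double.pi * u)
            let k = sinc * blackman
            acc += Double(x[n]) * k
            norm += k                  // normalize so DC gain stays 1
        }
        out.append(Float(norm != 0 ? acc / norm : 0))
        center += hop
    }
    return out
}
```

Because `center` advances by a fractional `hop`, the same routine serves any zoom level, which is what makes the animation smooth.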

hotpaw2
  • So to see if I understood correctly what you mean: I should pick a normalized sinc multiplied by one of the standard window functions (say Blackman–Harris) and then apply it globally in the sense of the Whittaker–Shannon interpolation formula; say that $x_w[n]$ is the result of such an operation. Now I pick a sample from $x_w$ every $\text{hop}$ samples of original audio. Am I correct? If so, since I'd apply this to sounds captured by a microphone (which could be anything), how can I choose an appropriate width for the window? – Baffo rasta Mar 13 '23 at 10:18
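
In symbols, one reading of the procedure this comment describes (my notation, not the commenter's): interpolate with a Blackman–Harris-windowed sinc $h$, then resample every $\text{hop}$ samples,

$$x_w[t] = \sum_{n} x[n]\,h(t-n),\qquad h(t)=\operatorname{sinc}\!\left(\frac{t}{\text{hop}}\right)w_{\text{BH}}(t),\qquad \text{Summarized}[i]=x_w[i\cdot\text{hop}],$$

where the sinc is scaled so its cutoff sits at the post-decimation Nyquist rate.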