
Suppose there is a video sequence and there are some frames for which the audio data is missing. I want to interpolate the missing audio data on the basis of the correlation of the audio signal with the pixels of the frames. Is there any existing research on how to do this effectively?

I was thinking of using the Wiener-Hopf equations, but I can't determine if there is any scalar parameter of the video which is correlated with the audio.
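To make the idea concrete, here is a rough sketch of the Wiener-Hopf approach in Python. Everything in it is a stand-in assumption: a synthetic scalar "video feature" (imagine, say, mean frame brightness) plays the role of whatever video parameter might correlate with the audio, and the filter order is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one scalar video feature per frame (e.g. mean frame
# brightness), partially correlated with the audio by construction.
n = 1000
video_feature = rng.standard_normal(n)
audio = 0.8 * video_feature + 0.2 * rng.standard_normal(n)

p = 4  # FIR filter order (how many past feature samples to use)

# Sample correlation E[x[t] * y[t+lag]] estimated from the data.
def xcorr(x, y, lag):
    return np.dot(x[: n - lag], y[lag:]) / (n - lag)

# Wiener-Hopf normal equations R w = r: R is the (Toeplitz) autocorrelation
# matrix of the feature, r the feature/audio cross-correlation vector.
R = np.array([[xcorr(video_feature, video_feature, abs(i - j))
               for j in range(p)] for i in range(p)])
r = np.array([xcorr(video_feature, audio, k) for k in range(p)])

w = np.linalg.solve(R, r)  # Wiener filter coefficients

# Estimate the audio from the video feature with the learned FIR filter;
# in the real problem this would only be applied over the missing frames.
audio_hat = np.convolve(video_feature, w)[:n]
```

On this synthetic data the estimate tracks the audio closely, but only because the correlation was built in; the open question remains which real video scalar (if any) carries such a correlation.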

Some insight on this would help.

Curiosity
  • What kind of audio data are we talking about? This is relevant, since most audio codecs are some variant of predictor based on temporal differences, so audio codecs usually deal rather gracefully with missing compressed data, simply by repeating what they did with the last chunk. So, you'll need to define your problem a little better to make us understand how the video aspect comes into play, and what you're actually trying to solve. – Marcus Müller Nov 15 '18 at 20:29
  • Well, the audio data in a video where the source of sound (a speaker) is present in the video frames themselves. – Curiosity Nov 15 '18 at 22:00
  • I mean, I am trying to work with normal mp4 videos, and extract the audio signal separately. – Curiosity Nov 15 '18 at 22:00
  • but MP4 is a digital format and doesn't contain "audio" but a "compressed digital audio signal", so is your outage in the a) source audio or b) in the compressed audio stream? (please answer a) or b), I'm really confused.) – Marcus Müller Nov 15 '18 at 22:14
  • Well, I used MATLAB to separately extract the audio information using the audioread command. – Curiosity Nov 15 '18 at 22:17
  • I'm new to this domain, but I kind of learnt that the audioread command in MATLAB can be used to explicitly extract the audio signal. Then again, the audio may be in compressed form, like you said. – Curiosity Nov 15 '18 at 22:18
  • so, what kind of problem are you trying to model. a) or b)? – Marcus Müller Nov 15 '18 at 22:19
  • You're trying to model something missing. You have to tell us what is missing from where. – Marcus Müller Nov 15 '18 at 22:19
  • The thing is: there must (usually) exist some correlation between the pixels producing the sound and the sound itself, considering all frames of a video. At least, I've been led to believe so. – Curiosity Nov 15 '18 at 22:19
  • Some frames are there for which the sound is missing. That's what I want to interpolate. – Curiosity Nov 15 '18 at 22:20
  • But, instead of just using the previous and next audio samples for the interpolation, I want to use the additional correlation of audio with the pixels. – Curiosity Nov 15 '18 at 22:21
  • Ok, but where does that sound go missing in your loss model? Is it an error in transmitting the compressed audio, or prior to compression? Aside from the optical audio tracks on good old celluloid film, audio isn't encoded in the same chunks as the frames. So, I'm really not sure what to model here. Anyway, your question is really missing the information that you want to reconstruct audio from the video! And that seems to me much more central than that it's missing for a frame... – Marcus Müller Nov 15 '18 at 22:23
  • Yes, you're correct. The audio signal that I'm getting is not synchronized with the frames. I wonder how that can be done. – Curiosity Nov 15 '18 at 22:27
  • Do you have any suggestions in this regard? – Curiosity Nov 15 '18 at 22:31
  • please start by editing your question and saying exactly what you're trying to achieve: reconstruction of audio based on video – Marcus Müller Nov 15 '18 at 22:42

2 Answers

1

In general, no. It's possible to create an MP4 file in which the video frames and the audio are completely uncorrelated (edited, recorded from different source events, synthesized using different random seeds, etc.).

However, in some cases, it might be possible to infer a correlation, such as video of a human speaker whose face is clearly visible enough to allow lip-reading algorithms to estimate some percentage of the phonemes, with a statistically significant match against a portion of the non-corrupted audio, etc.
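One way to check whether such a correlation is actually usable is a lag scan of the normalised cross-correlation between a candidate video feature and the audio envelope, which also reveals the audio/video offset. The sketch below uses synthetic stand-ins (a hypothetical per-frame "mouth openness" feature, not real lip-reading output) with a known 3-frame lag built in:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: a per-frame "mouth openness" feature and the
# audio amplitude envelope resampled to the video frame rate. The envelope
# is constructed to lag the feature by 3 frames, plus noise.
n = 500
mouth_openness = rng.random(n)
audio_envelope = np.roll(mouth_openness, 3) + 0.3 * rng.random(n)

# Normalised cross-correlation (Pearson) between two equal-length series.
def ncc(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return np.dot(x, y) / len(x)

# Scan a window of lags: a clear peak at a plausible offset is evidence of
# a correlation worth exploiting; a flat profile means there is none.
lags = range(-10, 11)
scores = [ncc(mouth_openness, np.roll(audio_envelope, -lag)) for lag in lags]
best_lag = list(lags)[int(np.argmax(scores))]
```

On this toy data the scan recovers the built-in 3-frame offset; on real footage the peak height tells you how much of the audio the feature can explain at all.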

hotpaw2
0

As with many DSP tasks, there are of course people in artificial intelligence working on this, using the ubiquitous so-called deep learning. As noted by @hotpaw2, long gaps are unlikely to be filled from a single video. However, interpolation of missing audio is also called "audio inpainting", and there is a recent paper on using multimodality, e.g. video + audio, for that purpose, with examples and code.

Multi-modality perception is essential to develop interactive intelligence. In this work, we consider a new task of visual information-infused audio inpainting, i.e., synthesizing missing audio segments that correspond to their accompanying videos. We identify two key aspects for a successful inpainter: (1) It is desirable to operate on spectrograms instead of raw audios. Recent advances in deep semantic image inpainting could be leveraged to go beyond the limitations of traditional audio inpainting. (2) To synthesize visually indicated audio, a visual-audio joint feature space needs to be learned with synchronization of audio and video. To facilitate a large-scale study, we collect a new multi-modality instrument-playing dataset called MUSIC-Extra-Solo (MUSICES) by enriching MUSIC dataset. Extensive experiments demonstrate that our framework is capable of inpainting realistic and varying audio segments with or without visual contexts. More importantly, our synthesized audio segments are coherent with their video counterparts, showing the effectiveness of our proposed Vision-Infused Audio Inpainter (VIAI). Code, models, dataset and video results are available at this https URL
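As a toy illustration of the paper's first point (operate on spectrograms rather than raw samples) — emphatically not the paper's deep model — one can inpaint a short gap by interpolating STFT frames. Everything here is synthetic: the test tone, the gap position, and the linear-interpolation baseline are illustrative assumptions only.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)  # 1 s test tone

# Work in the STFT domain, as the paper suggests, instead of on raw samples.
f, frames, Z = stft(audio, fs=fs, nperseg=256)

# Simulate a gap of missing audio: drop a run of STFT frames.
gap = slice(20, 30)
Z_missing = Z.copy()
Z_missing[:, gap] = 0

# Toy baseline (not the paper's VIAI network): linearly interpolate each
# frequency bin across the gap from the surrounding valid frames.
left, right = Z[:, gap.start - 1], Z[:, gap.stop]
for i, col in enumerate(range(gap.start, gap.stop)):
    a = (i + 1) / (gap.stop - gap.start + 1)
    Z_missing[:, col] = (1 - a) * left + a * right

# Back to the time domain.
_, audio_filled = istft(Z_missing, fs=fs, nperseg=256)
```

Even this crude baseline smears phase across the gap; the deep model in the paper exists precisely because naive spectrogram interpolation cannot synthesize the content, only bridge it.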

Laurent Duval