69

I have to do cross-correlation of two audio files to prove they are similar. I have taken the FFT of the two audio files and have their power spectrum values in separate arrays.

How should I proceed further to cross-correlate them and prove that they are similar? Is there a better way to do it? Any basic ideas will help me learn and apply them.

Lorem Ipsum
  • Given the cross-correlation of two random signal vectors, how do you implement the reverse to get the two vectors back in MATLAB? – John Muhehe Feb 24 '15 at 11:40

8 Answers

70

Cross-correlation and convolution are closely related. In short, to do convolution with FFTs, you

  1. zero-pad the input signals a and b (add zeros to the end of each, so that both vectors reach a length of at least N = size(a)+size(b)-1)
  2. take the FFT of both signals
  3. multiply the results together (element-wise multiplication)
  4. do the inverse FFT

conv(a, b) = ifft(fft(a_and_zeros) * fft(b_and_zeros))

You need to do the zero-padding because the FFT method actually computes circular convolution/correlation, meaning the signal wraps around at the ends. So you add enough zeros that the wrapped-around portion has nothing to overlap with, simulating a signal that is zero out to infinity.

To get cross-correlation instead of convolution, you either need to time-reverse one of the signals before doing the FFT, or take the complex conjugate of one of the signals after the FFT:

  • corr(a, b) = ifft(fft(a_and_zeros) * fft(b_and_zeros[reversed]))
  • corr(a, b) = ifft(fft(a_and_zeros) * conj(fft(b_and_zeros)))

whichever is easier with your hardware/software. For autocorrelation (cross-correlation of a signal with itself), it's better to do the complex conjugate, because then you only need to calculate the FFT once.
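
As a quick sanity check of that equivalence, here is a small NumPy sketch with made-up four-sample signals. Note that the reversal has to be circular (modulo-N); a plain flip of a padded vector gives the same values circularly shifted by one sample:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 0.0])
b = np.array([0.5, -1.0, 2.0, 0.0])
via_conj    = np.fft.ifft(np.fft.fft(a) * np.conj(np.fft.fft(b)))
b_rev       = np.roll(b[::-1], 1)            # circular (modulo-N) time reversal: keep sample 0, flip the rest
via_reverse = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b_rev))
print(np.allclose(via_conj, via_reverse))    # True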

If the signals are real, you can use real FFTs (RFFT/IRFFT) and save half your computation time by only calculating half of the spectrum.

Also you can save computation time by padding to a larger size that the FFT is optimized for (such as a 5-smooth number for FFTPACK, a ~13-smooth number for FFTW, or a power of 2 for a simple hardware implementation).
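
Putting the steps above together, a rough sketch in Python/NumPy might look like this (the helper name fft_xcorr and the toy signals are made up; real-valued 1-D inputs are assumed):

import numpy as np

def fft_xcorr(a, b):
    """FFT-based cross-correlation of two real 1-D signals (rough sketch)."""
    n = len(a) + len(b) - 1               # minimum length that avoids circular wrap-around
    nfft = 1 << (n - 1).bit_length()      # round up to a power of 2 (optional, for speed)
    A = np.fft.rfft(a, nfft)              # rfft zero-pads the inputs to length nfft
    B = np.fft.rfft(b, nfft)
    c = np.fft.irfft(A * np.conj(B), nfft)
    # indices 0..len(a)-1 hold lags 0, +1, +2, ...; the last len(b)-1 entries hold lags ..., -2, -1
    return np.concatenate((c[-(len(b) - 1):], c[:len(a)]))   # lags -(len(b)-1) .. +(len(a)-1)

# toy usage: b is a copy of a delayed by 30 samples
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)
b = np.concatenate((np.zeros(30), a))[:1000]
offset = np.argmax(fft_xcorr(a, b)) - (len(b) - 1)
print(offset)                             # -30 with this convention, since b is a delayed by 30 samples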

Here's an example in Python of FFT correlation compared with brute-force correlation: https://stackoverflow.com/a/1768140/125507

This will give you the cross-correlation function, which is a measure of similarity vs. offset. The offset at which the waves are "lined up" with each other shows up as a peak in the correlation function:

[figure: peak in the correlation function]

The x value of the peak is the offset, which could be negative or positive.

I've only seen this used to find the offset between two waves. You can get a more precise estimate of the offset (better than the resolution of your samples) by using parabolic/quadratic interpolation on the peak.
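
A minimal sketch of that parabolic fit, passing a parabola through the peak sample and its two neighbours (y is a hypothetical name for the correlation array; assumes the peak is not at either end):

import numpy as np

i = int(np.argmax(y))
delta = 0.5 * (y[i - 1] - y[i + 1]) / (y[i - 1] - 2 * y[i] + y[i + 1])
refined_offset = i + delta    # peak position with sub-sample precision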

To get a similarity value between -1 and 1 (a negative value indicating one of the signals decreases as the other increases) you'd need to scale the amplitude according to the length of the inputs, length of the FFT, your particular FFT implementation's scaling, etc. The autocorrelation of a wave with itself will give you the value of the maximum possible match.
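
For instance, one rough way to get such a value, reusing the hypothetical fft_xcorr helper sketched above (this simple scaling under-reports matches at extreme lags, where only a few samples overlap):

import numpy as np

c = fft_xcorr(a, b)
# correlation value at the strongest lag, scaled by the signal norms; near +1 or -1 means a strong match
similarity = c[np.argmax(np.abs(c))] / (np.linalg.norm(a) * np.linalg.norm(b))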

Note that this will only work on waves that have the same shape. If they've been sampled on different hardware or have some noise added, but otherwise still have the same shape, this comparison will work. But if the wave shape has been changed by filtering or phase shifts, the signals may sound the same yet won't correlate as well.

endolith
  • The zero padding should be at least N = size(a)+size(b)-1, preferably rounded up to a power of 2. To get a value between -1 and 1, divide by norm(a)*norm(b), which gives the cosine of the angle between the two vectors in N-space for the given lag (i.e. circular shift modulo N). At the extreme lags, there aren't many overlapping samples (only one at the far extreme), so dividing by norm(a)*norm(b) will bias these correlations toward 0 (i.e. showing their relative orthogonality in N-space). – Eryk Sun Oct 30 '10 at 01:41
  • I think there may be an error in the description. Shouldn't multiplying the FFTs together term by term give the FFT of the convolution of the signals, not the FFT of the cross-correlation? As I understand it, to get the FFT of the cross-correlation, it is necessary to use the complex conjugate of one of the FFT vectors in the term-by-term multiplications before taking the iFFT. – Dilip Sarwate Nov 30 '11 at 20:36
  • @DilipSarwate: Yes, you're right. You can also reverse one signal in the time direction, which I added to the answer. – endolith Nov 30 '11 at 20:49
  • May I suggest that you include complex conjugation in lieu of time reversal, or include both ways of doing it but put complex conjugation first? While time reversal is easy to do in software, it can be challenging in hardware, and this question was originally asked in the electronics.SE forum. Also, when adapting to autocorrelation, a newbie will not be tempted to do two FFTs, one for a and one for a reversed. – Dilip Sarwate Nov 30 '11 at 21:12
  • @DilipSarwate: Why is time reversal hard to do in hardware? – endolith Nov 30 '11 at 21:23
  • "Why is time reversal hard to do in hardware?" In lots of cases, data are stored in systolic arrays in the expectation that computations are local, i.e. $x[i]$, stored in the $i$-th cell, interacts only with its nearest neighbors $x[i \pm 1]$. Sending $x[i]$ to cell #$(N-i)$ and sending $x[N-i]$ to cell #$i$, and doing this for all $i$, increases wiring costs and wiring delays (and hence reduces the maximum achievable clock rate), and also, because all the wires must cross over each other, creates routing problems. It should be avoided if possible, and in this case, it is avoidable. – Dilip Sarwate Nov 30 '11 at 22:05
  • fft(a and zeroes) * conj(fft(b and zeroes)) results in a matrix or a scalar depending on column or row vector input; should it be fft(a and zeroes) .* conj(fft(b and zeroes)) (pointwise multiplication instead of matrix multiplication)? – Leo Jan 22 '14 at 13:44
  • @Leo Element-wise multiplication: an n-by-1 array times an n-by-1 array gives an n-by-1 array. I called this "sample-by-sample" in the answer. – endolith Jan 22 '14 at 20:34
    Ah, thanks. I thought you meant that, but I wasn't sure. I think your edit made it more clear. – Leo Jan 23 '14 at 13:53
  • @DilipSarwate I think the point is that conj(fft(b and zeroes)) is of complexity O(n), while the fft(reverse(b and zeros)) is of complexity O(n log n), so we prefer the former option. – ady May 08 '18 at 22:41
  • @endolith "To get a similarity value between -1 and 1 (a negative value indicating one of the signals decreases as the other increases) you'd need to scale the amplitude according to the length of the inputs, length of the FFT, your particular FFT implementation's scaling, etc. The autocorrelation of a wave with itself will give you the value of the maximum possible match." How do I achieve this? With respect to my question here? – skrowten_hermit Sep 21 '20 at 09:58
  • @endolith Why doesn't "corr(a, b) = ifft(fft(a_and_zeros) * fft(b_and_zeros[reversed]))" look the same as "corr = signal.correlate(a, b, mode='same')"? The signal module is from the scipy library. – Stephen305 Nov 18 '20 at 20:06
  • @Stephen305 If you use mode='full' they should do the same thing – endolith Nov 18 '20 at 20:22
  • @VMMF I thought "at least half of the wave is blank" was more readable and understandable than the actual math – endolith Dec 20 '21 at 17:32
  • For digital signals conj(fft(x)) does modulo N time reversal, not true time reversal. So you'll either have to multiply a linear phase on in the frequency domain to shift by one sample in addition to the conjugation, or do the flipping in the time domain, which can be inefficient (as has been pointed out already). – Gillespie Dec 29 '22 at 01:48
18

Correlation is a way to express the similarity of two time series (audio samples in your case) in one number. It is an adaptation of covariance, which can be implemented as follows:

period = 1/sampleFrequency;
covariance = 0;

for (iSample = 0; iSample < nSamples; iSample++) {
    covariance += (timeSeries_1[iSample] * timeSeries_2[iSample]) / period;
    // dividing by `period` might not even be necessary
}

The correlation is the normalized version of covariance, which is the covariance divided by the product of the standard deviations of both the time series. The correlation will yield a 0 when there is no correlation (totally not similar) and a 1 for total correlation (totally similar).
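
For reference, a one-line sketch of that normalization in NumPy (a and b are hypothetical equal-length sample arrays; note the value can also go down to -1 for anti-correlated signals, as a comment below points out):

import numpy as np

# covariance normalized by the standard deviations (Pearson's r)
r = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) * a.std() * b.std())
# equivalently: np.corrcoef(a, b)[0, 1]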

You can imagine that two sound samples might be similar but are not synchronized. That's where cross-correlation comes in. You calculate the correlation between the time series repeatedly, each time with one of them shifted by one more sample:

for (iShift = 0; iShift < nSamples; iShift++)
    xcorr[iShift] = corr(timeSeries_1, timeSeries_2_shifted_by_iShift_samples);

Then seek out the maximum value in the xcorr series and you're done (or stop early once you've found a sufficient correlation). There's slightly more to it, of course: you must implement the standard deviation, do some memory management, and implement the time-shifting. If all your audio samples are equal in length, you might do without normalizing the covariance and go ahead and calculate the cross-covariance.
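
A compact sketch of the same idea in Python/NumPy (hypothetical function name; np.correlate performs the shift-and-sum loop described above):

import numpy as np

def normalized_xcorr(x, y):
    """Time-domain normalized cross-correlation (sketch); x and y are 1-D sample arrays."""
    x = x - np.mean(x)
    y = y - np.mean(y)
    c = np.correlate(x, y, mode='full')      # the shift-multiply-sum loop over all lags
    return c / (np.linalg.norm(x) * np.linalg.norm(y))

# a peak near +1 (or -1) means the two recordings are very similar at that lag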

A cool relation to your earlier question: Fourier analysis is just an adaptation of the cross covariance. Rather than shifting one time series and calculating the covariances with the other signal, you calculate the covariances between one signal and a number of (co)sine waves with different frequencies. It's all based on the same principle.
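
To make that connection concrete, a tiny sketch: the k-th DFT bin is just the inner product of the signal with a complex sinusoid (x is a hypothetical 1-D sample array and k a bin index):

import numpy as np

n = np.arange(len(x))
bin_k = np.sum(x * np.exp(-2j * np.pi * k * n / len(x)))   # equals np.fft.fft(x)[k]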

  • You mentioned that 0 is no correlation and 1 is total correlation. I just want to note that -1 is completely negatively correlated. As in, -1 implies that sample 1 is the opposite of sample 2. If you think about it on an X,Y graph, it is a line with positive slope versus a line with negative slope. And as you get closer to 0 the line gets "fatter". – Kellenjb Oct 29 '10 at 13:23
  • @Kellenjb Yes, but I would word it as: the magnitude of the correlation is what you are probably interested in. A 1 or a -1 means the signals directly affect each other. – Kortuk Oct 29 '10 at 14:52
18

In signal processing the cross-correlation (xcorr in MATLAB) is a convolution operation with one of the two sequences reversed. Since time reversal corresponds to complex conjugation in the frequency domain, you can use the DFT to compute the cross-correlation as follows:

R_xy = ifft(fft(x,N) * conj(fft(y,N)))

where N = size(x) + size(y) - 1 (preferably rounded up to a power of 2) is the length of the DFT.

Multiplication of DFTs is equivalent to circular convolution in time. Zero padding both vectors to length N keeps the circularly shifted components of y from overlapping with x, which makes the result identical to the linear convolution of x and time reversed y.

A lag of 1 is a right circular shift of y, while a lag of -1 is a left circular shift. The cross-correlation is simply the sequence of dot products for all lags. Based on standard fft ordering, these will be in an array that can be accessed as follows. Indices 0 through size(x)-1 are the positive lags. Indices N-size(y)+1 to N-1 are the negative lags in reverse order. (In Python the negative lags can be accessed conveniently with negative indices such as R_xy[-1].)
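
In Python/NumPy that layout looks roughly like this (a sketch; x and y are hypothetical 1-D arrays):

import numpy as np

N = len(x) + len(y) - 1
R_xy = np.fft.ifft(np.fft.fft(x, N) * np.conj(np.fft.fft(y, N)))
lag_0  = R_xy[0]          # zero lag
lag_p1 = R_xy[1]          # lag +1
lag_m1 = R_xy[-1]         # lag -1, same element as R_xy[N-1]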

You can think of the zero-padded x and y as N-dimensional vectors. The dot product of x and y for a given lag is |x|*|y|*cos(theta). The norms of x and y are constant for circular shifts, so dividing them out leaves just the varying cosine of the angle theta. If x and y (for a given lag) are orthogonal in N-space, the correlation is 0 (i.e. theta = 90 degrees). If they're co-linear, the value is either 1 (positively correlated) or -1 (negatively correlated, i.e. theta = 180 degrees). This leads to the cross-correlation normalized to unity:

R_xy = ifft(fft(x,N) * conj(fft(y,N))) / (norm(x) * norm(y))

This can be made unbiased by recomputing the norms for just the overlapping parts, but then you may as well do the entire computation in the time domain. Also, you'll see different versions of normalization. Instead of being normalized to unity, sometimes the cross-correlation is normalized by M (biased), where M = max(size(x), size(y)), or M-|m| (an unbiased estimate of the mth lag).

For maximum statistical significance the mean (DC bias) should be removed before computing the correlation. This is called the cross-covariance (xcov in MATLAB):

x2 = x - mean(x)
y2 = y - mean(y)
phi_xy = ifft(fft(x2,N) * conj(fft(y2,N))) / (norm(x2) * norm(y2))
Eryk Sun
  • Does this mean the final size of the array should be 2*size(a) + size(b) - 1 or 2*size(b) + size(a) - 1? But in either case the two padded arrays are of different sizes. What is the consequence of padding with too many zeros? –  Nov 29 '11 at 01:20
  • @RobertK The cross-correlation array needs to be of length at least the sum of lengths of a and b (minus one) as eryksun says in his answer. For simplicity, the length is often taken to be twice the length of the longer vector (sometimes rounded up to the next larger power of $2$ in order to use an efficient FFT). The choice helps when the customer belatedly decides he also wants the autocorrelation of the longer vector. One consequence of padding with too many zeroes is additional computation but this might be ameliorated by more efficient FFT implementations. – Dilip Sarwate Nov 30 '11 at 20:54
  • @RobertKJ: You're sliding b along a, with one output per shift and a minimum overlap of one sample. That yields size(a) positive lags and size(b) - 1 negative lags. Using the inverse transform of the product of N-point DFTs, indices 0 through size(a)-1 are the positive lags, and indices N-size(b)+1 through N-1 are the negative lags in reverse order. – Eryk Sun Nov 30 '11 at 23:56
3

A quick and simple way to compare audio files: take the audio file, make a copy, and paste them side by side on two stereo channels in a DAW. Invert the phase on one of the tracks, align both files at the beginning (zoomed in), and make sure both files have the same amplitude at the start, then play. If there is total silence, the files are identical; if there is a difference, you'll hear it pretty clearly.

user31971
3

If you are using MATLAB, try the cross-correlation function:

c= xcorr(x,y)

Here's the Matlab documentation:

xcorr estimates the cross-correlation sequence of a random process. Autocorrelation is handled as a special case.

...

c = xcorr(x,y) returns the cross-correlation sequence in a length 2*N-1 vector, where x and y are length N vectors (N > 1). If x and y are not the same length, the shorter vector is zero-padded to the length of the longer vector.

[cross-correlation equation: http://www.mathworks.com/help/toolbox/signal/ref/eqn1263487323.gif]
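
For comparison, a rough Python/SciPy equivalent (a sketch; the array names are hypothetical, correlation_lags requires a newer SciPy release, and the normalization options differ from MATLAB's):

import numpy as np
from scipy import signal

c = signal.correlate(x, y, mode='full')                       # cross-correlation at every lag
lags = signal.correlation_lags(len(x), len(y), mode='full')   # lag corresponding to each element of c
offset = lags[np.argmax(c)]                                   # lag where the signals line up best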

smashtastic
1

The easiest way to find the difference, IMO, is to subtract the two audio signals in the time domain. If they are equal, the result at every time point will be zero. If they are not equal, the difference between them will be left after subtraction, and you can listen to it directly. A quick measure of how similar they are is the RMS value of this difference. This is often done in audio mixing and mastering to hear the difference between an MP3 and a WAV file, for example. (Inverting the phase of one signal and adding them is the same as subtracting; this is the method used when this is done in DAW software.) The signals must be perfectly time-aligned for this to work. If they are not, you could develop an algorithm to align them, such as detecting the top ten peaks, calculating the average offset of the peaks, and shifting one signal.
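
A minimal sketch of that check in NumPy (assumes the hypothetical arrays a and b are already time-aligned and the same length):

import numpy as np

diff = a - b
rms_difference = np.sqrt(np.mean(diff ** 2))   # zero means the two files are identical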

Transforming to the frequency domain and comparing the power spectra of the signals, as you propose, ignores some time-domain information. For example, audio played in reverse has the same power spectrum as when played forward. Thus, two very different audio signals could have the exact same spectrum.

1

There is dedicated software to do that: audiodiffmaker.

Peter K.
0

As most here wrote, you should use correlation.

Just take two factors into consideration:

  1. If the volume is scaled differently, you should normalize the correlation.
  2. If there is scaling in time, you may use Dynamic Time Warping.
David