In the book here, liftering is applied as the final step of MFCC feature extraction to isolate the system component, either by multiplying the whole cepstrum by a rectangular window centred on the lower quefrencies or by de-emphasizing the higher cepstral coefficients.

However, other references (see the discussion here) use sinusoidal liftering, defined as:

$$w_i=1+\frac{D}{2}\sin\Big(\frac{\pi i}{D}\Big)$$
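
To make the formula concrete, here is a minimal NumPy sketch of how I understand it. The function name `sinusoidal_lifter` is hypothetical, and the index range $i = 1, \dots, N$ (with the 0th coefficient dropped) is my assumption; some implementations start the index at 0 instead.

```python
import numpy as np

def sinusoidal_lifter(num_ceps, D):
    """Weights w_i = 1 + (D/2) * sin(pi * i / D) for i = 1..num_ceps.

    Assumption: the index starts at 1 because the 0th coefficient is dropped.
    """
    i = np.arange(1, num_ceps + 1)
    return 1.0 + (D / 2.0) * np.sin(np.pi * i / D)

# The two candidate values of D from the question, for 12 retained coefficients.
w22 = sinusoidal_lifter(12, D=22)
w12 = sinusoidal_lifter(12, D=12)

# Liftering is an element-wise multiplication along the coefficient axis.
# `mfcc` is random stand-in data of shape (num_frames, 12), not real features.
mfcc = np.random.default_rng(0).standard_normal((5, 12))
liftered = mfcc * w12
```

Under these assumptions the weights multiply each coefficient independently, so the choice of $D$ only changes the shape of the weighting curve across the 12 coefficients.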

Hence, I want to understand the following points:

  • Which liftering step should I use for speech recognition, and what is the difference between the two approaches above?

  • What is the parameter $D$ in the formula above, and should I set it to 12 or 22 if I'm retaining MFCC coefficients 2 to 13 (inclusive)?

Update1: According to this blog:

One may apply sinusoidal liftering to the MFCCs to de-emphasize higher MFCCs which has been claimed to improve speech recognition in noisy signals.

How can applying this lifter de-emphasize the higher MFCCs when the plot of the variable lift (in the blog) looks like this:

[Figure: plot of the variable lift from the blog]

Update2:

It seems to me that sinusoidal liftering is used when we are dealing with MFCCs (the DCT of the logarithmic filter-bank energies), while low-time liftering is used when we are dealing with the IDFT of the log magnitude spectrum, to isolate the vocal tract components.

Since the DCT is, in this case, equivalent to the IDFT, it seems to me that keeping cepstral coefficients 2-13 (inclusive) and discarding the rest is equivalent to low-time liftering: it isolates the vocal tract components and drops the source components (which contain, e.g., the F0 spike). The sinusoidal lifter is then used to emphasize the middle cepstral coefficients (according to the comments below).
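
To check my understanding of low-time liftering on the IDFT of the log magnitude spectrum, here is a rough NumPy sketch. The function name `low_time_lifter`, the cutoff of 13 quefrency bins, and the random stand-in frame are all my assumptions for illustration; this is not taken from the book.

```python
import numpy as np

def low_time_lifter(frame, cutoff):
    """Keep quefrencies below `cutoff` samples of the real cepstrum.

    Returns (log_mag, liftered, envelope), where `envelope` is the smoothed
    log-magnitude spectrum rebuilt from the kept quefrencies.
    """
    # Real cepstrum: IDFT of the log magnitude spectrum.
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)
    cepstrum = np.fft.ifft(log_mag).real

    # Rectangular window on the low quefrencies. The real cepstrum is
    # symmetric, so the mirrored (negative-quefrency) bins are kept too.
    window = np.zeros(len(cepstrum))
    window[:cutoff] = 1.0
    if cutoff > 1:
        window[-(cutoff - 1):] = 1.0

    liftered = cepstrum * window
    # Back to the frequency domain: a smoothed log spectrum.
    envelope = np.fft.fft(liftered).real
    return log_mag, liftered, envelope

# Stand-in for a windowed speech frame (random noise, not real speech).
rng = np.random.default_rng(0)
frame = rng.standard_normal(256) * np.hamming(256)
log_mag, liftered, envelope = low_time_lifter(frame, cutoff=13)
```

If this reading is right, `envelope` is a smoothed version of `log_mag`, and truncating the MFCCs to the first few coefficients plays a similar role on the mel filter-bank energies.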

Also, according to the reference here (page 4), the variable D in the formula above is equal to the number of cepstral coefficients (i.e. 12 in my case). In this case, the plot of this lifter looks like this:

[Figure: plot of the sinusoidal lifter for D = 12]

It also seems more logical to de-emphasize the higher MFCCs, which has been claimed to improve speech recognition in noisy signals, but why should we de-emphasize the lower MFCCs too?
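
Working the formula through at its endpoints and midpoint (my own arithmetic, assuming the index $i$ runs from 0 to $D$):

$$w_0 = 1+\frac{D}{2}\sin(0) = 1,\qquad w_{D/2} = 1+\frac{D}{2}\sin\Big(\frac{\pi}{2}\Big) = 1+\frac{D}{2},\qquad w_D = 1+\frac{D}{2}\sin(\pi) = 1$$

So for $D = 12$ the weights rise from about 1 to $w_6 = 7$ and fall back to $w_{12} = 1$, while for $D = 22$ the peak is at $w_{11} = 12$ and $w_{12} \approx 11.89$ is still close to it. Since $\sin(\pi i/D) \ge 0$ for $0 \le i \le D$, no weight drops below 1; the low and high coefficients are only smaller relative to the middle ones.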

UPD. @jojek's comment: It's not about de-emphasizing as much as emphasizing the middle.

So, why do we emphasize the middle? And should I choose $D = 12$ to emphasize the middle?

Abdulkader
  • Are you planning to use DTW or GMM's? The liftering was used to emphasize the medium-range coefficients. All those liftering schemes were tweaked by researchers in order to obtain the best performing one for certain applications. This also means that liftering used for speech recognition task might not work well for identity verification or cough recognition. Nowadays, there is no need to optimize the variance of the coefficients, since systems are either using plain Mel-energies or do the input normalization. – jojeck Jan 28 '20 at 18:08
  • @jojek, My application is Speech Command Recognition Using Deep Learning, and I want to feed the MFCC features to a Convolutional Neural Network – Abdulkader Jan 28 '20 at 18:15
  • MFCC’s into CNN? That is rather unusual. Why not plain Mel energies? – jojeck Jan 28 '20 at 18:19
  • @jojek Yes, take a look here, and here. – Abdulkader Jan 28 '20 at 18:23
  • I can bet that given enough data, CNN on Mel energies will outperform it. You should try it. It makes more sense to do convolution on Mel spectrogram rather than on decorrelated coefficients. – jojeck Jan 28 '20 at 18:29
  • @jojek Thank you, I'll try doing convolution on the Mel spectrogram, but why does it outperform MFCCs? However, I'm dealing with Arabic speech commands, and I collected the dataset myself from some contributors, so I have roughly 300 wave files per class. I also used data augmentation techniques like time-shifting and time-stretching. Now I need to understand the points in my question, so if you can answer me or help me edit my question to avoid downvotes, I'll really appreciate it. – Abdulkader Jan 28 '20 at 18:48
  • I would not worry so much about "negative votes" in this case but perhaps @jojek could write his response up as an answer (?). This is to get a chance to close the question gracefully. Otherwise, perhaps this can happen after the question is modified (?). – A_A Jan 29 '20 at 10:09
  • @A_A He advised me to try CNN on mel energies, and I'll do that. But, I still want to understand the points that I mentioned in my question, please! – Abdulkader Jan 29 '20 at 10:26
  • Could you please tell me how to modify my question so that someone might answer it? Should I add more details or be more specific on some points? I still don't know what is wrong with my question, and I'm trying my best as a non-native speaker. Thank you. – Abdulkader Jan 29 '20 at 10:36
  • It's not about de-emphasizing as much as emphasizing the middle. – jojeck Jan 30 '20 at 11:10
  • @jojek Could you see my question, in this link, about your comment, please? – Abdulkader Feb 27 '20 at 20:04

0 Answers