
I am building an application that will "listen" to the microphone input, analyse it, and compare the analysis against a small, pre-analysed and pre-classified sound bank (at most 20 sounds). It will then tell the user which sound it was.

Now, I have a vague idea of how to implement this. I would like to choose a set of features that best represents the sounds. The issue is that the sounds in the sound bank could be whatever the user recorded: anything from strong onsets and short durations to slow onsets and long durations.

The current features I'm thinking of are:

  • Spectral Centroid
  • Spectral Flux
  • Spectral Rolloff

What do you think? Would these be sufficient to properly classify the sounds? Also, since each of these features outputs a single value per sound buffer (frame), how would you go about building the feature vector that represents the whole sound? I am using kNN for classification, and I was wondering what the best way to compare two feature vectors is. Would cross-correlation be a feasible technique?
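For reference, the three features above can be computed per frame with nothing more than an FFT. The sketch below is one common formulation (assumptions: a Hann window, an 85% rolloff threshold, and flux measured against the previous frame's magnitude spectrum; `frame_features` is a made-up name):

```python
import numpy as np

def frame_features(frame, prev_mag, sr):
    """Spectral centroid, flux, and rolloff for one audio frame (a sketch)."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Centroid: magnitude-weighted mean frequency.
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
    # Flux: how much the spectrum changed since the previous frame.
    flux = np.sqrt(np.sum((mag - prev_mag) ** 2))
    # Rolloff: frequency below which 85% of the magnitude lies.
    cumulative = np.cumsum(mag)
    rolloff_bin = np.searchsorted(cumulative, 0.85 * cumulative[-1])
    rolloff = freqs[min(rolloff_bin, len(freqs) - 1)]
    return np.array([centroid, flux, rolloff]), mag
```

Iterating this over consecutive frames and stacking the returned vectors gives one frames-by-features matrix per sound, which is the representation the comments below build on.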

Thanks a lot!

P.S. I have seen that a similar question was asked here, but it doesn't fully answer my issues.

nevos
  • I don't know what accuracy or FP/FA rate you want to achieve, but for a variety of unknown sounds these features won't be enough. For kNN analysis they could be a starting point, though. As for the features, extract them frame by frame and average the results. – jojeck Mar 10 '15 at 21:28
  • Thanks @jojeck. What other features would you suggest? Also, when you say average, do you mean just take the result from each frame and do a simple average, i.e. the sum of all elements in the feature vector divided by the number of frames analysed? – nevos Mar 10 '15 at 21:46
  • If, for example, I add MFCCs to the equation, would that be sufficient? How would I go about comparing MFCCs? AFAIK, MFCC extraction outputs a set of coefficients; do I then compare those coefficients against the other sounds? – nevos Mar 10 '15 at 22:00
  • I can't help with telling you which features to use, but a very easy way to measure the similarity of two vectors is the 2-norm, norm(a - b). If you aren't familiar with it: in its simplest form it is the Euclidean distance between the two vectors. This will only work if every feature is equally important; otherwise you might have to weight the vectors before taking the norm. – andrew Mar 10 '15 at 22:12
  • Thanks @andrew. And what if the vectors are of different lengths? – nevos Mar 10 '15 at 22:18
  • If they are of different lengths you can't really compare them directly. You would need some way to map one feature set onto the other, either by eliminating information (Gram-Schmidt) or by adding it. But it would be best to have the same features describing every sample; otherwise you are "comparing apples to oranges", as the saying goes. – andrew Mar 10 '15 at 22:26
  • Well, I would have the same features describing every sound, but the sound itself could be shorter or longer, and therefore made up of a different number of buffers. For example, a short sound could be fully captured in 4 buffers, and a long one in 20. – nevos Mar 10 '15 at 22:33
  • Let's keep it simple, since you are not after a sophisticated solution. You have two signals of different lengths, and for each you extract the same feature vectors, which gives you matrices with the same number of columns. Let's say you have a few samples of each sound. Now an incoming sample arrives and you are trying to classify it. You find its boundaries (activity detection, whatever) and calculate its distance to each representative sample of each class using DTW. Then you make your decision with the kNN classifier. As simple as that. When it comes to features: experiment and pick the best ones; three won't do. – jojeck Mar 11 '15 at 08:27
  • @jojeck thank you! Really helpful stuff. Do you know of any activity detection and DTW implementations for iOS? – nevos Mar 11 '15 at 09:54
  • I've never had an iPhone in my hand; I am writing this from my Nexus and don't know how to write 'Hello World' in Objective-C. Google is your friend. – jojeck Mar 11 '15 at 10:06
  • Never had an iPhone in your hand!? O_o Thanks a lot for your help! You can post your comment as an answer and I'll mark it as accepted if you want. – nevos Mar 11 '15 at 10:15
  • I am a Linux guy with an old, robust ThinkPad, not a shiny apple ;-) DTW is a few lines of code anyway, so you won't have a problem. I will write an answer with more details at some point. – jojeck Mar 11 '15 at 10:34
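The DTW-plus-kNN approach suggested in the comments really is only a few lines. A minimal sketch, assuming each sound has already been turned into a frames-by-features matrix (`dtw_distance` and `classify_knn` are made-up names, and the classifier here compares the query against every sample in the bank):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature matrices
    (rows = frames, columns = features); handles different lengths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def classify_knn(query, bank, k=3):
    """bank: list of (feature_matrix, label) pairs.
    Returns the majority label among the k nearest samples."""
    dists = sorted((dtw_distance(query, feats), label) for feats, label in bank)
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)
```

Because DTW aligns the two frame sequences before summing the per-frame distances, a 4-buffer sound and a 20-buffer sound can be compared directly, which addresses the different-lengths problem raised above.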

2 Answers


Have a look at librosa, a simple Python library for audio analysis that implements the common features. Here is a great introduction and example notebook.

ruoho ruotsi

https://github.com/jsingh811/pyAudioProcessing

Using the above library you can extract MFCCs as features for audio classification tasks and feed them to a variety of sklearn classifiers.
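The last step could look like the sketch below (this is generic scikit-learn, not pyAudioProcessing's own API; the 13-dimensional "mean MFCC" vectors and the class names are synthetic stand-ins for real extracted features):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: one 13-dimensional mean-MFCC vector per training sound.
# Real vectors would come from an MFCC extractor; these are synthetic.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (10, 13)),    # class "clap"
                     rng.normal(5.0, 1.0, (10, 13))])   # class "whistle"
y_train = ["clap"] * 10 + ["whistle"] * 10

# kNN, as discussed in the question; k=3 is an arbitrary choice.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

query = rng.normal(5.0, 1.0, (1, 13))  # an unseen "whistle"-like sound
prediction = knn.predict(query)[0]
```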

jsingh