How to combine Neural Network and Audio like Classify?

Question

I want to combine Audio objects with neural networks. The first thing that comes to mind is Classify:

windInstrument = ExampleData[{"Sound", #}] & /@ {"AltoFlute", "AltoSaxophone", "BassClarinet", "BassFlute", "Flute", "FrenchHorn", "Oboe", "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = ExampleData[{"Sound", #}] & /@ {"Cello", "CelloPizzicato", "DoubleBass", "DoubleBassPizzicato", "OrganChord", "Viola", "Violin"};

c = Classify[<|"windInstrument" -> windInstrument, "nonwindInstrument" -> nonwindInstrument|>, Method -> "NeuralNetwork"]
ClassifierInformation[c]
c[ExampleData[{"Sound", "Clarinet"}]]
c[ExampleData[{"Sound", "Piano"}]]

Both results are correct.

According to Options@c, Classify uses Spectrogram internally, so I tried to reproduce the process myself. Because NetEncoder doesn't support Audio, I have to transform the audio first.

There are two options: Spectrogram and AudioData.

First, I used Spectrogram with a DNN, but it can't predict correctly:

windInstrumentFourier = Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] -> "windInstrument" & /@ windInstrument;
nonwindInstrumentFourier = Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] -> "nonwindInstrument" & /@ nonwindInstrument;
trainingData = Join[windInstrumentFourier, nonwindInstrumentFourier];

All spectrogram images have the same size:

Union[ImageDimensions /@ trainingData[[All, 1]]] (*{{360, 121}}*)

net = NetChain[{FlattenLayer[], 128, LogisticSigmoid, 128, LogisticSigmoid, 2, SoftmaxLayer[]}, 
               "Input" -> NetEncoder[{"Image", {360, 121}}], 
               "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]

net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Clarinet"}]]
(*"windInstrument"*)
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Piano"}]]
(*"windInstrument"*)

Using a CNN gives a better result, but it is too slow and too complicated compared with Classify:

net = NetChain[{ConvolutionLayer[4, 3], Ramp, PoolingLayer[2, "Stride" -> 2], 
                ConvolutionLayer[6, 3], Ramp, PoolingLayer[2, "Stride" -> 2], 
                ConvolutionLayer[8, 3], Ramp, PoolingLayer[2, "Stride" -> 2], 
                32, Ramp, 2, SoftmaxLayer[]}, 
                "Input" -> NetEncoder[{"Image", {360, 121}}], 
                "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Clarinet"}]]
(*windInstrument*)
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Piano"}]]
(*nonwindInstrument*)

Then I tried using the audio samples directly:

windInstrumentSample = Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] -> "windInstrument" &/@ windInstrument;
nonwindInstrumentSample = Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] -> "nonwindInstrument" &/@ nonwindInstrument;

trainingData = Join[windInstrumentSample, nonwindInstrumentSample];

net = NetChain[{LongShortTermMemoryLayer[512, "Input" -> {"Varying", 1024}], 
                LongShortTermMemoryLayer[256], 
                SequenceLastLayer[], 128, LogisticSigmoid, 128, 
                LogisticSigmoid, 2,SoftmaxLayer[]}, 
                "Output" -> NetDecoder[{"Class", {"windInstrument","nonwindInstrument"}}]]

net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] &@ExampleData[{"Sound", "Clarinet"}]]
(*windInstrument*)
net[Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] &@ExampleData[{"Sound", "Piano"}]]
(*nonwindInstrument*)

But sometimes it's correct and sometimes it's wrong.

So how can I make this better, faster, stronger, simpler, and as stable as Classify for Audio?

PS: As you can see, the Audio objects have different attribute values, so we have to convert them to a uniform format. Does that conversion affect the result?

score 4 · Answer 1 · edited Jun 15 '17 at 09:04

One thing we have to consider before applying a deep neural network is the size of the data set. If it is very small, which is the case here (only 18 examples), a very deep neural network may not converge well.

There are several ways to deal with a small data set. One common way is transfer learning (see example here), which leverages a pretrained network and trains only a small portion of the large network. Another way is to apply data augmentation to generate more data. A third way is to use a shallow neural network and work with dimension-reduced data. I will demonstrate the last one, which is also what Classify did in your first example.

First of all, we convert the audio into spectrograms.

windInstrument = ExampleData[{"Sound", #}] & /@ {"AltoFlute","AltoSaxophone", "BassClarinet", "BassFlute", "Flute","FrenchHorn", "Oboe", "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = ExampleData[{"Sound", #}] & /@ {"Cello","CelloPizzicato", "DoubleBass", "DoubleBassPizzicato", "OrganChord","Viola", "Violin"};

audioToSpectrogram[a_] := Module[{data},
  data = SpectrogramArray[a, Automatic, Automatic, HannWindow];
  ImageResize[
   ImageAdjust@Image[Reverse@
     Transpose@Abs[data[[All, 1 ;; Dimensions[data][[2]]/2]]]], {266, 
    277}]
  ]

spectrogramData = 
  audioToSpectrogram /@ Flatten[{windInstrument, nonwindInstrument}];

We then construct a dimension reduction function from our data. Each 266 by 277 spectrogram is reduced to a vector of length 17:

dr = DimensionReduction[spectrogramData, 17, 
  Method -> "Linear"]

Now construct the training data from the dimension-reduced data:

trainingData = 
  RandomSample@
   MapAt[dr[audioToSpectrogram[#]] &, 
    Flatten[{Thread[windInstrument -> "wind"], 
      Thread[nonwindInstrument -> "nonwind"]}], {All, 1}];

Construct and train the neural network. We use only two layers:

net = NetChain[{10, Tanh, 2, Tanh, SoftmaxLayer[]}, "Input" -> 17, 
  "Output" -> NetDecoder[{"Class", {"wind", "nonwind"}}]];

trained = 
 NetTrain[net, trainingData, 
  Method -> {"SGD", "L2Regularization" -> 0.1}, 
  MaxTrainingRounds -> 500]

Evaluate on test data:

trained@
   dr@
     audioToSpectrogram[ExampleData[{"Sound", #}]] & /@ {"Clarinet", 
  "Piano", "Bassoon"}
(* {"wind", "nonwind", "wind"} *)
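The data augmentation route mentioned earlier is not demonstrated in this answer. As a rough sketch of what it could look like, the helper below produces a few slightly perturbed copies of one recording using AudioPitchShift and AudioTimeStretch. The name augment and the specific ratios are assumptions chosen for illustration, not tuned values.

```wolfram
(* augment is a hypothetical helper, not part of the answer above. *)
(* It returns the original recording plus four perturbed copies.   *)
augment[a_] := With[{au = Audio[a]},
   Join[
    {au},
    AudioPitchShift[au, #] & /@ {0.97, 1.03}, (* slightly lower / higher pitch *)
    AudioTimeStretch[au, #] & /@ {0.9, 1.1}   (* slightly shorter / longer *)
    ]];

(* One recording becomes a list of five Audio objects *)
variants = augment[ExampleData[{"Sound", "Flute"}]];
Length[variants]
```

Applied to every training recording (for example with Join @@ (augment /@ windInstrument)), this would multiply the number of examples by five. The perturbed copies should only go into the training set, never into the test set.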

partida · Accepted Answer · 2018-07-03T13:09:27.757

As of version 11.3, NetEncoder supports Audio objects.

windInstrument = 
  ExampleData[{"Sound", #}] & /@ {"AltoFlute", "AltoSaxophone", 
    "BassClarinet", "BassFlute", "Flute", "FrenchHorn", "Oboe", 
    "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = 
  ExampleData[{"Sound", #}] & /@ {"Cello", "CelloPizzicato", 
    "DoubleBass", "DoubleBassPizzicato", "OrganChord", "Viola", 
    "Violin"};

(*Convert Sound to Audio that fits NetTrain's Input*)
trainingData = Join[Thread[Audio /@ windInstrument -> "windInstrument"], 
                    Thread[Audio /@ nonwindInstrument -> "nonwindInstrument"]];

Let's construct the network.

Here I use AudioMelSpectrogram. This feature and AudioMFCC are common in speech tasks.

net = NetChain[{LongShortTermMemoryLayer[30], 
   LongShortTermMemoryLayer[10], SequenceLastLayer[], 2, 
   SoftmaxLayer[]},
  "Input"  -> NetEncoder["AudioMelSpectrogram"], 
  "Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]

net = NetTrain[net, trainingData];
types = {"Clarinet", "Piano", "Bassoon"};
net[Audio@ExampleData[{"Sound", #}]] & /@ types
(*{"windInstrument","nonwindInstrument","windInstrument"}*)

More recently, however, there has been more focus on working with the raw audio itself (NetEncoder["Audio"]), as in WaveNet and SampleRNN.

net = NetChain[{ConvolutionLayer[32, 80, "Interleaving" -> True], 
   LongShortTermMemoryLayer[10], SequenceLastLayer[], 2, 
   SoftmaxLayer[]}, "Input" -> NetEncoder["Audio"], 
  "Output" -> 
   NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]

But it uses so much memory that it isn't practical in a real application.

The features in xslittlegrass's answer are similar to NetEncoder["AudioSpectrogram"].
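To get a feeling for the difference in input size, one can compare what the two encoders produce for the same recording. This is only a sketch using default encoder settings. The variable names are placeholders, and the exact dimensions depend on the recording and on the encoder defaults in your version.

```wolfram
(* sample, rawDims and melDims are illustrative names *)
sample = Audio@ExampleData[{"Sound", "Flute"}];

(* Raw encoder: the sequence length follows the number of audio samples *)
rawDims = Dimensions@NetEncoder["Audio"][sample]

(* Mel spectrogram encoder: the sequence length follows the number of analysis frames *)
melDims = Dimensions@NetEncoder["AudioMelSpectrogram"][sample]
```

The first dimension of rawDims should be far larger than that of melDims, which means the recurrent layer has to be unrolled over many more steps when it is fed raw audio.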

PS:

This approach supports out-of-core training, which means it can be used in real applications:

enc = NetEncoder["AudioMelSpectrogram"];
file = "ExampleData/rule30.wav";
a1 = Import[file];
(*in-core Audio object*)
a2 = Audio[file];
(*out-of-core Audio object*)
Dimensions /@ enc[{a1, a2}]
(*{{215,40},{215,40}}*)
ByteCount /@ {a1, a2}
(*{161128,424}*)
