I want to combine Audio objects with neural networks. The first thing that came to mind was Classify:
windInstrument = ExampleData[{"Sound", #}] & /@ {"AltoFlute", "AltoSaxophone", "BassClarinet", "BassFlute", "Flute", "FrenchHorn", "Oboe", "SopranoSaxophone", "TenorTrombone", "Trumpet", "Tuba"};
nonwindInstrument = ExampleData[{"Sound", #}] & /@ {"Cello", "CelloPizzicato", "DoubleBass", "DoubleBassPizzicato", "OrganChord", "Viola", "Violin"};
c = Classify[<|"windInstrument" -> windInstrument, "nonwindInstrument" -> nonwindInstrument|>, Method -> "NeuralNetwork"]
ClassifierInformation[c]
c[ExampleData[{"Sound", "Clarinet"}]]
c[ExampleData[{"Sound", "Piano"}]]
It gives the right results.
Options@c shows that it uses Spectrogram, so I tried to reproduce the process myself. NetEncoder does not support Audio, so the audio has to be converted into another form first.
One way is Spectrogram, the other is AudioData.
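Both conversions used in the rest of this post share the same trim, fade, and resample steps, so they can be sketched as small helpers. The names prepare, toSpectrogram, and toSamples are hypothetical and not part of the original code. The Flatten call is also an addition: it assumes AudioData returns one row per channel, and it is harmless if the data is already a flat list.

```wolfram
(* Hypothetical helpers; they only wrap the calls used elsewhere in this post *)

(* Uniform format: first 3 seconds, faded, resampled to 22050 Hz *)
prepare[a_] := AudioResample[AudioFade@AudioTrim[a, 3], 22050];

(* Conversion 1: a spectrogram graphic without a frame *)
toSpectrogram[a_] := Spectrogram[prepare[a], Frame -> None];

(* Conversion 2: mono samples cut into vectors of 1024 values.
   Flatten is an assumption: it turns a 1 x n channel array into a flat list *)
toSamples[a_] := Partition[
   Flatten@AudioData[AudioChannelMix[prepare[a], "Mono"]], 1024];

(* Example: check the shape of the sample-based input *)
Dimensions[toSamples[ExampleData[{"Sound", "Flute"}]]]
```

With helpers like these, training and test inputs are guaranteed to go through the same preprocessing.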
First, I used Spectrogram with a plain DNN (fully connected layers), but it does not predict correctly:
windInstrumentFourier = Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] -> "windInstrument" & /@ windInstrument;
nonwindInstrumentFourier = Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] -> "nonwindInstrument" & /@ nonwindInstrument;
trainingData = Join[windInstrumentFourier, nonwindInstrumentFourier];
All of the spectrograms have the same size:
Union[ImageDimensions /@ trainingData[[All, 1]]] (*{{360, 121}}*)
net = NetChain[{FlattenLayer[], 128, LogisticSigmoid, 128, LogisticSigmoid, 2, SoftmaxLayer[]},
"Input" -> NetEncoder[{"Image", {360, 121}}],
"Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Clarinet"}]]
(*"windInstrument"*)
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Piano"}]]
(*"windInstrument"*)
Using a CNN gives better results, but it is too slow and too complicated compared with Classify:
net = NetChain[{ConvolutionLayer[4, 3], Ramp, PoolingLayer[2, "Stride" -> 2],
ConvolutionLayer[6, 3], Ramp, PoolingLayer[2, "Stride" -> 2],
ConvolutionLayer[8, 3], Ramp, PoolingLayer[2, "Stride" -> 2],
32, Ramp, 2, SoftmaxLayer[]},
"Input" -> NetEncoder[{"Image", {360, 121}}],
"Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Clarinet"}]]
(*windInstrument*)
net[Spectrogram[AudioResample[AudioFade@AudioTrim[#, 3], 22050], Frame -> None] &@ExampleData[{"Sound", "Piano"}]]
(*nonwindInstrument*)
Then I tried using the audio samples directly:
windInstrumentSample = Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] -> "windInstrument" &/@ windInstrument;
nonwindInstrumentSample = Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] -> "nonwindInstrument" &/@ nonwindInstrument;
trainingData = Join[windInstrumentSample, nonwindInstrumentSample];
net = NetChain[{LongShortTermMemoryLayer[512, "Input" -> {"Varying", 1024}],
LongShortTermMemoryLayer[256],
SequenceLastLayer[], 128, LogisticSigmoid, 128,
LogisticSigmoid, 2, SoftmaxLayer[]},
"Output" -> NetDecoder[{"Class", {"windInstrument", "nonwindInstrument"}}]]
net = NetTrain[net, trainingData, MaxTrainingRounds -> 30];
net[Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] &@ExampleData[{"Sound", "Clarinet"}]]
(*windInstrument*)
net[Partition[AudioData[AudioChannelMix[#, "Mono"] &@AudioResample[AudioFade@AudioTrim[#, 3], 22050]], 1024] &@ExampleData[{"Sound", "Piano"}]]
(*nonwindInstrument*)
But sometimes the result is correct and sometimes it is wrong...
So how can I make this better, faster, stronger, simpler, and as stable as Classify for Audio?
PS: As you can see, the Audio objects have different attribute values. Surely they must be transformed into a uniform format, but does that preprocessing affect the result?
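To make the point about differing attributes concrete, here is a minimal sketch that tabulates sample rate, channel count, and duration for a few of the example sounds. The variable names sounds and audios are illustrative only, and the sketch assumes that Audio converts the Sound objects returned by ExampleData.

```wolfram
(* A small subset of the example sounds; any of the names used above would do *)
sounds = ExampleData[{"Sound", #}] & /@ {"Flute", "Cello", "OrganChord"};

(* Assumption: Audio[...] converts a Sound object into an Audio object *)
audios = Audio /@ sounds;

(* How many distinct sample rates and channel counts occur? *)
Tally[AudioSampleRate /@ audios]
Tally[AudioChannels /@ audios]

(* Durations differ too, which is why the recordings are trimmed before training *)
Duration /@ audios
```

If the tallies show more than one value, the trimming, resampling, and mono mixing steps are what bring the recordings into a common format.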





