
I would like to train a network that can parse clothing:

[example image of parsed clothing]

Does anyone know of a fashion segmentation network with or without pre-trained weights available that I can use in Mathematica?

Update:

I haven't been able to find a pre-trained network that I might port into Mathematica via MXNetLink, so I've decided to train one.

Here is a link to the Fashionista dataset of pixel-level labels for clothing. The first issue I've encountered is blocking my progress: I can't import the .mat file:

[screenshot of the failed .mat Import]

If someone can help me parse this dataset, I'd like to train a semantic segmentation net like this example from Wolfram Community. Or perhaps there is another source of training data that is accessible from Mathematica?


M.R.

1 Answer


Yes, but that particular dataset is not especially great.

Instead of the Fashionista dataset, let's use this one: https://github.com/bearpaw/clothing-co-parsing. The Fashionista dataset is too deep inside the MATLAB walled garden (I am living in a glass house and throwing stones).

This dataset is much more usable: 1004 of the images have pixel-level masks (MATLAB matrices, but ones we can read). They look like this:

[example pixel-level mask]

Let's load the images and masks.

(* only the first 1004 photos have pixel-level annotations *)
images = (File /@ 
     FileNames["~/Downloads/clothing-co-parsing-master/photos/*"])[[;; 
      1004]];

(* Import does not accept wildcards, so import each .mat mask file separately *)
masks = Import /@ 
   FileNames[
    "~/Downloads/clothing-co-parsing-master/annotations/pixel-level/*"];
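
Each .mat file imports as a one-element list wrapping the label matrix, which is why First is applied in the resampling step below. A quick sanity check (just a sketch):

(* each imported mask should be {matrix}; Dimensions gives {1, rows, cols} *)
Dimensions /@ Take[masks, 3]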

Now we'll resize the masks to something much smaller (my poor laptop isn't beefy enough for larger image dimensions). Input images will get resized by the network's NetEncoder. We add 1 to the masks so that the labels lie in the same range as the network output (classes 1 through 59).

dims = {104, 157};
(* take the label matrix out of its wrapping list, resample it to
   Reverse@dims = {157, 104} (rows x columns), and add 1 to shift the
   labels from 0..58 to 1..59 *)
amasks = Round[
    ArrayResample[First[#], Reverse@dims, 
     Resampling -> "NearestRight"]] + 1 & /@ masks;
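
As a quick check that the shift worked, every resampled label should now lie in the range 1 through 59 (a minimal sketch over the amasks just computed):

(* expect a minimum of 1 and a maximum of at most 59 *)
MinMax[Flatten[amasks]]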

So we can generate image->mask data for the neural net like so:

data = Thread[images -> amasks];
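
To eyeball a training pair before committing to training, one can colorize a mask next to its photo; a minimal sketch, assuming the images (File objects) and amasks from above:

(* show the first photo next to its per-class colorized mask *)
GraphicsRow[{Import[images[[1]]], Colorize[amasks[[1]]]}]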

Now we'll build an extremely quick-and-dirty pixel-level segmentation network as a proof-of-concept.

This network is of my own design, which means it's probably terrible. However, it does go fast. We chop off the last few layers of SqueezeNet and attach a few skip connections. This network does a decent job of binary prediction - let's see how it does with 59-class prediction.

A quick improvement to this network is to use more than just the first few layers of SqueezeNet. However, I don't want to watch my computer train a network slowly forever, so I have settled for about the first half of SqueezeNet.

squeeze = 
  NetModel["SqueezeNet V1.1 Trained on ImageNet Competition Data"];

tnet = NetGraph[
  Join[
   Normal@NetFlatten[
     NetTake[
      NetReplacePart[squeeze, 
       "Input" -> NetEncoder[{"Image", dims, "ColorSpace" -> "RGB"}]], 
      "fire5"]],
   <|
    "f2u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[2], Scaled[2]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1], ElementwiseLayer["ELU"]},
    "f3u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[2], Scaled[2]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1], ElementwiseLayer["ELU"]},
    "f4u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[4], Scaled[4]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1, "PaddingSize" -> 1], 
      ElementwiseLayer["ELU"]},
    "f5u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[4], Scaled[4]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1, "PaddingSize" -> 1], 
      ElementwiseLayer["ELU"]},
    "cat" -> CatenateLayer[],
    "drop" -> DropoutLayer[],
    "sig" -> ElementwiseLayer["ReLU"],
    "con" -> {ResizeLayer[Reverse@dims], ConvolutionLayer[59, 1], 
      TransposeLayer[{1 <-> 3, 1 <-> 2}], SoftmaxLayer[]}
    |>],
  {
   (* chain the retained SqueezeNet layers in their original order *)
   Fold[#2 -> #1 &, 
    Reverse@Keys@Normal@NetFlatten[NetTake[squeeze, "fire5"]]],
   (* skip connections from each fire module to its upsampling branch *)
   "fire2" -> "f2u", "fire3" -> "f3u", "fire4" -> "f4u", 
   "fire5" -> "f5u",
   {"f2u", "f3u", "f4u", "f5u"} -> "cat" -> "drop" -> "sig" -> "con"
   },
  "Input" -> NetEncoder[{"Image", dims, "ColorSpace" -> "RGB"}],
  "Output" -> NetDecoder[{"Class", Range[59], "InputDepth" -> 3}]]
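
One caveat: the edges above refer to layer names like "fire2" inside the flattened SqueezeNet, and a newer NetModel definition may name the vertices differently (see the comments below). You can list the names that the edge specifications must match like this:

(* the vertex names available to the NetGraph edge specifications *)
Keys@Normal@NetFlatten[NetTake[squeeze, "fire5"]]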

Now we can train this network. I have frozen the SqueezeNet weights, but that's not necessary (and unfreezing them will likely result in a more accurate network).

(* hold out the last 30 examples for validation; the zero learning-rate
   multipliers freeze the retained SqueezeNet layers *)
net = NetTrain[tnet, data[[;; -31]], ValidationSet -> data[[-30 ;;]], 
  LearningRateMultipliers -> {"conv1" -> 0, "fire2" -> 0, 
    "fire3" -> 0, "fire4" -> 0}]

I trained this for a short time (5 rounds); let's see what it looks like:

[test output: network prediction alongside the original data mask]
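
A comparison like the one above can be generated with something like the following sketch (assuming the trained net, images, and amasks from above; the "Class" decoder returns a matrix of per-pixel labels):

(* predict labels for a held-out validation image and compare to ground truth *)
pred = net[images[[-1]]];
GraphicsRow[{Colorize[pred], Colorize[amasks[[-1]]]}]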

That's not bad at all for such a short training time, especially because there are many improvements you could make to this extremely off-the-cuff network.

Good luck!

Carl Lange
  • For your "exercise", are you just taking the decoded image and finding the nearest color for each label per pixel? – M.R. Jan 09 '19 at 22:27
  • Or are you combining the network's output with the input image and growing the regions with some sort of clustering of components? What's the method? – M.R. Jan 09 '19 at 22:37
  • Yes, probably the first one (I would say it's significantly easier if you train for much longer with that network). The better method is to figure out how to train a pixel-level class predictor network. I'll work on it when I get the chance. – Carl Lange Jan 09 '19 at 22:49
  • Is the last rule of images you show the output -> decoded output or are you just showing the mask from the file? – M.R. Jan 09 '19 at 23:10
  • I'm showing output->original data mask (I'm showing how poorly this network did - it was more an illustration than a working example) – Carl Lange Jan 09 '19 at 23:22
  • Oh! I thought you were deriving it lol – M.R. Jan 09 '19 at 23:29
  • I've now figured out class prediction, update to my answer incoming – Carl Lange Jan 09 '19 at 23:37
  • @M.R. Just a note that I've updated my answer with class-based prediction. – Carl Lange Jan 10 '19 at 01:24
  • Nice update, however, now I get this: NetTrain::invencin: Invalid input, input should be a list of dimensions {104,157}. – M.R. Jan 10 '19 at 01:51
  • Sorry, reverse the dims - dims = {104, 157}. – Carl Lange Jan 10 '19 at 02:00
  • Hi, did you test this in Mathematica 12.1? I can't run your code: after the NetGraph step I get NetGraph::invnetvert: The vertex fire2, specified in the edge fire2->f2u, does not exist. – HyperGroups Aug 04 '20 at 06:39
  • Hmm, unfortunately I wrote this in 11.3. I will give it a look in the next day or two. It's possible that the Squeezenet definition has changed and you might need to modify it based on a different naming convention or something. – Carl Lange Aug 04 '20 at 09:46