
I would like to train a network that can parse clothing:

[example image of parsed clothing]

Does anyone know of a fashion segmentation network with or without pre-trained weights available that I can use in Mathematica?

Update:

I haven't been able to find a pre-trained network that I might port into Mathematica via MXNetLink, so I've decided to train one.

Here is a link to the Fashionista dataset of pixel-level labels for clothing. The first issue I've encountered is blocking my progress: I can't import the .mat file:

[screenshot of the failed .mat Import]

If someone can help me parse this dataset, I'd like to train a semantic segmentation net like this example from Wolfram Community. Or perhaps there is another source of training data that is accessible from Mathematica?


M.R.

1 Answer


Yes, but that particular dataset is not especially great.

Instead of the Fashionista dataset, let's use this one: https://github.com/bearpaw/clothing-co-parsing. The Fashionista dataset is too deep inside the MATLAB walled garden (I am living in a glass house and throwing stones).

This dataset is much more usable: 1004 of the images have pixel-level masks (MATLAB matrices, but ones we can read). They look like this:

[example pixel-level mask]

Let's load the images and masks.

(* only the first 1004 photos have pixel-level annotations *)
images = (File /@ 
     FileNames["~/Downloads/clothing-co-parsing-master/photos/*"])[[;; 
      1004]];

(* Import does not accept wildcards, so import each .mat mask file separately *)
masks = Import /@ 
   FileNames[
    "~/Downloads/clothing-co-parsing-master/annotations/pixel-level/*"];
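
Each .mat file imports as a one-element list wrapping the label matrix, which is why First is applied in the resampling step below. A quick sanity check (just a sketch):

(* each imported mask should be {matrix}; Dimensions gives {1, rows, cols} *)
Dimensions /@ Take[masks, 3]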

Now we'll resize the masks to something much smaller (my poor laptop isn't beefy enough for larger image dimensions). Input images will get resized by the network's NetEncoder. We add 1 to the masks so that the labels lie in the same range as the network output (classes 1 through 59).

dims = {104, 157};
(* take the label matrix out of its wrapping list, resample it to
   Reverse@dims = {157, 104} (rows x columns), and add 1 to shift the
   labels from 0..58 to 1..59 *)
amasks = Round[
    ArrayResample[First[#], Reverse@dims, 
     Resampling -> "NearestRight"]] + 1 & /@ masks;
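
As a quick check that the shift worked, every resampled label should now lie in the range 1 through 59 (a minimal sketch over the amasks just computed):

(* expect a minimum of 1 and a maximum of at most 59 *)
MinMax[Flatten[amasks]]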

So we can generate image->mask data for the neural net like so:

data = Thread[images -> amasks];
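
To eyeball a training pair before committing to training, one can colorize a mask next to its photo; a minimal sketch, assuming the images (File objects) and amasks from above:

(* show the first photo next to its per-class colorized mask *)
GraphicsRow[{Import[images[[1]]], Colorize[amasks[[1]]]}]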

Now we'll build an extremely quick-and-dirty pixel-level segmentation network as a proof-of-concept.

This network is of my own design, which means it's probably terrible. However, it does go fast. We chop off the last few layers of SqueezeNet and attach a few skip connections. This network does a decent job of binary prediction - let's see how it does with 59-class prediction.

A quick improvement to this network is to use more than just the first few layers of SqueezeNet. However, I don't want to watch my computer train a network slowly forever, so I have settled for about the first half of SqueezeNet.

squeeze = 
  NetModel["SqueezeNet V1.1 Trained on ImageNet Competition Data"];

tnet = NetGraph[
  Join[
   Normal@NetFlatten[
     NetTake[
      NetReplacePart[squeeze, 
       "Input" -> NetEncoder[{"Image", dims, "ColorSpace" -> "RGB"}]], 
      "fire5"]],
   <|
    "f2u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[2], Scaled[2]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1], ElementwiseLayer["ELU"]},
    "f3u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[2], Scaled[2]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1], ElementwiseLayer["ELU"]},
    "f4u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[4], Scaled[4]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1, "PaddingSize" -> 1], 
      ElementwiseLayer["ELU"]},
    "f5u" -> {BatchNormalizationLayer[], 
      ResizeLayer[{Scaled[4], Scaled[4]}, "Resampling" -> "Nearest"], 
      ConvolutionLayer[59, 1, "PaddingSize" -> 1], 
      ElementwiseLayer["ELU"]},
    "cat" -> CatenateLayer[],
    "drop" -> DropoutLayer[],
    "sig" -> ElementwiseLayer["ReLU"],
    "con" -> {ResizeLayer[Reverse@dims], ConvolutionLayer[59, 1], 
      TransposeLayer[{1 <-> 3, 1 <-> 2}], SoftmaxLayer[]}
    |>],
  {
   (* chain the retained SqueezeNet layers in their original order *)
   Fold[#2 -> #1 &, 
    Reverse@Keys@Normal@NetFlatten[NetTake[squeeze, "fire5"]]],
   (* skip connections from each fire module to its upsampling branch *)
   "fire2" -> "f2u", "fire3" -> "f3u", "fire4" -> "f4u", 
   "fire5" -> "f5u",
   {"f2u", "f3u", "f4u", "f5u"} -> "cat" -> "drop" -> "sig" -> "con"
   },
  "Input" -> NetEncoder[{"Image", dims, "ColorSpace" -> "RGB"}],
  "Output" -> NetDecoder[{"Class", Range[59], "InputDepth" -> 3}]]
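
One caveat: the edges above refer to layer names like "fire2" inside the flattened SqueezeNet, and a newer NetModel definition may name the vertices differently (see the comments below). You can list the names that the edge specifications must match like this:

(* the vertex names available to the NetGraph edge specifications *)
Keys@Normal@NetFlatten[NetTake[squeeze, "fire5"]]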

Now we can train this network. I have frozen the SqueezeNet weights, but that's not necessary (and unfreezing them will likely result in a more accurate network).

(* hold out the last 30 examples for validation; the zero learning-rate
   multipliers freeze the retained SqueezeNet layers *)
net = NetTrain[tnet, data[[;; -31]], ValidationSet -> data[[-30 ;;]], 
  LearningRateMultipliers -> {"conv1" -> 0, "fire2" -> 0, 
    "fire3" -> 0, "fire4" -> 0}]

I trained this for a short time (5 rounds); let's see what it looks like:

[test output: network prediction alongside the original data mask]
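
A comparison like the one above can be generated with something like the following sketch (assuming the trained net, images, and amasks from above; the "Class" decoder returns a matrix of per-pixel labels):

(* predict labels for a held-out validation image and compare to ground truth *)
pred = net[images[[-1]]];
GraphicsRow[{Colorize[pred], Colorize[amasks[[-1]]]}]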

That's not bad at all for such a short training time, especially because there are many improvements you could make to this extremely off-the-cuff network.

Good luck!

Carl Lange
  • For your "exercise", are you just taking the decoded image and finding the nearest color for each label per pixel? – M.R. Jan 09 '19 at 22:27
  • Or are you combining the network's output with the input image and growing the regions with some sort of clustering of components? What's the method? – M.R. Jan 09 '19 at 22:37
  • Yes, probably the first one (I would say it's significantly easier if you train for much longer with that network). The better method is to figure out how to train a pixel-level class predictor network. I'll work on it when I get the chance. – Carl Lange Jan 09 '19 at 22:49
  • Is the last rule of images you show the output -> decoded output or are you just showing the mask from the file? – M.R. Jan 09 '19 at 23:10
  • I'm showing output->original data mask (I'm showing how poorly this network did - it was more an illustration than a working example) – Carl Lange Jan 09 '19 at 23:22
  • Oh! I thought you were deriving it lol – M.R. Jan 09 '19 at 23:29
  • I've now figured out class prediction, update to my answer incoming – Carl Lange Jan 09 '19 at 23:37
  • @M.R. Just a note that I've updated my answer with class-based prediction. – Carl Lange Jan 10 '19 at 01:24
  • Nice update, however, now I get this: NetTrain::invencin: Invalid input, input should be a list of dimensions {104,157}. – M.R. Jan 10 '19 at 01:51
  • Sorry, reverse the dims - dims = {104, 157}. – Carl Lange Jan 10 '19 at 02:00
  • Hi, did you test this in Mathematica 12.1? I can't run your code: after the NetGraph step I get NetGraph::invnetvert: The vertex fire2, specified in the edge fire2->f2u, does not exist. – HyperGroups Aug 04 '20 at 06:39
  • Hmm, unfortunately I wrote this in 11.3. I will give it a look in the next day or two. It's possible that the Squeezenet definition has changed and you might need to modify it based on a different naming convention or something. – Carl Lange Aug 04 '20 at 09:46