
Four important computer vision tasks are classification, localization, object detection, and instance segmentation (image taken from the cs224d course):

[Image: examples of classification, localization, object detection, and instance segmentation]

These four tasks are all built on top of deep convolutional neural networks, which allow effective feature extraction from images.

In Mathematica version 11.1, there are large pre-trained networks that can be used as feature extractors for computer vision tasks. So is it possible to build an object detection system that can detect multiple objects using the new neural network framework?

xslittlegrass
  • I think I asked this already – M.R. Apr 01 '17 at 23:43
  • @M.R. I think you asked specifically about R-CNN, which I don't know how to implement. And I'm also very interested to see an answer from the community for your R-CNN question. For this question, I'm going to provide an answer using YOLO. Since YOLO is a regression-based method, it's much easier to implement than classification-based methods like R-CNN, and it runs much faster. Hopefully, this answer will be helpful for people who want to explore more about object detection using the neural network framework in Mathematica, and may eventually lead to an answer to your R-CNN question. – xslittlegrass Apr 01 '17 at 23:57
  • Since posting this, have you figured out how to train such a network in Mma? I have never worked with NNs so I am not sure what applications are reasonable. I am looking to detect the locations of only one type of object in grayscale images. There's quite a bit of variation in the shape of these. Two of them (marked by red) look like this. Do you think the same approach you show here would be reasonable, in some simplified version? – Szabolcs Aug 10 '18 at 12:00
  • @Szabolcs I think for simplicity, it may be worth trying to train a binary classifier to detect the object, then sliding the classifier over patches of the image and combining the results to get the final detection. All of this can be done with the existing NN framework in Mathematica. This is also similar to how the original R-CNN works. The YOLO network here is more about performance than accuracy, at least that's what they emphasized in the original paper. – xslittlegrass Aug 21 '18 at 05:09
  • @xslittlegrass what about training on bounding boxes? – M.R. Mar 15 '19 at 18:14
  • Somewhat related QA https://mathematica.stackexchange.com/questions/172908/fashion-segmentation-with-a-neural-net/189168 – Carl Lange Jul 20 '21 at 18:07

2 Answers


Introduction

An object detection problem can be approached either as a classification problem or as a regression problem. In the classification approach, the image is divided into small patches, each of which is run through a classifier to determine whether it contains an object; bounding boxes are then assigned around the patches that are classified with a high probability of containing an object. In the regression approach, the whole image is run through a convolutional neural network that directly generates one or more bounding boxes for the objects in the image.

[Image: classification-based vs. regression-based approaches to object detection]

In this answer, we will build an object detector using the tiny version of the You Only Look Once (YOLO) approach.

Construct the YOLO network

The tiny YOLO v1 consists of 9 convolution layers and 3 fully connected layers. Each convolution layer consists of convolution, leaky ReLU, and max pooling operations. The first 9 convolution layers can be understood as the feature extractor, whereas the last three fully connected layers can be understood as the "regression head" that predicts the bounding boxes.

There is no native leaky ReLU layer in Mathematica, but one is easily constructed with an ElementwiseLayer:

leakyReLU[alpha_] := ElementwiseLayer[Ramp[#] - alpha*Ramp[-#] &] (* x for x > 0, alpha*x for x < 0 *)
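As a quick sanity check, the layer can be applied directly to numeric input; positive values pass through unchanged and negative values are scaled by alpha:

leakyReLU[0.1][{-1., 2.}]
(* {-0.1, 2.} *)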

With this, the YOLO network can be constructed as

YOLO = NetInitialize@NetChain[{
    ElementwiseLayer[2.*# - 1. &], (* rescale pixel values from [0,1] to [-1,1] *)
    ConvolutionLayer[16, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[32, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[64, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[128, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[256, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[512, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    PoolingLayer[2, "Stride" -> 2],
    ConvolutionLayer[1024, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    ConvolutionLayer[1024, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    ConvolutionLayer[1024, 3, "PaddingSize" -> 1],
    leakyReLU[0.1],
    (* the "regression head": three fully connected layers *)
    FlattenLayer[],
    LinearLayer[256],
    LinearLayer[4096],
    leakyReLU[0.1],
    LinearLayer[1470]
    },
   "Input" -> NetEncoder[{"Image", {448, 448}}]
   ]

[Image: the assembled YOLO NetChain]
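Because the network is wrapped in NetInitialize, it can already be evaluated (with random weights) to confirm that the output has the expected shape; the result should be a vector of length 1470:

Dimensions[YOLO[RandomImage[1, {448, 448}]]]
(* {1470} *)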

Load pre-trained weights

Training the YOLO network is time-consuming, so we will use pre-trained weights instead. The pre-trained weights can be downloaded as a binary file from here (172 MB).

Using NetExtract and NetReplacePart, we can load the pre-trained weights into our model:

modelWeights[net_, data_] :=
 Module[{newnet, as, weightPos, rule, layerIndex, linearIndex},
  (* positions of the layers that carry weights *)
  layerIndex =
   Flatten[Position[NetExtract[net, All], _ConvolutionLayer | _LinearLayer]];
  linearIndex = Flatten[Position[NetExtract[net, All], _LinearLayer]];
  (* for each such layer, the dimensions of its "Biases" and "Weights" arrays;
     darknet stores the biases first, then the flattened weights *)
  as = Flatten[
    Table[{{n, "Biases"} ->
       Dimensions@NetExtract[net, {n, "Biases"}], {n, "Weights"} ->
       Dimensions@NetExtract[net, {n, "Weights"}]}, {n, layerIndex}], 1];
  (* start/end positions of each array inside the flat weight vector *)
  weightPos = # + {1, 0} & /@
    Partition[Prepend[Accumulate[Times @@@ as[[All, 2]]], 0], 2, 1];
  rule = Table[
    as[[n, 1]] ->
     ArrayReshape[Take[data, weightPos[[n]]], as[[n, 2]]], {n, 1, Length@as}];
  newnet = NetReplacePart[net, rule];
  (* darknet stores fully connected weights in transposed order,
     so reshape and transpose the LinearLayer weights *)
  newnet = NetReplacePart[newnet,
    Table[
     {n, "Weights"} ->
      Transpose@
       ArrayReshape[NetExtract[newnet, {n, "Weights"}],
        Reverse@Dimensions[NetExtract[newnet, {n, "Weights"}]]], {n, linearIndex}]];
  newnet
  ]

data = BinaryReadList["yolo-tiny.weights", "Real32"][[5 ;; -1]]; (* skip the 4-value header of the darknet weight file *)
YOLO = modelWeights[YOLO, data];
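As a minimal sanity check (a sketch reusing the same layer positions that modelWeights computes), we can verify that the total number of parameters in the network equals the length of the flat weight vector, i.e. that the file was consumed exactly:

layerIndex = Flatten[Position[NetExtract[YOLO, All], _ConvolutionLayer | _LinearLayer]];
Total[Table[Times @@ Dimensions@NetExtract[YOLO, {n, a}],
   {n, layerIndex}, {a, {"Biases", "Weights"}}], 2] == Length[data]
(* True *)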

Post-processing

The output of this network is a vector of length 1470, which contains the coordinates and confidences of the predicted bounding boxes for the different classes. Tiny YOLO v1 is trained on the PASCAL VOC dataset, which has 20 classes:

labels = {"aeroplane", "bicycle", "bird", "boat", "bottle", "bus", 
   "car", "cat", "chair", "cow", "diningtable", "dog", "horse", 
   "motorbike", "person", "pottedplant", "sheep", "sofa", "train", 
   "tvmonitor"};

The information in the output vector is organized in the following way:

[Image: layout of the 1470-dimensional output vector]

The 1470-dimensional output is divided into three parts: class probabilities, confidences, and box coordinates (49 × 20 = 980 probabilities, 49 × 2 = 98 confidences, and 49 × 2 × 4 = 392 coordinates, so 980 + 98 + 392 = 1470). Each part is further divided into 49 small regions, corresponding to the predictions at the cells of a 7 × 7 grid, and each cell predicts two boxes. In the post-processing step, we take this output vector and keep the boxes whose probability is above a certain threshold; overlapping boxes are then resolved using the non-max suppression method.

coordToBox[center_, boxCord_, scaling_: 1] := Module[{bx, by, w, h},
  (* convert {centerx, centery, width, height} to a Rectangle object
     in unit image coordinates *)
  bx = (center[[1]] + boxCord[[1]])/7.;
  by = (center[[2]] + boxCord[[2]])/7.;
  w = boxCord[[3]]*scaling;
  h = boxCord[[4]]*scaling;
  Rectangle[{bx - w/2, by - h/2}, {bx + w/2, by + h/2}]
  ]
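For example, with made-up numbers: a box predicted at grid cell {3, 3} with center offset {0.5, 0.5} and size {0.2, 0.3} maps to a rectangle in unit image coordinates:

coordToBox[{3, 3}, {0.5, 0.5, 0.2, 0.3}]
(* Rectangle[{0.4, 0.35}, {0.6, 0.65}] *)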

nonMaxSuppression[boxes_, overlapThreshold_, confidThreshold_] :=
 Module[{boxesSorted, boxi, boxj},
  (* non-max suppression to eliminate overlapping boxes:
     within each class, visit boxes in order of decreasing probability *)
  boxesSorted =
   GroupBy[boxes, #class &][All, SortBy[#prob &] /* Reverse];
  Do[
   Do[
    boxi = boxesSorted[[c, n]];
    If[boxi["prob"] != 0,
     Do[
      boxj = boxesSorted[[c, m]];
      (* if two boxes overlap strongly, kill the one with the lower probability *)
      If[RegionMeasure[
          RegionIntersection[boxi["coord"], boxj["coord"]]]/
         RegionMeasure[RegionUnion[boxi["coord"], boxj["coord"]]] >=
        overlapThreshold,
       boxesSorted = ReplacePart[boxesSorted, {c, m, "prob"} -> 0]];
      , {m, n + 1, Length[boxesSorted[[c]]]}]
     ]
    , {n, 1, Length[boxesSorted[[c]]]}],
   {c, 1, Length@boxesSorted}];
  boxesSorted[All, Select[#prob > 0 &]]]
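The overlap criterion above is the intersection-over-union (IoU) ratio of the two rectangles. As a standalone illustration (the helper iou below is just for this check, not part of the pipeline):

iou[r1_, r2_] := RegionMeasure[RegionIntersection[r1, r2]]/RegionMeasure[RegionUnion[r1, r2]]

iou[Rectangle[{0, 0}, {2, 2}], Rectangle[{1, 1}, {3, 3}]]
(* 1/7: the boxes share 1 unit of area out of 7 in total *)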

labelBox[class_ -> box_] := Module[{coord, textCoord},
  (* convert class -> box rules to labeled box graphics *)
  coord = List @@ box;
  textCoord = {(coord[[1, 1]] + coord[[2, 1]])/2.,
    coord[[1, 2]] - 0.04};
  {{GeometricTransformation[
     Text[Style[labels[[class]], 30, Blue], textCoord],
     ReflectionTransform[{0, 1}, textCoord]]},
   EdgeForm[Directive[Red, Thick]], Transparent, box}
  ]

drawBoxes[img_, boxes_] := Module[{labeledBoxes},
  (* draw the labeled boxes on top of the image; the reflection flips
     the y axis so the graphics match the image orientation *)
  labeledBoxes =
   labelBox /@
    Flatten[Thread /@ Normal@Normal@boxes[All, All, "coord"]];
  Graphics[
   GeometricTransformation[{Raster[ImageData[img], {{0, 0}, {1, 1}}],
     labeledBoxes}, ReflectionTransform[{0, 1}, {0, 1/2}]]]
  ]

postProcess[img_, vec_, boxScaling_: 0.7, confidentThreshold_: 0.15,
  overlapThreshold_: 0.4] :=
 Module[{grid, prob, confid, boxCoord, boxes, boxNonMax},
  grid = Flatten[Table[{i, j}, {j, 0, 6}, {i, 0, 6}], 1]; (* the 49 cells of the 7x7 grid *)
  prob = Partition[vec[[1 ;; 980]], 20]; (* 49 x 20 class probabilities *)
  confid = Partition[vec[[980 + 1 ;; 980 + 98]], 2]; (* 49 x 2 box confidences *)
  boxCoord = ArrayReshape[vec[[980 + 98 + 1 ;; -1]], {49, 2, 4}]; (* 49 x 2 x 4 coordinates *)
  boxes = Dataset@Select[Flatten@Table[
       <|"coord" ->
         coordToBox[grid[[i]], boxCoord[[i, b]], boxScaling],
        "class" -> c,
        "prob" -> If[# <= confidentThreshold, 0, #] &@(prob[[i, c]]*confid[[i, b]])|>
       , {c, 1, 20}, {b, 1, 2}, {i, 1, 49}
       ], #prob >= confidentThreshold &];
  boxNonMax =
   nonMaxSuppression[boxes, overlapThreshold, confidentThreshold];
  drawBoxes[Image[img], boxNonMax]
  ]

Results

These are the results for this network.

urls = {"http://i.imgur.com/n2u0N3K.jpg", 
   "http://i.imgur.com/Bpb60U1.jpg", "http://i.imgur.com/CMZ6Qer.jpg",
    "http://i.imgur.com/lnEE8C7.jpg"};

imgs = Import /@ urls

[Image: the four imported test images]

With[{i = ImageResize[#, {448, 448}]}, postProcess[i, YOLO[i]]] & /@ 
  imgs // ImageCollage

[Image: detection results on the four test images]

References

  1. Vehicle detection using YOLO in Keras, https://github.com/xslittlegrass/CarND-Vehicle-Detection
  2. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640 (2015).
  3. J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, arXiv:1612.08242 (2016).
  4. darkflow, https://github.com/thtrieu/darkflow
  5. Darknet.keras, https://github.com/sunshineatnoon/Darknet.keras/
  6. YAD2K, https://github.com/allanzelener/YAD2K
xslittlegrass
  • Great post! Mathematica 11.1 has undocumented functions ImageCases and ImageContainsQ but they don't work. They should do the same (first image). – Alexey Golyshev Apr 02 '17 at 03:52
  • @xslittlegrass thank you for this wonderful post. When I run the last code net is undefined. If I change to YOLO there is a problem with the labels. I seek clarification. – ubpdqn Apr 02 '17 at 08:15
  • @ubpdqn Thanks for catching that. I have added the labels now. – xslittlegrass Apr 02 '17 at 15:08
  • That's a remarkably wool-free, rideable sheep. – Apr 02 '17 at 17:44
  • @xslittlegrass can you possibly update your solution to show how you would train this in mma? – M.R. Apr 03 '17 at 17:28
  • @M.R. This solution uses the pre-trained weights from the original darknet implementation. I don't know how to train this yet, but I'll update this answer when I figure it out. – xslittlegrass Apr 03 '17 at 17:57
  • This is good stuff. – Taliesin Beynon Apr 05 '17 at 17:02
  • There are 7x7=49 grids in total, and each grid is responsible for 2 bounding boxes. So, there should be 49x2 = 98 predictions. As there are 20 categories, the probabilities part should have 98 * 20 = 1960 values. Why it is 980 in the YOLO tiny output? – Purboo May 06 '18 at 12:50
  • Nice !!! Feel free to post on Wolfram Community - we will add it to Staff Picks. – Vitaliy Kaurov Jun 19 '18 at 21:17

Mathematica 11.2 has undocumented functions: ImageContents, ImageCases, ImageBoundingBoxes, ImageContainsQ, ImagePositions.

First of all, we should download the network. The functions should do it automatically but they do not.

NetModel["YOLO V2 Trained on MS COCO Data"]

img = Import["https://i.stack.imgur.com/440g2.jpg"]

[Image: the imported test photo]

ImageContents[img]

[Image: ImageContents output]

ImageCases[img, Entity["Concept", "Zebra::nx5qr"]]

[Image: ImageCases output]

ImageBoundingBoxes[img, Entity["Concept", "Zebra::nx5qr"]]

[Image: ImageBoundingBoxes output]
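To overlay the returned boxes on the photo, something like the following should work; this is a sketch assuming ImageBoundingBoxes returns Rectangle specifications in the image coordinate system that HighlightImage expects:

HighlightImage[img, ImageBoundingBoxes[img, Entity["Concept", "Zebra::nx5qr"]]]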

ImageContainsQ[img, Entity["Concept", "Zebra::nx5qr"]]

False

False?

The list of all available objects is not very big, including only 80 entities. It can be found at:

C:\Program Files\Wolfram Research\Mathematica\11.2\SystemFiles\Components\NeuralFunctions\Resources\ObjectDetection

{Entity["Concept", "Person::93r37"], Entity["Concept", "Bicycle::26wj9"], 
 Entity["Concept", "Auto::p735c"], Entity["Concept", "Motorcycle::4gd85"], 
 Entity["Concept", "Aeroplane::239dv"], Entity["Concept", "Autobus::v5t2j"], 
 Entity["Concept", "RailroadTrain::y7363"], 
 Entity["Concept", "Motortruck::9c7br"], Entity["Concept", "Boat::fx4n4"], 
 Entity["Concept", "TrafficLight::b4966"], 
 Entity["Concept", "FireHydrant::x7gzw"], Entity["Concept", "Stop::33h2c"], 
 Entity["Concept", "ParkingMeter::h445p"], Entity["Concept", "Bench::995ph"], 
 Entity["Concept", "Bird::56ny3"], Entity["Concept", "DomesticCat::jpx55"], 
 Entity["Concept", "CanisFamiliaris::597qc"], 
 Entity["Concept", "EquusCaballus::x93n2"], 
 Entity["Concept", "Sheep::9t384"], Entity["Concept", "Cow::3cf9b"], 
 Entity["Concept", "Elephant::72b52"], Entity["Concept", "Bear::s7w85"], 
 Entity["Concept", "Zebra::nx5qr"], Entity["Concept", "Camelopard::6b787"], 
 Entity["Concept", "BackPack::2cjsp"], Entity["Concept", "Umbrella::3mntq"], 
 Entity["Concept", "Handbag::76k33"], Entity["Concept", "Necktie::67p5x"], 
 Entity["Concept", "Suitcase::vzm4n"], Entity["Concept", "Frisbee::9h9t9"], 
 Entity["Concept", "Ski::5nbj2"], Entity["Concept", "Snowboard::25v3y"], 
 Entity["Concept", "Ball::279y7"], Entity["Concept", "Kite::789p6"], 
 Entity["Concept", "BaseballBat::2n8qr"], 
 Entity["Concept", "BaseballGlove::6c78v"], 
 Entity["Concept", "Skateboard::wvjx6"], Entity["Concept", 
  "Surfboard::367w8"], Entity["Concept", "TennisRacket::zbf7z"], 
 Entity["Concept", "Bottle::998tj"], Entity["Concept", 
  "DrinkingGlass::339c3"], Entity["Concept", "Cup::w4d7b"], 
 Entity["Concept", "Fork::9hxp6"], Entity["Concept", "Knife::9529v"], 
 Entity["Concept", "Spoon::4v83p"], Entity["Concept", "Bowl::8mv4z"], 
 Entity["Concept", "Banana::f6z73"], Entity["Concept", "Apple::857h7"], 
 Entity["Concept", "Sandwich::q3j9z"], Entity["Concept", "Orange::w579d"], 
 Entity["Concept", "BrassicaOleraceaItalica::56n93"], 
 Entity["Concept", "Carrot::7s555"], Entity["Concept", "RedHot::3j848"], 
 Entity["Concept", "Pizza::56w88"], Entity["Concept", "Donut::332td"], 
 Entity["Concept", "Cake::pq6mk"], Entity["Concept", "Chair::ptj48"], 
 Entity["Concept", "Sofa::kfh45"], Entity["Concept", "PotPlant::f84td"], 
 Entity["Concept", "Bed::3924t"], Entity["Concept", "DiningTable::6ypqf"], 
 Entity["Concept", "FlushToilet::c8456"], 
 Entity["Concept", "BoobTube::b9272"], Entity["Concept", "Laptop::zdd33"], 
 Entity["Concept", "Mouse::5wd97"], Entity["Concept", "Remote::h4drx"], 
 Entity["Concept", "Keyboard::yzdk7"], Entity["Concept", "Cellphone::5k4s4"], 
 Entity["Concept", "MicrowaveOven::fs7tb"], Entity["Concept", "Oven::8t665"], 
 Entity["Concept", "Toaster::rp6v2"], Entity["Concept", "Sink::4xnh2"], 
 Entity["Concept", "ElectricRefrigerator::t4bt7"], 
 Entity["Concept", "Book::t8bc6"], Entity["Concept", "Clock::jq868"], 
 Entity["Concept", "Vase::4x594"], Entity["Concept", 
  "PairOfScissors::3gyct"], Entity["Concept", "TeddyBear::f56q9"], 
 Entity["Concept", "BlowDrier::t4dpz"], Entity["Concept", 
  "Toothbrush::7p83q"]}
Szabolcs
Alexey Golyshev