How do we do object detection? A naive idea is:
- taking different regions of interest from the image
- using a CNN to classify the presence of the object in that region
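The problem is that objects can appear at any location and with any size or aspect ratio, so this approach has to classify a huge number of regions. Algorithms such as R-CNN and YOLO were therefore developed to find these occurrences, and to find them fast.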
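To get a feel for why the naive idea is expensive, here is a minimal sketch that enumerates fixed-size windows on a regular grid. The function names, window sizes and stride are hypothetical and chosen only for illustration; they do not come from any of the algorithms below.

```python
def sliding_windows(img_h, img_w, sizes, stride):
    """Yield (x, y, w, h) for every window of each size placed on a regular grid."""
    for w, h in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)


def count_windows(img_h, img_w, sizes, stride):
    """Count how many candidate regions the naive approach would have to classify."""
    return sum(1 for _ in sliding_windows(img_h, img_w, sizes, stride))


# Example call (sizes and stride are arbitrary assumptions):
n = count_windows(480, 640, sizes=[(64, 64), (128, 128), (256, 256)], stride=8)
```

Every additional window size or aspect ratio multiplies the count, and each window would need its own CNN forward pass.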
R-CNN
Keywords: selective search; region proposals;
Procedures
- Region proposals: use the selective search algorithm to extract only about 2000 regions from the image (i.e. the region proposals).
- Feature vectors: warp each region proposal into a square and feed it into a CNN, producing a 4096-dimensional feature vector.
- Classify: feed each feature vector into an SVM to classify the presence of the object within the candidate region proposal.
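The warping step can be sketched as a crop followed by a nearest-neighbour resize. This is only an illustration under stated assumptions: the image is a plain 2D list, `warp_region` and `out_size` are hypothetical names, and the real pipeline would use a proper image library and the input size expected by its CNN.

```python
def warp_region(image, box, out_size):
    """Crop box=(x, y, w, h) from a 2D list and resize it to out_size x out_size.

    Nearest-neighbour sampling is used, and the aspect ratio is deliberately
    not preserved, which is what "warping into a square" means here.
    """
    x, y, w, h = box
    warped = []
    for i in range(out_size):
        src_y = y + (i * h) // out_size  # source row for output row i
        row = []
        for j in range(out_size):
            src_x = x + (j * w) // out_size  # source column for output column j
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


# Toy 4x6 "image" whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = warp_region(image, box=(2, 1, 4, 2), out_size=2)
```

Each of the roughly 2000 warped patches is then passed through the CNN separately, which is where most of the cost comes from.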
Advantages
- Bypasses the problem of selecting a huge number of regions.
Disadvantages
- Huge training time: the network has to classify 2000 region proposals per image.
- Not real-time: it takes around 47 seconds for each test image.
- No learning: selective search is a fixed algorithm, so no learning happens at the region proposal stage.
Fast R-CNN
Keywords: feature map; RoI pooling layer;
Procedures
- Feature map: instead of feeding the region proposals to the CNN, we feed the input image to the CNN to generate a convolutional feature map.
- Region proposals: using selective search to get region proposals.
- RoI pooling layer: from the convolutional feature map, we identify the regions of the proposals and warp them into squares; an RoI pooling layer then reshapes them into a fixed size so that they can be fed into a fully connected layer. From the RoI feature vector, a softmax layer predicts the class of the proposed region, and the network also predicts the offset values for the bounding box.
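RoI pooling can be sketched as max pooling over a grid of bins laid on top of the proposed region. The version below is a minimal single-channel illustration on nested lists; the function names and the convention that the RoI is given as `(x0, y0, x1, y1)` in feature-map cells with exclusive ends are assumptions made for this sketch.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi=(x0, y0, x1, y1) into a fixed out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin boundaries: floor for the start, ceil for the end, so no bin is empty.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x4 feature map with values 0..15 in row-major order.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 4, 4), 2, 2))  # [[10, 11], [14, 15]]
```

Regions of different sizes both come out as 2x2 grids, which is what allows a fully connected layer with a fixed input size to follow.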
Advantages
- Fast R-CNN is faster than R-CNN because the convolution operation is done only once per image and a feature map is generated from it.
Disadvantages
- Region proposals become bottlenecks in Fast R-CNN, affecting its performance.
- Selective search is a slow and time-consuming process.
Faster R-CNN
Keywords: region proposal network;
Procedures
- Feature map: similar to Fast R-CNN, the image is provided as an input to a convolutional network which provides a convolutional feature map.
- Region proposals: a separate network is used to predict the region proposals, instead of using selective search.
- RoI pooling layer: the predicted region proposals are then reshaped using an RoI pooling layer, and the result is used to classify the image within the proposed region and predict the offset values for the bounding boxes.
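The notes above do not say what form the offset values take. A common parameterization in the R-CNN family shifts the box centre by a fraction of the box size and scales the width and height exponentially; the sketch below assumes that form, and `apply_offsets` is a hypothetical helper name.

```python
import math


def apply_offsets(box, offsets):
    """Refine box=(x, y, w, h) (top-left corner plus size) with (dx, dy, dw, dh).

    Assumed parameterization: the centre moves by dx * w and dy * h,
    and the size is scaled by exp(dw) and exp(dh).
    """
    x, y, w, h = box
    dx, dy, dw, dh = offsets
    cx, cy = x + w / 2, y + h / 2
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    # Convert back from centre form to top-left form.
    return (new_cx - new_w / 2, new_cy - new_h / 2, new_w, new_h)


unchanged = apply_offsets((10, 20, 40, 20), (0, 0, 0, 0))
shifted = apply_offsets((0, 0, 10, 10), (0.5, 0, 0, 0))
```

Zero offsets leave the proposal unchanged, so the network only has to learn small corrections to proposals that are already roughly right.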
Advantages
- Faster than Fast R-CNN, because region proposals no longer depend on the slow selective search step.
Disadvantages
- Still a two-stage method based on region proposals.
YOLO
Keywords: Split image; BBox probability; Spatial constraints;
Procedures
- Split the image: we take an image and split it into an SxS grid; within each grid cell we take m bounding boxes.
- BBox probability: for each bounding box, the network outputs a class probability and offset values for the bounding box.
- Locate objects: the bounding boxes whose class probability is above a threshold value are selected and used to locate the object within the image.
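The grid split and the thresholding step can be sketched as follows. This is a minimal illustration, not the actual YOLO decoding: the function names, the prediction format (a box paired with a dict of class probabilities) and the example numbers are all assumptions.

```python
def grid_cell(cx, cy, img_w, img_h, s):
    """Return the (row, col) of the SxS grid cell containing the point (cx, cy)."""
    col = min(s - 1, int(cx * s / img_w))  # clamp points on the right/bottom edge
    row = min(s - 1, int(cy * s / img_h))
    return row, col


def select_detections(predictions, threshold):
    """Keep boxes whose best class probability is above the threshold.

    predictions: list of (box, class_probs), with class_probs a dict label -> probability.
    Returns a list of (box, label, probability).
    """
    kept = []
    for box, class_probs in predictions:
        label, prob = max(class_probs.items(), key=lambda kv: kv[1])
        if prob > threshold:
            kept.append((box, label, prob))
    return kept


preds = [
    ((0, 0, 10, 10), {"bird": 0.9, "cat": 0.05}),
    ((5, 5, 20, 20), {"bird": 0.2, "cat": 0.3}),
]
print(select_detections(preds, threshold=0.5))  # [((0, 0, 10, 10), 'bird', 0.9)]
print(grid_cell(100, 50, 448, 448, 7))          # (0, 1)
```

Because every box belongs to one grid cell and each cell only holds m boxes, the number of predictions is fixed in advance, which is what makes a single forward pass sufficient.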
Advantages
- Orders of magnitude faster (45 frames per second) than the R-CNN family.
Disadvantages
- It struggles with small objects within the image, such as a flock of birds, due to the spatial constraints of the algorithm.
Note: The following content comes from Towards Data Science.