

Author: B. Sandhya Reddy

The YOLO (You Only Look Once) framework takes a different approach to object detection. It takes the entire image in a single pass and predicts the bounding box coordinates and class probabilities for those boxes. The biggest advantage of YOLO is its speed: it is incredibly fast and can process 45 frames per second. YOLO also learns generalized object representations. It is one of the best algorithms for object detection and has shown performance comparable to the R-CNN algorithms.


  • YOLO first takes an input image:


  • The framework then divides the input image into grids (say a 3 X 3 grid):


Image classification and localization are applied to each grid cell. YOLO then predicts the bounding boxes and their corresponding class probabilities for objects (if any are found). Suppose we have divided the image into a 3 X 3 grid and there are 3 classes into which we want to classify the objects: Pedestrian, Car, and Motorcycle. Then, for each grid cell, the label y will be an eight-dimensional vector, y = (pc, bx, by, bh, bw, c1, c2, c3), where:



  • pc defines whether an object is present in the grid or not (it is the probability)

  • bx, by, bh, bw specify the bounding box if there is an object

  • c1, c2, c3 represent the classes. So, if the object is a car, c2 will be 1 and c1 & c3 will be 0, and so on.

Let’s say we select the first grid from the above example:


Since there is no object in this grid, pc will be zero and the y label for this grid will be:


It does not matter what bx, by, bh, bw, c1, c2, and c3 contain, as there is no object in the grid. Let’s take another grid in which we have a car (c2 = 1):


Before writing the y label for this grid, it helps to see how YOLO decides whether there actually is an object in a grid. In the image, there are two objects (two cars). YOLO takes the midpoint of each object and assigns the object to the grid that contains that midpoint. The y label for the centre-left grid with the car will be:


Since there is an object in this grid, pc will be equal to 1. bx, by, bh, bw will be calculated relative to the particular grid cell we are dealing with. Since car is the second class, c2 = 1 and c1 and c3 = 0. So, for each of the 9 grids, we will have an eight-dimensional output vector, and the full output will have a shape of 3 X 3 X 8. We now have an input image and its corresponding target vector. Using the above example (input image: 100 X 100 X 3, output: 3 X 3 X 8), our model will be trained as follows:
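To make the 3 X 3 X 8 target concrete, here is a minimal NumPy sketch. The helper name `make_cell_label`, the box values, and the choice to fill the "don't care" entries of empty cells with zeros are all assumptions made for illustration, not part of the original walkthrough.

```python
import numpy as np

GRID = 3          # 3 X 3 grid, as in the example above
NUM_CLASSES = 3   # Pedestrian, Car, Motorcycle

def make_cell_label(box=None, class_index=None):
    """Return [pc, bx, by, bh, bw, c1, c2, c3] for one grid cell."""
    label = np.zeros(5 + NUM_CLASSES, dtype=np.float32)
    if box is not None:
        label[0] = 1.0                 # pc: an object is present
        label[1:5] = box               # bx, by, bh, bw (relative to the cell)
        label[5 + class_index] = 1.0   # one-hot class entry
    return label

# Hypothetical image: one car in the centre-left cell, one in the centre-right.
# Empty cells keep pc = 0; their other entries are zero-filled by assumption.
target = np.zeros((GRID, GRID, 5 + NUM_CLASSES), dtype=np.float32)
target[1, 0] = make_cell_label(box=(0.5, 0.5, 0.8, 0.6), class_index=1)
target[1, 2] = make_cell_label(box=(0.4, 0.3, 0.9, 0.5), class_index=1)

print(target.shape)   # (3, 3, 8)
print(target[0, 0])   # an empty cell
print(target[1, 2])   # a cell containing a car
```

Rows index the grid from top to bottom and columns from left to right, so `target[1, 0]` is the centre-left cell.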



The bx, by, bh, and bw are calculated relative to the grid cell we are dealing with. Consider the center-right grid which contains a car:


So, bx, by, bh, and bw will be calculated relative to this grid only. The y label for this grid will be:


pc = 1 since there is an object in this grid, and since it is a car, c2 = 1. In YOLO, the coordinates assigned to all the grids (bx, by, bh, and bw) are:


The bx, by are the x and y coordinates of the midpoint of the object with respect to this grid. In this case, it will be (around) bx = 0.4 and by = 0.3:


Now, bh is the ratio of the height of the bounding box (red box in the above example) to the height of the corresponding grid cell, which in our case is around 0.9. So, bh = 0.9. bw is the ratio of the width of the bounding box to the width of the grid cell. So, bw = 0.5 (approximately). The y label for this grid will be:


bx and by will always range between 0 and 1, as the midpoint always lies within the grid cell, whereas bh and bw can be more than 1 when the bounding box is larger than the grid cell.
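The relative encoding described above can be sketched in a few lines of Python. The function name `encode_box`, the pixel-space input format (x_min, y_min, x_max, y_max), and the convention that each cell's origin is its top-left corner are assumptions for this example; the box and image size are hypothetical values chosen to reproduce the numbers in the text.

```python
def encode_box(x_min, y_min, x_max, y_max, img_w, img_h, grid=3):
    """Map a pixel-space box to (row, col, bx, by, bh, bw).

    The box is assigned to the grid cell that contains its midpoint.
    """
    cell_w = img_w / grid
    cell_h = img_h / grid
    mid_x = (x_min + x_max) / 2.0
    mid_y = (y_min + y_max) / 2.0
    # Cell containing the midpoint (clamped so the image edge stays in range)
    col = min(int(mid_x // cell_w), grid - 1)
    row = min(int(mid_y // cell_h), grid - 1)
    # Midpoint relative to the cell's top-left corner, in units of cell size
    bx = (mid_x - col * cell_w) / cell_w
    by = (mid_y - row * cell_h) / cell_h
    # Box size as a ratio of cell size; this can exceed 1 for large objects
    bh = (y_max - y_min) / cell_h
    bw = (x_max - x_min) / cell_w
    return row, col, bx, by, bh, bw

# Hypothetical car on a 300 X 300 image, landing in the centre-right cell
print(encode_box(215, 85, 265, 175, 300, 300))  # (1, 2, 0.4, 0.3, 0.9, 0.5)
```

Note that the box here starts above its own cell (y_min = 85 while the cell begins at 100), yet it is still assigned to that cell because only the midpoint matters.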


Intersection over Union (IoU) helps us decide whether a predicted bounding box is a good outcome (or a bad one). It calculates the intersection over union of the actual bounding box and the predicted bounding box. Consider the actual and predicted bounding boxes for a car as shown below:


The red box is the actual bounding box and the blue box is the predicted one. IoU is the area of the intersection of these two boxes divided by the area of their union. Those areas will be:


IoU = Area of the intersection / Area of the union, i.e.

IoU = Area of yellow box / Area of green box

If IoU is greater than 0.5, we can say that the prediction is good enough. 0.5 is an arbitrary threshold we have taken here, and it can be changed according to your specific problem. Intuitively, the higher the threshold, the more closely a prediction must match the actual box to count as good.


One of the most common problems with object detection algorithms is that rather than detecting an object just once, they might detect it multiple times. Consider the below image:


Here, the cars are identified more than once. The Non-Max Suppression technique cleans this up so that we get only a single detection per object:

1. It first looks at the probabilities associated with each detection and takes the largest one. In the above image, 0.9 is the highest probability, so the box with 0.9 probability will be selected first:


2. Now, it looks at all the other boxes in the image. The boxes which have a high IoU with the current box are suppressed. So, the boxes with 0.6 and 0.7 probabilities will be suppressed in our example:


3. After these boxes have been suppressed, it selects the box with the highest probability among the remaining boxes, which is 0.8 in our case:


4. Again, it looks at the IoU of this box with the remaining boxes and suppresses the boxes with a high IoU:


5. We repeat these steps until all the boxes have either been selected or suppressed, and we get the final bounding boxes:


The algorithm simply keeps the boxes with the maximum probability and suppresses the nearby boxes with non-max probabilities. To summarize the Non-Max Suppression algorithm:

1. Discard all the boxes having probabilities less than or equal to a pre-defined threshold (say, 0.5).

2. For the remaining boxes:

  • Pick the box with the highest probability and take that as the output prediction.

  • Discard any other box which has IoU greater than the threshold with the output box from the above step.

  • Repeat step 2 until all the boxes are either taken as the output prediction or discarded.

  • Anchor Boxes method can also be used to improve the performance of a YOLO algorithm.
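The Non-Max Suppression steps summarized above can be sketched in plain NumPy as follows. This is an illustrative version only: the function names, the (x1, y1, x2, y2) corner format, and the example boxes are assumptions, with scores chosen to mirror the 0.9 / 0.8 / 0.7 / 0.6 example in the text.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    # Step 1: discard boxes at or below the probability threshold
    order = np.argsort(scores)[::-1]
    candidates = [i for i in order if scores[i] > score_threshold]
    keep = []
    # Step 2: repeatedly take the best box and drop its close neighbours
    while candidates:
        best = candidates.pop(0)
        keep.append(int(best))
        candidates = [i for i in candidates
                      if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Hypothetical detections: three boxes on one car, two on another
boxes = np.array([
    [10, 10, 60, 60],     # car A
    [12, 12, 62, 62],     # car A, duplicate
    [8, 9, 58, 61],       # car A, duplicate
    [100, 20, 160, 70],   # car B
    [103, 22, 163, 72],   # car B, duplicate
], dtype=float)
scores = np.array([0.9, 0.6, 0.7, 0.8, 0.7])

print(non_max_suppression(boxes, scores))  # [0, 3]
```

Only the 0.9 box and the 0.8 box remain, one per car. In practice this is usually run separately for each class, which this sketch leaves out.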


The input for training our model will obviously be images and their corresponding y labels. Let’s see an image and make its y label:


  • Consider the scenario where we are using a 3 X 3 grid with two anchors per grid, and there are 3 different object classes.

  • So the corresponding y labels will have a shape of 3 X 3 X 16, since each of the two anchors needs 8 values. Now suppose we use 5 anchor boxes per grid and the number of classes increases to 5.

  • Each anchor box then needs 5 + 5 = 10 values, so the target will be 3 X 3 X 10 X 5 = 3 X 3 X 50. This is how the training process is done: we take an image of a particular shape and map it to a 3 X 3 X 16 target (this may change as per the grid size, number of anchor boxes and the number of classes).
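The arithmetic behind these target shapes is easy to check with a small helper. The function name `target_shape` is hypothetical; the formula is the one used in the bullets above, where each anchor box contributes pc, four box values, and one value per class.

```python
def target_shape(grid_size, num_anchors, num_classes):
    """Shape of the y label for a square grid with anchor boxes."""
    values_per_anchor = 5 + num_classes   # pc, bx, by, bh, bw + class entries
    return (grid_size, grid_size, num_anchors * values_per_anchor)

print(target_shape(3, 2, 3))   # (3, 3, 16)
print(target_shape(3, 5, 5))   # (3, 3, 50)
```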


  • The new image will be divided into the same number of grids which we chose during training. The model will predict an output of shape 3 X 3 X 16, that is, 16 values for each grid (assuming this is the shape of the target during training time).

  • The 16 values in this prediction will be in the same format as that of the training label.

  • The first 8 values will correspond to anchor box 1, where the first value will be the probability of an object in that grid.

  • Values 2-5 will be the bounding box coordinates for that object, and the last three values will tell us which class the object belongs to.

  • The next 8 values will be for anchor box 2 and in the same format, i.e., first the probability, then the bounding box coordinates, and finally the classes.

  • Finally, the Non-Max Suppression technique will be applied on the predicted boxes to obtain a single prediction per object.
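As a rough illustration of the layout described above, the sketch below splits the 16 values predicted for one grid cell into its two anchor boxes. The helper name `decode_cell` and the numbers in `cell` are made up for the example.

```python
import numpy as np

NUM_ANCHORS = 2
NUM_CLASSES = 3   # Pedestrian, Car, Motorcycle

def decode_cell(cell_values):
    """Split one cell's 16 predicted values into per-anchor predictions."""
    per_anchor = np.asarray(cell_values, dtype=float).reshape(
        NUM_ANCHORS, 5 + NUM_CLASSES)
    decoded = []
    for values in per_anchor:
        decoded.append({
            "pc": float(values[0]),                      # value 1: object probability
            "box": values[1:5].tolist(),                 # values 2-5: bx, by, bh, bw
            "class_index": int(np.argmax(values[5:])),   # last 3 values: classes
        })
    return decoded

# Hypothetical prediction: anchor 1 sees nothing, anchor 2 sees a car
cell = [0.05, 0, 0, 0, 0, 0, 0, 0,
        0.92, 0.4, 0.3, 0.9, 0.5, 0.02, 0.95, 0.03]

for anchor_id, pred in enumerate(decode_cell(cell), start=1):
    print(anchor_id, pred)
```

Here `class_index` 1 corresponds to Car, the second class in the list.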

  • Exact dimensions and steps that the YOLO algorithm follows:

1. Takes an input image of shape (608, 608, 3).

2. Passes this image to a convolutional neural network (CNN), which returns a (19, 19, 5, 85) dimensional output.

3. The last two dimensions of the above output are flattened to get an output volume of (19, 19, 425):

  • Here, each cell of a 19 X 19 grid returns 425 numbers.

  • 425 = 5 * 85, where 5 is the number of anchor boxes per grid.

  • 85 = 5 + 80, where 5 is (pc, bx, by, bh, bw) and 80 is the number of classes we want to detect.

4. Finally, do the IoU and Non-Max Suppression to avoid selecting overlapping boxes.
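The flattening in step 3 is just a reshape. The NumPy sketch below uses a random array as a stand-in for the CNN output (an assumption, since no real network is run here) and checks where each anchor box's 85 values end up.

```python
import numpy as np

rng = np.random.default_rng(0)
cnn_output = rng.random((19, 19, 5, 85))   # stand-in for the network output

# Flatten the last two dimensions: 5 anchors * 85 values = 425 per cell
flattened = cnn_output.reshape(19, 19, 5 * 85)
print(flattened.shape)   # (19, 19, 425)

# The 85 values of anchor k in a cell occupy slots [k * 85, (k + 1) * 85)
k = 2
assert np.array_equal(flattened[7, 4, k * 85:(k + 1) * 85], cnn_output[7, 4, k])
```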


Step-1: Import the required libraries.

Step-2: Create a function for filtering the boxes based on their probabilities and threshold:

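# Assumed imports: K is the Keras backend (TF 1.x-era API, as used below)
import tensorflow as tf
from keras import backend as K

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=0.6):
    # Score of every class for every box: pc * class probability
    box_scores = box_confidence * box_class_probs
    # Best class for each box and the score of that class
    box_classes = K.argmax(box_scores, axis=-1)
    box_class_scores = K.max(box_scores, axis=-1)
    # Keep only the boxes whose best score exceeds the threshold
    filtering_mask = box_class_scores > threshold
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)
    return scores, boxes, classes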

Step-3: Define a function to calculate the IoU between two boxes:

def iou(box1, box2):
    # Boxes are (x1, y1, x2, y2), with (x1, y1) the top-left corner
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp at zero so that non-overlapping boxes have no intersection
    inter_area = max(yi2 - yi1, 0) * max(xi2 - xi1, 0)
    box1_area = (box1[3] - box1[1]) * (box1[2] - box1[0])
    box2_area = (box2[3] - box2[1]) * (box2[2] - box2[0])
    union_area = box1_area + box2_area - inter_area
    return inter_area / union_area

Step-4: Define a function for Non-Max Suppression.

Step-5: Create a random volume of shape (19,19,5,85) and then predict the bounding boxes.
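The code for this step is not shown here. As a rough NumPy-only stand-in, the sketch below creates a random (19, 19, 5, 85) volume and applies the same score filtering idea as Step-2. The function name `filter_boxes_np` and the slicing of the 85 values into pc, box, and class probabilities (in that order) are assumptions for illustration; this is not the original implementation and it does not run a real model.

```python
import numpy as np

def filter_boxes_np(volume, threshold=0.6):
    """Score filtering on a (19, 19, 5, 85) volume using plain NumPy."""
    box_confidence = volume[..., 0:1]    # pc, shape (19, 19, 5, 1)
    boxes = volume[..., 1:5]             # bx, by, bh, bw
    box_class_probs = volume[..., 5:]    # 80 class probabilities
    box_scores = box_confidence * box_class_probs
    box_classes = box_scores.argmax(axis=-1)
    box_class_scores = box_scores.max(axis=-1)
    mask = box_class_scores > threshold  # boolean mask over all 19*19*5 boxes
    return box_class_scores[mask], boxes[mask], box_classes[mask]

rng = np.random.default_rng(1)
volume = rng.random((19, 19, 5, 85))     # random values, not real predictions
scores, boxes, classes = filter_boxes_np(volume)
print(scores.shape, boxes.shape, classes.shape)
```

Each kept box contributes one score, four box values, and one class index, so the three outputs share the same first dimension.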

Step-6: Finally, we will define a function which will take the outputs of a CNN as input and return the suppressed boxes:

Step-7: Use the yolo_eval function to make predictions for a random volume:

scores, boxes, classes = yolo_eval(yolo_outputs)

Step-8: Use a pretrained YOLO algorithm on new images and see how it works:

sess = K.get_session()
class_names = read_classes("model_data/coco_classes.txt")
anchors = read_anchors("model_data/yolo_anchors.txt")
yolo_model = load_model("model_data/yolo.h5")

Step-9: Define a function to predict the bounding boxes and save the images with these bounding boxes included.

Step-10: Read an image and make predictions using the predict function:

img = plt.imread('images/img.jpg')
image_shape = float(img.shape[0]), float(img.shape[1])
scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Step-11: Plot the predictions:

out_scores, out_boxes, out_classes = predict(sess, "img.jpg")


For the complete implementation of the YOLO algorithm check out the below


  • YOLO is a state-of-the-art object detection algorithm that is incredibly fast and accurate

  • We send an input image to a CNN which outputs a 19 X 19 X 5 X 85 dimension volume.

  • Here, the grid size is 19 X 19 and each grid contains 5 boxes

  • We filter through all the boxes using Non-Max Suppression, keep only the accurate boxes, and also eliminate overlapping boxes

GitHub link:

