YOLO V3 ML Project
On
Convolutional Neural Networks for Object Detection (YOLO)
Submitted By-
Abstract
Introduction
Background
Dataset
Libraries Used
Bibliography
Abstract
The goal of object detection is to detect all instances of objects from a known class, such as
people, cars, or faces, in an image. Typically only a small number of instances of the object are
present in the image, but there is a very large number of possible locations and scales at which they
can occur, and these need to be explored somehow. Each detection is reported with some form of pose
information. This could be as simple as the location of the object, a location and scale, or the extent
of the object defined in terms of a bounding box. In other situations the pose information is more
detailed and contains the parameters of a linear or non-linear transformation. For example a face
detector may compute the locations of the eyes, nose and mouth, in addition to the bounding box of
the face. An example of a bicycle detection that specifies the locations of certain parts is shown in
Figure 1. The pose could also be defined by a three-dimensional transformation specifying the
location of the object relative to the camera. Object detection systems construct a model for an object
class from a set of training examples. In the case of a fixed rigid object only one example may be
needed, but more generally multiple training examples are necessary to capture certain aspects of
class variability.
Object detection is a computer technology, related to computer vision and image processing, that detects
and localizes objects such as humans, buildings, and cars in digital images and videos (MATLAB).
This technology can locate and classify one or several objects within a digital image at once.
Object detection has been around for years, but is becoming more apparent across a range of industries
now more than ever before.
You can take a classifier like VGGNet or Inception and turn it into an object detector by sliding a
small window across the image. At each step you run the classifier to get a prediction of what sort of
object is inside the current window. Using a sliding window gives hundreds or thousands of
predictions for that image, but you keep only the ones the classifier is most certain about.
This approach works but it’s obviously going to be very slow, since you need to run the classifier
many times. A slightly more efficient approach is to first predict which parts of the image contain
interesting information — so-called region proposals — and then run the classifier only on these
regions. The classifier has to do less work than with the sliding windows but still gets run many
times over.
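The sliding-window idea above can be sketched in a few lines. This is an illustration only: the window size, stride, and the toy classifier standing in for a real network such as VGGNet are all assumptions.

```python
# Sketch of the sliding-window approach (illustrative only). classify()
# stands in for a real classifier such as VGGNet or Inception.

def sliding_windows(img_w, img_h, win=64, stride=32):
    """Yield (x, y, w, h) windows covering the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)

def detect(img_w, img_h, classify, threshold=0.9):
    """Run the classifier on every window and keep only confident hits."""
    hits = []
    for box in sliding_windows(img_w, img_h):
        label, score = classify(box)
        if score >= threshold:
            hits.append((box, label, score))
    return hits

# Toy classifier: pretends there is a "car" in the window at (96, 64).
def toy_classify(box):
    x, y, w, h = box
    return ("car", 0.95) if (x, y) == (96, 64) else ("background", 0.1)

print(len(list(sliding_windows(256, 256))))  # classifier runs this many times
print(detect(256, 256, toy_classify))
```

Even on a small 256x256 image the classifier runs dozens of times, which is why the approach is slow.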
YOLO takes a completely different approach. It’s not a traditional classifier that is repurposed to be
an object detector. YOLO actually looks at the image just once (hence its name: You Only Look
Once) but in a clever way.
Introduction
Object detection is one of the classical problems in computer vision where you work to
recognize what and where — specifically what objects are inside a given image and also where they are
in the image. Object detection is more complex than classification, which can also
recognize objects but doesn't indicate where the object is located in the image. In addition,
classification assigns a single label to the whole image, so it doesn't work on images
containing more than one object.
YOLO uses a totally different approach. YOLO is a clever convolutional neural network (CNN) for
doing object detection in real-time. The algorithm applies a single neural network to the full image, and
then divides the image into regions and predicts bounding boxes and probabilities for each region. These
bounding boxes are weighted by the predicted probabilities.
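The grid division described above can be made concrete: the cell that contains an object's centre is the one responsible for predicting it. This is a sketch, not the actual YOLO code; the grid size S=7 and the image dimensions are assumptions.

```python
# Illustrative sketch: mapping an object's centre to the grid cell
# responsible for predicting it, assuming the image is split into an
# S x S grid (S = 7 here).

def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell containing centre (cx, cy)."""
    col = min(int(cx * S / img_w), S - 1)  # clamp so cx == img_w stays valid
    row = min(int(cy * S / img_h), S - 1)
    return row, col

# In a 448x448 image, an object centred at (224, 100) falls in cell (1, 3).
print(responsible_cell(224, 100, 448, 448))
```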
YOLO is popular because it achieves high accuracy while also being able to run in real-time. The
algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass
through the neural network to make predictions. After non-max suppression (which makes sure the
object detection algorithm only detects each object once), it then outputs recognized objects together
with the bounding boxes.
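Non-max suppression, mentioned above, can be sketched as follows. Boxes are in (x1, y1, x2, y2) corner format; the 0.5 overlap threshold and the helper names are illustrative choices, not part of any particular implementation.

```python
# A minimal non-max suppression sketch: keep the highest-scoring box,
# drop boxes that overlap it too much, and repeat.

def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Return indices of boxes kept after non-max suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one
```

This is what ensures each object is reported only once even when several nearby boxes fire on it.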
With YOLO, a single CNN simultaneously predicts multiple bounding boxes and class probabilities for
those boxes. YOLO trains on full images and directly optimizes detection performance.
As with a lot of research work in deep learning, much of the effort is trial and error. In developing
YOLOv3 this was very much the case: the team tried a lot of different ideas, and many of them didn't work
out. A few of the things that stuck include: a new feature-extraction network consisting of
53 convolutional layers, a new detection metric, predicting an “objectness” score for each bounding box
using logistic regression, and using binary cross-entropy loss for the class predictions during training.
The end result is that YOLOv3 runs significantly faster than other detection methods with comparable
performance. In addition, YOLO no longer struggles with small objects.
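The binary cross-entropy loss mentioned above can be sketched numerically. Treating each class as an independent sigmoid output (rather than a softmax) allows multi-label targets; the example logits and labels below are made up for illustration.

```python
# Sketch of binary cross-entropy over independent per-class sigmoid
# outputs, as used for class predictions in YOLOv3-style training.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(targets, logits):
    """Mean binary cross-entropy over independent class outputs."""
    total = 0.0
    for t, z in zip(targets, logits):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)

# An object labelled with two classes at once (e.g. "person" and "woman"):
# independent sigmoids handle this, which a single softmax could not.
print(round(bce([1, 1, 0], [4.0, 3.0, -4.0]), 4))
```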
Background
Artificial neural networks (ANNs), or connectionist systems, are computing systems inspired
by, but not identical to, the biological neural networks that constitute animal brains. Such systems
“learn” to perform tasks by considering examples, generally without being explicitly programmed
with task-specific rules. For example, in image recognition they might learn to identify images that
contain cats by analyzing example images that have been manually labelled as “cat” or “no cat” and
using the results to identify cats in other images.
An ANN is based on a collection of connected units or nodes called artificial
neurons, which loosely model the neurons in a biological brain. Neural networks can be classified
into many types, such as single-layer feed-forward networks, multilayer feed-forward networks, multilayer
perceptrons, convolutional neural networks, recurrent neural networks, radial basis function
networks, and many others. The neural network used in our project is a feed-forward network
trained with backpropagation. In our multilayer perceptron the input layer reads in 50 features
extracted from an instrumental music sample. The hidden layer, with a sigmoid activation, has 30 hidden
nodes, reducing the feature dimension to 30. The activation function for the output layer is the softmax
function, which gives a probability distribution over the output labels. During training, the
main step is backpropagation, a method for adjusting the connection weights to compensate
for each error found during learning; the error amount is effectively divided among the
connections. Technically, backpropagation calculates the gradient (the derivative) of the cost
function with respect to the weights. The weight updates can be done via stochastic
gradient descent or other methods, such as extreme learning machines; alternatives that train
without backpropagation include “no-prop” networks, weightless networks, and non-connectionist neural networks.
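The forward pass of the multilayer perceptron described above (50 inputs, a 30-unit sigmoid hidden layer, a softmax output) can be sketched in pure Python. The weights here are random placeholders and the number of output labels (10) is an assumption; a real model would learn the weights via backpropagation.

```python
# Forward pass of the described MLP: 50 inputs -> 30 sigmoid hidden units
# -> softmax output. Weights are random stand-ins, not trained values.
import math
import random

random.seed(0)
N_IN, N_HID, N_OUT = 50, 30, 10  # N_OUT (label count) is an assumption

W1 = [[random.uniform(-0.1, 0.1) for _ in range(N_IN)] for _ in range(N_HID)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(N_HID)] for _ in range(N_OUT)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return softmax(logits)

probs = forward([0.5] * N_IN)
print(len(probs), round(sum(probs), 6))  # softmax yields a distribution
```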
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks,
most commonly applied to analyzing visual imagery. CNNs are regularized versions of multilayer
perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one
layer is connected to all neurons in the next layer. The "fully-connectedness" of these networks makes
them prone to overfitting the data. Typical ways of regularization include adding some form of
magnitude penalty on the weights to the loss function. However, CNNs take a different approach towards
regularization: they take advantage of the hierarchical pattern in data and assemble more complex
patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity,
CNNs are on the lower extreme.
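The weight sharing that distinguishes a CNN from a fully connected MLP can be illustrated with a minimal 2D convolution: one small kernel is slid over the whole input, so the same few weights are reused at every position. This is a pure-Python sketch (valid convolution, no padding), not a real CNN layer.

```python
# Minimal 2D convolution: the same 3x3 kernel is applied at every
# position of the input, which is exactly the "weight sharing" that makes
# CNNs far less connected (and less overfitting-prone) than an MLP.

def conv2d(image, kernel):
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

# A vertical-edge kernel responds only where values change left to right.
image = [[0, 0, 0, 1, 1, 1]] * 5
kernel = [[-1, 0, 1]] * 3
print(conv2d(image, kernel))  # strong response at the edge, zero elsewhere
```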
This model is built upon the fastai library, which sits on top of the PyTorch framework. It is a
fairly easy-to-use library for implementing complicated models, even for newcomers like us.
A bounding box has a very specific definition: it is a rectangle that contains the object
entirely but is no bigger than it has to be. This task is quite complicated in itself,
since we must predict both the position and the size of the recognized object and draw a
box around it using regression.
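The box regression described above needs a concrete representation. A common convention (assumed here, not mandated by the text) is for the network to predict a centre point and a size, which is then converted to the corner coordinates actually drawn on screen.

```python
# Converting between the two common bounding-box representations:
# (centre_x, centre_y, width, height) predicted by regression, and
# (x_min, y_min, x_max, y_max) corners used for drawing.

def center_to_corners(cx, cy, w, h):
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_center(x1, y1, x2, y2):
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

box = center_to_corners(100, 80, 40, 20)
print(box)                      # (80.0, 70.0, 120.0, 90.0)
print(corners_to_center(*box))  # round-trips back to (100, 80, 40, 20)
```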
Dataset
We will be looking at the Pascal VOC dataset. There are two different competition/research
datasets, from 2007 and 2012; we'll be using the 2007 version. You can use the larger 2012
version for better results, or even combine the two.
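The fastai course distributes the 2007 annotations as COCO-style JSON (images, annotations, categories). The sketch below mirrors that structure with inline sample data standing in for the real annotation file; the field names follow the COCO convention, and the sample values are illustrative.

```python
# Sketch of reading Pascal VOC annotations in the COCO-style JSON format
# used by the fastai course. Inline sample data stands in for the file.
import json

sample = json.loads("""{
  "images": [{"id": 12, "file_name": "000012.jpg"}],
  "annotations": [{"image_id": 12, "category_id": 7,
                   "bbox": [155, 96, 196, 174]}],
  "categories": [{"id": 7, "name": "car"}]
}""")

id_to_file = {img["id"]: img["file_name"] for img in sample["images"]}
id_to_cat = {c["id"]: c["name"] for c in sample["categories"]}

for ann in sample["annotations"]:
    # COCO bboxes are [x, y, width, height] measured from the top-left.
    print(id_to_file[ann["image_id"]],
          id_to_cat[ann["category_id"]], ann["bbox"])
```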
Fastai
The fastai library simplifies training fast and accurate neural nets using
modern best practices. It's based on research into deep learning best
practices undertaken at fast.ai, including "out of the box" support
for vision, text, tabular, and collab (collaborative filtering) models.
pathlib (Path)
Pillow (PIL)
Matplotlib
Library Import
Dataset Setup
Drawing Box
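A box-drawing step like the one this outline names can be sketched with Matplotlib, in the style of the fastai course notebooks. The image below is a random array standing in for a real photo, and the box coordinates and "car" label are made-up examples.

```python
# Drawing a predicted bounding box and label on an image with Matplotlib.
# The random image and the (x, y, width, height) box are placeholders.
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt
import matplotlib.patches as patches

img = [[random.random() for _ in range(224)] for _ in range(224)]

fig, ax = plt.subplots()
ax.imshow(img, cmap="gray")
bb = (50, 30, 120, 80)  # x, y, width, height of the detected object
rect = patches.Rectangle((bb[0], bb[1]), bb[2], bb[3],
                         fill=False, edgecolor="white", lw=2)
ax.add_patch(rect)
ax.text(bb[0], bb[1], "car", color="white",
        bbox=dict(facecolor="black", alpha=0.5))
fig.savefig("detection.png")
```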