YOLO V3 ML Project


Project

On
Convolutional Neural Network for Object Detection (YOLO)

Submitted By-

Registration No. Name

11711777 Harshal Nandigramwar

11711818 Rishi Raj Singh

Programme Name: Computer Science and Engineering

Under the Guidance of

Dr. Ashmita Pandey


School of Computer Science and Engineering
Lovely Professional University, Phagwara, Punjab (Aug-Nov, 2019)
INDEX

Abstract

Introduction

Background

• Artificial Neural Network

• Convolutional Neural Network

How This Works

What We Have Done

• Main Tasks To Perform

• Dataset

• Libraries Used

• Code Snippets and Screenshots

Bibliography
Abstract

The goal of object detection is to detect all instances of objects from a known class, such as
people, cars or faces, in an image. Typically only a small number of instances of the object are
present in the image, but there are a very large number of possible locations and scales at which they
can occur, and these somehow need to be explored. Each detection is reported with some form of pose
information. This could be as simple as the location of the object, a location and scale, or the extent
of the object defined in terms of a bounding box. In other situations the pose information is more
detailed and contains the parameters of a linear or non-linear transformation. For example, a face
detector may compute the locations of the eyes, nose and mouth, in addition to the bounding box of
the face. An example of a bicycle detection that specifies the locations of certain parts is shown in
Figure 1. The pose could also be defined by a three-dimensional transformation specifying the
location of the object relative to the camera. Object detection systems construct a model for an object
class from a set of training examples. In the case of a fixed rigid object only one example may be
needed, but more generally multiple training examples are necessary to capture certain aspects of
class variability.

Object detection is a computer technology, related to computer vision and image processing, that detects
and localizes objects such as humans, buildings and cars in digital images and videos (MATLAB).
This technology can identify and classify one or several objects within a digital image at once.
Object detection has been around for years, but it is now being applied across a wider range of
industries than ever before.

You can take a classifier like VGGNet or Inception and turn it into an object detector by sliding a
small window across the image. At each step you run the classifier to get a prediction of what sort of
object is inside the current window. Using a sliding window gives several hundred or even thousands
of predictions for that image, but you only keep the ones the classifier is most certain about.
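As a rough illustration of the sliding-window idea, here is a minimal Python sketch. The classify function is a hypothetical stand-in for a classifier such as VGGNet or Inception, and the window size, stride and confidence threshold are arbitrary choices.

    # A minimal sliding-window detector sketch. `classify` is assumed to take a
    # PIL image crop and return (label, confidence); it is not a real library call.
    from PIL import Image

    def sliding_window_detect(image_path, classify, window=224, stride=64, threshold=0.9):
        img = Image.open(image_path)
        detections = []
        for top in range(0, img.height - window + 1, stride):
            for left in range(0, img.width - window + 1, stride):
                crop = img.crop((left, top, left + window, top + window))
                label, confidence = classify(crop)      # run the classifier on this window
                if confidence >= threshold:             # keep only confident predictions
                    detections.append((left, top, window, window, label, confidence))
        return detections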

This approach works but it’s obviously going to be very slow, since you need to run the classifier
many times. A slightly more efficient approach is to first predict which parts of the image contain
interesting information — so-called region proposals — and then run the classifier only on these
regions. The classifier has to do less work than with the sliding windows but still gets run many
times over.

YOLO takes a completely different approach. It’s not a traditional classifier that is repurposed to be
an object detector. YOLO actually looks at the image just once (hence its name: You Only Look
Once) but in a clever way.
Introduction
Object detection is one of the classical problems in computer vision, where you work to
recognize what and where — specifically, what objects are inside a given image and where they are
in the image. The problem of object detection is more complex than classification, which can also
recognize objects but does not indicate where the object is located in the image. In addition, plain
classification does not handle images containing more than one object.

YOLO uses a totally different approach. YOLO is a clever convolutional neural network (CNN) for
doing object detection in real-time. The algorithm applies a single neural network to the full image, and
then divides the image into regions and predicts bounding boxes and probabilities for each region. These
bounding boxes are weighted by the predicted probabilities.
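To make the grid idea concrete, here is a purely illustrative PyTorch sketch of the shape of a YOLO-style prediction tensor, assuming an S x S grid, B boxes per grid cell and C classes. The numbers are examples only; the real YOLOv3 predicts at three different scales.

    import torch

    S, B, C = 13, 3, 20                    # grid size, boxes per cell, classes (illustrative)
    pred = torch.randn(S, S, B * (5 + C))  # 5 = (x, y, w, h, objectness) for each box
    print(pred.shape)                      # torch.Size([13, 13, 75])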

YOLO is popular because it achieves high accuracy while also being able to run in real-time. The
algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass
through the neural network to make predictions. After non-max suppression (which makes sure the
object detection algorithm only detects each object once), it then outputs recognized objects together
with the bounding boxes.
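Non-max suppression itself is a simple greedy procedure: repeatedly keep the most confident box and discard the remaining boxes that overlap it too much. The sketch below assumes NumPy arrays of [x1, y1, x2, y2] corner boxes and one confidence score per box; the IoU threshold of 0.5 is an illustrative choice.

    import numpy as np

    def iou(box, boxes):
        """Intersection-over-union between one box and an array of boxes."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (box[2] - box[0]) * (box[3] - box[1])
        area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_a + area_b - inter)

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        order = np.argsort(scores)[::-1]        # most confident boxes first
        keep = []
        while len(order) > 0:
            best = order[0]
            keep.append(best)
            rest = order[1:]
            # drop boxes that overlap the chosen box by more than the threshold
            order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
        return keep                             # indices of the boxes to keep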

With YOLO, a single CNN simultaneously predicts multiple bounding boxes and class probabilities for
those boxes. YOLO trains on full images and directly optimizes detection performance.

As with a lot of research work in deep learning, much of the effort is trial and error. In developing
YOLOv3 this was very much the case: the team tried a lot of different ideas, but many of them didn't work
out. A few of the things that stuck include a new network for performing feature extraction consisting of
53 convolutional layers (Darknet-53), a new detection metric, predicting an "objectness" score for each
bounding box using logistic regression, and using binary cross-entropy loss for the class predictions
during training. The end result is that YOLOv3 runs significantly faster than other detection methods with
comparable performance. In addition, YOLOv3 handles small objects much better than earlier YOLO versions.
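The last two ideas can be shown in a short, hedged PyTorch sketch: a logistic (sigmoid) objectness score per box, and independent binary cross-entropy terms for the class predictions instead of a softmax. The tensors below are random placeholders rather than real network outputs, and the full YOLOv3 loss also contains terms for the box coordinates.

    import torch
    import torch.nn.functional as F

    obj_logits = torch.randn(8)              # objectness logits for 8 example boxes
    cls_logits = torch.randn(8, 20)          # class logits for the 20 Pascal VOC classes
    obj_targets = torch.randint(0, 2, (8,)).float()
    cls_targets = torch.randint(0, 2, (8, 20)).float()

    # sigmoid + binary cross-entropy for objectness, and per-class BCE for the labels
    obj_loss = F.binary_cross_entropy_with_logits(obj_logits, obj_targets)
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    print((obj_loss + cls_loss).item())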
Background

1. Artificial Neural Network:

Artificial neural networks (ANNs), or connectionist systems, are computing systems that are inspired
by, but not identical to, the biological neural networks that constitute animal brains. Such systems
"learn" to perform tasks by considering examples, generally without being explicitly programmed
with task-specific rules. For example, in image recognition they might learn to identify images that
contain cats by analyzing example images that have been manually labelled as "cat" or "no cat" and
using the result to identify cats in other images.

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely
model the neurons in a biological brain. Neural networks can be classified into many types, such as
single-layer feed-forward networks, multilayer feed-forward networks, multilayer perceptrons,
convolutional neural networks, recurrent neural networks, radial basis function networks and many
others. The networks used in our project are feed-forward networks trained with backward propagation.
As an illustration, consider a multilayer perceptron whose input layer reads in 50 features: a hidden
layer with a sigmoid activation function and 30 hidden nodes reduces the feature dimension to 30, and
the output layer uses the softmax activation function, which gives a probability distribution over the
output labels. During training, the main step is backpropagation, a method of adjusting the connection
weights to compensate for each error found during learning; the error amount is effectively divided
among the connections. Technically, backpropagation calculates the gradient (the derivative) of the
cost function with respect to the weights. The weight updates can be done via stochastic gradient
descent or other methods, such as extreme learning machines, "no-prop" networks, training without
backtracking, "weightless" networks, and non-connectionist neural networks.
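Below is a hedged PyTorch sketch of the multilayer perceptron described above (50 inputs, a 30-unit sigmoid hidden layer, a softmax output), trained with backpropagation and stochastic gradient descent; the training data is random and the choice of 10 output classes is arbitrary.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(50, 30), nn.Sigmoid(),   # hidden layer reduces 50 features to 30
        nn.Linear(30, 10),                 # 10 output classes (illustrative)
    )
    loss_fn = nn.CrossEntropyLoss()        # applies log-softmax internally
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(64, 50)                # random placeholder inputs
    y = torch.randint(0, 10, (64,))        # random placeholder labels
    for _ in range(100):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                    # backpropagation: gradient of the loss w.r.t. the weights
        opt.step()                         # stochastic gradient descent weight update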

2. Convolutional Neural Network (CNN):

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks,
most commonly applied to analyzing visual imagery. CNNs are regularized versions of multilayer
perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one
layer is connected to all neurons in the next layer. The "fully-connectedness" of these networks makes
them prone to overfitting data. Typical ways of regularization include adding some form of magnitude
measurement of weights to the loss function. However, CNNs take a different approach towards
regularization: they take advantage of the hierarchical pattern in data and assemble more complex
patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity,
CNNs are on the lower extreme.
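A minimal PyTorch sketch shows the idea: small shared convolutional filters and pooling layers assemble patterns hierarchically with far fewer weights than a fully connected network on the same input. All layer sizes below are illustrative.

    import torch
    import torch.nn as nn

    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                           # 224 -> 112
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                           # 112 -> 56
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # global average pooling to a 32-d vector
        nn.Linear(32, 20),                         # e.g. scores for the 20 Pascal VOC classes
    )
    print(cnn(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 20])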

How This Works:-


Let’s start with our own testing image below.

The objects detected by YOLO:


What We Have Done:-

This model is built upon the fastai library, which sits on top of the PyTorch framework. It is
quite an easy-to-use library for implementing complicated models for beginners like us.

• Main Tasks To Perform:-

• We have multiple things that we are classifying.

Multi-label classification is a type of classification in which an object can be categorized into
more than one class. For example, we might classify a picture as an image of a dog or a cat and
also classify the same image based on the breed of the dog or cat.

• Bounding boxes around what we are classifying.

A bounding box has a very specific definition: it is a rectangle that fits entirely around the
object but is no bigger than it has to be. This task is quite complicated in itself. Here we are
supposed to predict the position as well as the size of the recognized object and draw a box
around it using regression, as in the sketch after this list.
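The sketch below illustrates both tasks on one backbone: a torchvision ResNet34 body with a small custom head that regresses four bounding-box coordinates and predicts 20 class scores (the Pascal VOC classes). It is written in plain PyTorch purely as an illustration of the idea; it is not the exact fastai model we trained.

    import torch
    import torch.nn as nn
    import torchvision

    class BoxAndClassModel(nn.Module):
        """ResNet34 features followed by a box-regression head and a class head."""
        def __init__(self, n_classes=20):
            super().__init__()
            self.backbone = torchvision.models.resnet34(pretrained=True)
            self.backbone.fc = nn.Identity()           # keep the 512-d pooled features
            self.box_head = nn.Linear(512, 4)          # (x1, y1, x2, y2) regression
            self.cls_head = nn.Linear(512, n_classes)  # class scores

        def forward(self, x):
            feats = self.backbone(x)
            return self.box_head(feats), self.cls_head(feats)

    model = BoxAndClassModel()
    boxes, classes = model(torch.randn(2, 3, 224, 224))
    # Training would combine an L1 loss on `boxes` with a cross-entropy loss on `classes`.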
• Dataset:-

We will be looking at the Pascal VOC dataset. There are two different competition/research
datasets, from 2007 and 2012. We'll be using the 2007 version. You can use the larger 2012
version for better results, or even combine the two. The annotation format can be parsed as in
the sketch below.
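Pascal VOC ships one XML annotation file per image. As a small sketch using only the standard library (the VOCdevkit directory layout below is the standard one after extracting the 2007 archive, and the path itself is an assumption), the class names and boxes can be read like this:

    import xml.etree.ElementTree as ET
    from pathlib import Path

    VOC_ROOT = Path('VOCdevkit/VOC2007')            # assumed extraction path

    def read_annotation(xml_path):
        """Return (image filename, [(class_name, xmin, ymin, xmax, ymax), ...])."""
        root = ET.parse(xml_path).getroot()
        filename = root.findtext('filename')
        objects = []
        for obj in root.iter('object'):
            name = obj.findtext('name')
            box = obj.find('bndbox')
            coords = [int(float(box.findtext(t))) for t in ('xmin', 'ymin', 'xmax', 'ymax')]
            objects.append((name, *coords))
        return filename, objects

    # Example: parse the first annotation file found on disk.
    first_xml = next((VOC_ROOT / 'Annotations').glob('*.xml'))
    print(read_annotation(first_xml))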

• Some Example Photos:-


• Libraries Used:-

• Fastai

The fastai library simplifies training fast and accurate neural nets using modern best practices.
It is based on research into deep learning best practices undertaken at fast.ai, including
"out of the box" support for vision, text, tabular, and collab (collaborative filtering) models.

• PATH (pathlib)

This module offers classes representing filesystem paths with semantics appropriate for different
operating systems. Path classes are divided between pure paths, which provide purely computational
operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O
operations.

• JSON (JavaScript Object Notation)

JSON, specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight
data-interchange format inspired by JavaScript object literal syntax (although it is not a strict
subset of JavaScript).

• PILLOW

The Python Imaging Library (abbreviated as PIL, in newer versions known as Pillow) is a free
library for the Python programming language that adds support for opening, manipulating, and
saving many different image file formats. It is available for Windows, Mac OS X and Linux. The
last version of the original PIL, 1.1.7, was released in September 2009 and supports Python
1.5.2–2.7; Python 3 support is provided by the Pillow fork.

• MATPLOTLIB

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety
of hardcopy formats and interactive environments across platforms. Matplotlib can be used in
Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and
four graphical user interface toolkits.
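As a small usage sketch tying these libraries together, the snippet below uses pathlib for paths, json for an annotation file, Pillow to open an image and Matplotlib to display it with one bounding box drawn on top. The file names and the JSON layout ('images' and 'annotations' keys with COCO-style [x, y, width, height] boxes) are assumptions made for illustration.

    import json
    from pathlib import Path

    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    from PIL import Image

    PATH = Path('data/pascal')                               # hypothetical dataset root
    anno = json.loads((PATH / 'pascal_train2007.json').read_text())

    img_info = anno['images'][0]
    img = Image.open(PATH / 'VOCdevkit/VOC2007/JPEGImages' / img_info['file_name'])

    x, y, w, h = anno['annotations'][0]['bbox']              # COCO-style [x, y, width, height]
    fig, ax = plt.subplots()
    ax.imshow(img)
    ax.add_patch(patches.Rectangle((x, y), w, h, fill=False, edgecolor='red', linewidth=2))
    plt.show()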
• Code Snippets and Screenshots

• Library Import

• Dataset Setup

• Drawing Box

• Defining Model and other Parameters

• Defining Learner and Plotting learning rate

• Finding an adequate Learning rate

• Complete model details of Box Building using RESNET34
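The original notebook screenshots for these steps are not reproduced here. As a rough, hedged approximation of the learner and learning-rate steps, the fastai v1 calls below (cnn_learner, lr_find, recorder.plot, fit_one_cycle) sketch the workflow on a classification-style data bunch; the data pipeline and exact arguments in our notebook differed.

    from fastai.vision import *    # fastai v1-style star import (brings in Path, models, etc.)

    path = Path('data/pascal')                                    # hypothetical dataset root
    data = ImageDataBunch.from_folder(path, valid_pct=0.2, size=224)

    learn = cnn_learner(data, models.resnet34, metrics=accuracy)  # ResNet34 backbone
    learn.lr_find()                  # sweep learning rates over a few mini-batches
    learn.recorder.plot()            # plot loss vs. learning rate to pick a sensible value
    learn.fit_one_cycle(5, max_lr=1e-3)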
Bibliography

• Fastai Deep Learning Course
• Fastai Library Documentation
• JSON Library Documentation
• Pillow Library Documentation
• Matplotlib Library Documentation
• Wikipedia
• Pascal VOC Dataset website
