Object Detection Using Deep Learning
Submitted by:
Hemant Dadhich (1604352, ETC-6)
Parbonee Sen (1604364, ETC-6)
Rounak Mittal (1604373, ETC-6)
Saumyajit Roy (1604384, ETC-6)
Abhigyan Nath (1604402, ETC-6)
Mentored by:
Madam Debolina Dey,
(Credentials)
This is to certify that the project report titled “Object Detection using Deep
Learning”,
submitted by:
Hemant Dadhich 1604352
Parbonee Sen 1604364
Rounak Mittal 1604373
Saumyajit Roy 1604384
Abhigyan Nath 1604402
in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Electronics and Telecommunications Engineering, is a bona fide
record of the work carried out under supervision and guidance at the School of
Electronics Engineering, Kalinga Institute of Industrial Technology.
Signature of Supervisor
Examiner 1 Examiner 2
Examiner 3 Examiner 4
ACKNOWLEDGEMENT
CONTENTS
A Introduction
A.1 What is Object Detection?
A.2 Background
A.3 Applications
A.4 Limitations
B.1 Introduction
H Conclusion
I References
List of Figures
Image Retrieval
Cascade Classifier
A.2 Background:
The goal of object detection is to detect all instances of objects from a known
class, such as people, cars or faces, in an image. Typically only a small number of
instances of the object are present in the image, but there are a very large number of
possible locations and scales at which they can occur, and these need to be explored
somehow. Each detection is reported with some form of pose information. This could
be as simple as the location of the object, a location and scale, or the extent of the
object defined in terms of a bounding box. In other situations the pose information is
more detailed and contains the parameters of a linear or non-linear transformation.
For example, a face detector may compute the locations of the eyes, nose and mouth,
in addition to the bounding box of the face. Object detection systems construct a
model for an object class from a set of training examples. In the case of a fixed rigid
object only one example may be needed, but more generally multiple training
examples are necessary to capture certain aspects of class variability.
A.3 Applications:
A. Facial Recognition: A deep learning facial recognition system called "DeepFace"
has been developed by a group of researchers at Facebook, which identifies human
faces in a digital image very effectively. Google uses its own facial recognition
system in Google Photos, which automatically segregates all the photos based on the
person in the image. There are various components involved in facial recognition,
such as the eyes, nose, mouth and the eyebrows.
B.1 Abstract
TensorFlow is a machine learning system that operates at large scale and in
heterogeneous environments. TensorFlow uses dataflow graphs to represent
computation, shared state, and the operations that mutate that state. It maps the
nodes of a dataflow graph across many machines in a cluster, and within a
machine across multiple computational devices, including multicore CPUs,
general-purpose GPUs, and custom-designed ASICs known as Tensor Processing
Units (TPUs). This architecture gives flexibility to the application developer:
whereas in previous “parameter server” designs the management of shared state is
built into the system, TensorFlow enables developers to experiment with novel
optimizations and training algorithms. TensorFlow supports a variety of
applications, with a focus on training and inference on deep neural networks.
Several Google services use TensorFlow in production.
2. Deferred execution
A typical TensorFlow application has two distinct phases: the first phase
defines the program (e.g., a neural network to be trained and the update rules)
as a symbolic dataflow graph with placeholders for the input data and variables
that represent the state; and the second phase executes an optimized version of
the program on the set of available devices. By deferring the execution until
the entire program is available, TensorFlow can optimize the execution phase by
using global information about the computation. For example, TensorFlow achieves
high GPU utilization by using the graph's dependency structure to issue a
sequence of kernels to the GPU without waiting for intermediate results.
While this design choice makes execution more efficient, more complex features,
such as dynamic control flow, have had to be pushed into the dataflow graph, so
that models using these features enjoy the same optimizations.
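As an illustration of this two-phase model, consider the following minimal sketch, written against the TensorFlow 1.x API used elsewhere in this report; the tensor shapes and values are purely illustrative. The first phase only builds a symbolic graph, and nothing is computed until the session runs it:

# Minimal sketch of deferred execution (TensorFlow 1.x API assumed; values illustrative).
import tensorflow as tf

# Phase 1: define the program as a symbolic dataflow graph.
x = tf.placeholder(tf.float32, shape=[None, 3], name='x')   # placeholder for input data
w = tf.Variable(tf.ones([3, 1]), name='w')                   # variable representing state
y = tf.matmul(x, w)                                          # no computation happens here

# Phase 2: execute an optimized version of the graph on the available devices.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]})
    print(result)   # [[6.]]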
Operations: An operation takes m ≥ 0 tensors as input and produces n ≥ 0 tensors
as output. An operation has a named "type" (such as Const, MatMul, or Assign)
and may have zero or more compile-time attributes that determine its behavior. An
operation can be polymorphic and variadic at compile-time: its attributes determine
both the expected types and arity of its inputs and outputs. For example, the simplest
operation Const has no inputs and a single output; its value is a compile-time attribute.
As another example, AddN sums multiple tensors of the same element type, and it has a
type attribute T and an integer attribute N that define its type signature.
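A brief sketch of these operations in code (TensorFlow 1.x API assumed; the tensor values are illustrative):

# Const and AddN operations (TensorFlow 1.x API assumed).
import tensorflow as tf

a = tf.constant([1.0, 2.0])      # Const: no inputs, one output; its value is an attribute
b = tf.constant([3.0, 4.0])
c = tf.constant([5.0, 6.0])

s = tf.add_n([a, b, c])          # AddN: attributes N=3 and T=float32 are fixed by the inputs

with tf.Session() as sess:
    print(sess.run(s))           # [ 9. 12.]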
Stateful operations: Variables: An operation can contain mutable state that is read
and/or written each time it executes. A Variable operation owns a mutable
buffer that may be used to store the shared parameters of a model as it is trained. A
variable has no inputs, and produces a reference handle, which acts as a typed
capability for reading and writing the buffer. A Read operation takes a reference
handle r as input, and outputs the value of the variable (State[r]) as a dense tensor.
Other operations modify the underlying buffer: for example, AssignAdd takes a
reference handle r and a tensor value x, and when executed performs the update
State'[r] ← State[r] + x. Subsequent Read(r) operations
produce the value State'[r].
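A small sketch of a variable being updated and read (TensorFlow 1.x API assumed; the numbers are illustrative):

# Variable, AssignAdd and Read operations (TensorFlow 1.x API assumed).
import tensorflow as tf

v = tf.Variable(10.0, name='v')      # Variable op owning a mutable buffer
update = tf.assign_add(v, 5.0)       # AssignAdd: State'[v] <- State[v] + 5.0
read_v = v.read_value()              # Read: returns the current value as a dense tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(update)
    print(sess.run(read_v))          # 15.0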
C.1 Abstract
Object detection is an important feature of computer science. The benefits of
object detection are, however, not limited to someone with a doctorate in informatics.
Instead, object detection is reaching deeper and deeper into the common parts of the
information society, lending a helping hand wherever needed. This paper addresses
one such possibility, namely the use of a Haar-cascade classifier. The main focus
is on the case study of a vehicle detection and counting system and the
possibilities it provides in a semi-enclosed area, both of the statistical kind and
for the common man. The goal of the system to be developed is to further ease and
augment everyday life.
C.2 Introduction
1.1 Computer vision: Computer vision is a field of informatics which teaches
computers to see. It is the way computers gather and interpret visual information from
the surrounding environment. Usually the image is first processed at a lower level to
enhance picture quality, for example to remove noise. Then the picture is processed at a
higher level, for example by detecting patterns and shapes, thereby trying to
determine what is in the picture.
1.3 Simple detection by colour: One way to detect objects is to simply classify objects
in images according to colour. This is the main variant used in, for example, robotic
soccer, where different teams assemble their robots and go head to head with
other teams. However, this colour-coded approach has its downsides. Experiments in
the International RoboCup competition have shown that lighting conditions are
extremely detrimental to the outcome of the game, and even the slightest ambient light
change can prove fatal to the success of one or the other team. Participants need to
recalibrate their systems multiple times even on the same field because of the minor
ambient light changes that occur with the time of day. Of course, this type of
detection is not suitable for most real-world applications, simply because of the constant
need for recalibration and maintenance.
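For completeness, a minimal sketch of such colour-based classification with OpenCV; the input file name and the colour range are illustrative assumptions, not values used in this work:

# Colour-based detection by simple thresholding (file name and colour range are illustrative).
import cv2
import numpy as np

image = cv2.imread('frame.jpg')                  # hypothetical input frame
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)     # HSV separates colour from brightness

lower = np.array([100, 150, 50])                 # assumed lower bound of the target colour
upper = np.array([130, 255, 255])                # assumed upper bound of the target colour
mask = cv2.inRange(hsv, lower, upper)            # pixels inside the range become white

# Find connected regions of the chosen colour and draw their bounding boxes
# (OpenCV 4 return signature for findContours).
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)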
1.5 Cascade classifier: The cascade classifier consists of a list of stages, where
each stage consists of a list of weak learners. The system detects objects in question
by moving a window over the image. Each stage of the classifier labels the specific
region defined by the current location of the window as either positive or negative:
positive meaning that the object was found, and negative meaning that the specified
object was not found in the image. If the labelling yields a negative result, then the
classification of this specific region is complete and the window is moved to the next
location. If the labelling gives a positive result, then the region moves on to the next
stage of classification. The classifier yields a final verdict of positive when all the
stages, including the last one, say that the object is found in the image. A true
positive means that the object in question is indeed in the image and the classifier
labels it as such, a positive result. A false positive means that the labelling process
falsely determines that the object is located in the image although it is not. A false
negative occurs when the classifier is unable to detect the actual object in the image,
and a true negative means that a non-object was correctly classified as not being the
object in question. In order to work well, each stage of the cascade must have a low
false negative rate, because if the actual object is classified as a non-object, then the
classification of that branch stops, with no way to correct the mistake. However, each
stage can have a relatively high false positive rate, because even if the n-th stage
classifies a non-object as actually being the object, this mistake can be corrected in
the (n+1)-th and subsequent stages of the classifier.
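Applying an already trained cascade in OpenCV follows exactly this scheme; the sketch below assumes a pre-trained cascade file and input image whose names are illustrative:

# Detection with a trained Haar cascade (file names are illustrative).
import cv2

cascade = cv2.CascadeClassifier('cars.xml')      # hypothetical trained cascade
image = cv2.imread('parking_lot.jpg')            # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale moves the window over the image at several scales;
# each returned rectangle has passed every stage of the cascade.
detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in detections:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)
print('Objects found:', len(detections))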
3.2.4 Training the cascade: The training of the cascade proved to be no easy task. The
first necessary step was to gather the images, then create samples based on them, and
finally start the training process. The OpenCV traincascade utility is an
improvement over its predecessor in several respects, one of them being that
traincascade allows the training process to be multithreaded, which reduces the time it
takes to finish training the classifier. This multithreaded approach is only
applied during the precalculation step, however, so the overall time to train is still
quite significant, resulting in hours, days and even weeks of training time. Since the
training process needs a lot of positive and negative input images, which may not always
be available, one way to circumvent this is to use a tool for creating such positive
images. OpenCV's built-in mode allows creating more positive images by distorting
the original positive image and applying a background image. However, it does not
allow this to be done for multiple images. By using the Perl script createsamples to
apply distortions in batch and the mergevec tool, it is possible to create the necessary
files for each positive input file and then merge the output files together into one input
file that OpenCV can understand. Another important aspect to consider is the number
of positives and negatives. When executing the command to start training, it is
required to enter the number of positive and negative images that will be used.
Special care should be taken with these variables, since the number of positive images
here denotes the number of positive images to be used at each stage of the classifier
training, which means that if one were to specify that all images be used at every stage,
then at some point the training process would end in an error. This is due to the way the
training process is set up: it needs to use many different images at every stage of the
classification, and if all of them were given to the first stage, there would be no images
left over for the second stage, resulting in an error message.
The training can result in many types of unwanted behaviour. The most common of these
is either overtraining or undertraining of the classifier. An undertrained classifier will
most likely output too many false positives, since the training process has not had
time to properly determine which regions actually are positive and which are not. An
output may look similar to image XYZ.
The opposite effect may be observed if too many stages are trained, which could
mean that the classification process determines that even the positive objects in
the picture are actually negative ones, resulting in an empty result set. Fairly
undefined behaviour can occur if the number of input images is too low, since the
training program cannot get enough information about the actual object to be able to
classify it correctly. One of the best results obtained in the course of this work is
depicted in image XYZ. As one can observe, the classifier does detect some vehicles
without any problems, but unfortunately some areas of the pavement and some
parts of grass are also classified as a car. Also, some cars are not detected as
standalone cars.
The time taken to train the classifier to detect at this level is measured in days
and weeks rather than hours. Since the training process is fairly probabilistic, a
lot of work also went into testing the various parameters used in this work, from the
number of input images to subtle changes in the structuring element for the
background removal, and verifying whether the output improved, degraded or
remained unchanged. For the same reason, the author of this work was unfortunately
unable to produce a proper classifier which would give minimal false positives and
maximal true positives.
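The training pipeline described above could be scripted roughly as follows. This is only a sketch: the file names, sample counts, window size and number of stages are illustrative assumptions rather than the parameters actually used in this work:

# Rough sketch of driving the OpenCV cascade-training tools from Python.
# All paths and numbers below are illustrative assumptions.
import subprocess

# Pack the positive samples listed in an annotation file into a .vec file.
subprocess.run([
    'opencv_createsamples',
    '-info', 'positives.txt',     # annotation file listing positive images
    '-vec', 'positives.vec',      # output sample file
    '-num', '1000',               # number of positive samples to pack
    '-w', '24', '-h', '24'        # training window size
], check=True)

# Train the cascade. -numPos is deliberately lower than the total number of
# positives, so that later stages still have fresh samples to consume.
subprocess.run([
    'opencv_traincascade',
    '-data', 'classifier/',       # output directory for the stages and cascade.xml
    '-vec', 'positives.vec',
    '-bg', 'negatives.txt',       # list of negative (background) images
    '-numPos', '900',
    '-numNeg', '2000',
    '-numStages', '15',
    '-w', '24', '-h', '24'
], check=True)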
C. WHAT IS IMAGE CLASSIFICATION?
Image classification takes an image and predicts the object in the image. For example,
when we built a cat-dog classifier, we took images of a cat or a dog and predicted their
class.
What do you do if both a cat and a dog are present in the image? What would our model
predict? To solve this problem we can train a multi-label classifier which will predict
both classes (dog as well as cat). However, we still won't know the location of the cat
or the dog. The problem of identifying the location of an object (given its class) in an
image is called localization. If the object class is not known, we have to not only
determine the location but also predict the class of each object.
Predicting the location of the object along with the class is called object detection.
Figure 1: The difference between classification (left) and object detection (right) is intuitive and straightforward.
For image classification, the entire image is classified with a single label. In the case of object detection, our
neural network localizes (potentially multiple) objects within the image.
This class label is meant to characterize the contents of the entire image, or at least
the most dominant, visible contents of the image.
For example, given the input image in Figure 1 above (left) our CNN has labeled the
image as “beagle”.
One image goes in, and one class label comes out.
Object detection, regardless of whether performed via deep learning or other
computer vision techniques, builds on image classification and seeks
to localize exactly where in the image each object appears. An object detector
therefore produces:
A list of bounding boxes, i.e. the (x, y)-coordinates for each object in the image;
The class label associated with each bounding box;
The probability/confidence score associated with each bounding box and class label.
Figure 2: A non-end-to-end deep learning object detector uses a sliding window (left) + image pyramid (right)
approach combined with classification.
At this point you understand the fundamental difference between image
classification and object detection.
Step 1: Preprocessing
Often an input image is pre-processed to normalize contrast and brightness effects. A
very common preprocessing step is to subtract the mean of the image intensities and
divide by the standard deviation. Sometimes gamma correction produces slightly
better results. When dealing with color images, a color space transformation (e.g.
RGB to LAB color space) may help get better results.
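A minimal sketch of these preprocessing steps with OpenCV and NumPy; the input file name is illustrative, and, as discussed below, which of these steps actually helps must be found by experiment:

# Illustrative preprocessing: mean/std normalization, colour-space conversion, gamma correction.
import cv2
import numpy as np

image = cv2.imread('input.jpg')                          # hypothetical input image (BGR)

# Normalize contrast and brightness: subtract the mean intensity and
# divide by the standard deviation.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32)
normalized = (gray - gray.mean()) / (gray.std() + 1e-8)

# Optional colour-space transformation (BGR to LAB in OpenCV's convention).
lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)

# Optional square-root gamma compression of each channel.
gamma_corrected = np.sqrt(image.astype(np.float32) / 255.0)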
Notice that I am not prescribing what pre-processing steps are good. The reason is
that nobody knows in advance which of these preprocessing steps will produce good
results. You try a few different ones and some might give slightly better results. Here
is a paragraph from Dalal and Triggs:
"We evaluated several input pixel representations including grayscale, RGB and LAB
colour spaces optionally with power law (gamma) equalization. These normalizations
have only a modest effect on performance, perhaps because the subsequent descriptor
normalization achieves similar results. We do use colour information when available.
RGB and LAB colour spaces give comparable results, but restricting to grayscale
reduces performance by 1.5% at 10^-4 FPPW. Square root gamma compression of
each colour channel improves performance at low FPPW (by 1% at 10^-4 FPPW) but
log compression is too strong and worsens it by 2% at 10^-4 FPPW."
As you can see, they did not know in advance what pre-processing to use. They made
reasonable guesses and used trial and error.
HOG is based on the idea that local object appearance can be effectively described by
the distribution (histogram) of edge directions (oriented gradients). The steps for
calculating the HOG descriptor for a 64×128 image are listed below.
1. Gradient calculation: Compute the horizontal gradient g_x and the vertical
gradient g_y, typically by filtering the image with the one-dimensional kernels
[-1, 0, 1] and its transpose.
2. Gradient magnitude: g = sqrt(g_x^2 + g_y^2).
3. Gradient orientation: theta = arctan(g_y / g_x).
4. The calculated gradients are "unsigned", and therefore the orientation is in the
range 0 to 180 degrees.
5. Cells: Divide the image into 8×8 cells.
6. Calculate the histogram of gradients in these 8×8 cells: At each pixel in an 8×8
cell we know the gradient (magnitude and direction), and therefore we have 64
magnitudes and 64 directions, i.e. 128 numbers. A histogram of these gradients
provides a more useful and compact representation. We next convert
these 128 numbers into a 9-bin histogram (i.e. 9 numbers). The bins of the
histogram correspond to gradient directions 0, 20, 40, ..., 160 degrees. Every
pixel votes for either one or two bins in the histogram. If the direction of the
gradient at a pixel is exactly 0, 20, 40, ... or 160 degrees, a vote equal to the
magnitude of the gradient is cast by the pixel into that bin. A pixel where the
direction of the gradient is not exactly 0, 20, 40, ..., 160 degrees splits its vote
among the two nearest bins based on the distance from each bin. For example, a pixel
where the magnitude of the gradient is 2 and the angle is 20 degrees will vote for the
second bin with value 2. On the other hand, a pixel with gradient magnitude 2 and angle
30 degrees will vote 1 for both the second bin (corresponding to angle 20) and the third
bin (corresponding to angle 40).
7. Block normalization : The histogram calculated in the previous step is not
very robust to lighting changes. Multiplying image intensities by a constant factor
scales the histogram bin values as well. To counter these effects we can normalize
the histogram — i.e. think of the histogram as a vector of 9 elements and divide
each element by the magnitude of this vector. In the original HOG paper, this
normalization is not done over the 8×8 cell that produced the histogram, but over
16×16 blocks. The idea is the same, but now instead of a 9 element vector you
have a 36 element vector.
8. Feature vector: In the previous steps we figured out how to calculate a
histogram over an 8×8 cell and then normalize it over a 16×16 block. To
calculate the final feature vector for the entire image, the 16×16 block is
moved in steps of 8 pixels (i.e. 50% overlap with the previous block) and the 36
numbers (corresponding to the 4 histograms in a 16×16 block) calculated at
each step are concatenated to produce the final feature vector. What is the
length of the final vector?
The input image is 64×128 pixels in size, and we are moving 8 pixels at a time.
Therefore, we can make 7 steps in the horizontal direction and 15 steps in the
vertical direction, which adds up to 7 x 15 = 105 steps. At each step we calculate
36 numbers, which makes the length of the final vector 105 x 36 = 3780. This count
is verified in the sketch below.
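The check uses OpenCV's built-in HOG implementation, whose default parameters match the 64×128 window, 16×16 blocks, 8×8 cells and 9 bins described above; the random patch below is only a stand-in for a real image:

# Compute the HOG descriptor of a 64x128 patch and check its length.
import cv2
import numpy as np

hog = cv2.HOGDescriptor()               # defaults: 64x128 window, 16x16 blocks,
                                        # 8x8 stride and cells, 9 bins
patch = np.random.randint(0, 256, (128, 64), dtype=np.uint8)   # dummy grayscale patch

descriptor = hog.compute(patch)
print(descriptor.size)                  # 3780 = 105 block positions x 36 numbers each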
# coding: utf-8

# # Imports

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
import cv2

# ## Env setup

# The TensorFlow Object Detection API utilities must be on the Python path.
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

# # Model preparation

# ## Variables
#
# Any model exported using the `export_inference_graph.py` tool can be loaded here
# simply by changing `PATH_TO_FROZEN_GRAPH` to point to a new .pb file.
#
# By default we use an "SSD with Mobilenet" model here. See the detection model zoo
# (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md)
# for a list of other models that can be run out-of-the-box with varying speeds and accuracies.

# Model name and download location as in the TensorFlow object detection tutorial.
MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Path to frozen detection graph. This is the actual model that is used for the
# object detection.
PATH_TO_FROZEN_GRAPH = MODEL_NAME + '/frozen_inference_graph.pb'

# List of the strings that is used to add the correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

NUM_CLASSES = 90

# ## Download Model

opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
    file_name = os.path.basename(file.name)
    if 'frozen_inference_graph.pb' in file_name:
        tar_file.extract(file, os.getcwd())

# ## Load the frozen TensorFlow graph into memory

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')

# ## Loading the label map

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

# ## Helper code

def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(image.getdata()).reshape(
        (im_height, im_width, 3)).astype(np.uint8)

# # Detection

cap = cv2.VideoCapture(0)   # read frames from the default webcam

with detection_graph.as_default():
    with tf.Session() as sess:
        # Fetch the input and output tensors of the frozen graph by name.
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        tensor_dict = {
            'num_detections': detection_graph.get_tensor_by_name('num_detections:0'),
            'detection_boxes': detection_graph.get_tensor_by_name('detection_boxes:0'),
            'detection_scores': detection_graph.get_tensor_by_name('detection_scores:0'),
            'detection_classes': detection_graph.get_tensor_by_name('detection_classes:0'),
        }

        ret = True
        while ret:
            ret, image_np = cap.read()
            if not ret:
                break

            # Run inference on the current frame.
            output_dict = sess.run(tensor_dict,
                                   feed_dict={image_tensor: np.expand_dims(image_np, 0)})

            # All outputs are float32 numpy arrays, so convert types as appropriate.
            output_dict['num_detections'] = int(output_dict['num_detections'][0])
            output_dict['detection_classes'] = output_dict[
                'detection_classes'][0].astype(np.uint8)
            output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
            output_dict['detection_scores'] = output_dict['detection_scores'][0]
            if 'detection_masks' in output_dict:
                output_dict['detection_masks'] = output_dict['detection_masks'][0]

            # Draw the detected boxes, labels and scores on the frame.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                output_dict['detection_boxes'],
                output_dict['detection_classes'],
                output_dict['detection_scores'],
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)

            cv2.imshow('object detection', image_np)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

cap.release()
cv2.destroyAllWindows()
Conclusion
We have also tried to describe the TensorFlow model used in our project. The code has
been successfully debugged and has accurately detected certain objects, as shown in the
examples. This involves a very high-level form of image classification as well as
detection. Object detection also helps in crowd management and CCTV applications.
References
https://www.cse.iitb.ac.in/~pratikm/projectPages/objectDetection/
https://www.edureka.co/blog/tensorflow-object-detection-tutorial/
https://medium.com/@WuStangDan/step-by-step-tensorflow-object-detection-api-tutorial-part-1-selecting-a-model-a02b6aabe39e
https://www.oreilly.com/ideas/object-detection-with-tensorflow
https://www.slideshare.net/Brodmann17/introduction-to-object-detection
https://pdfs.semanticscholar.org/0f1e/866c3acb8a10f96b432e86f8a61be5eb6799.pdf
https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
https://www.learnopencv.com/tag/object-detection/
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
https://github.com/tensorflow/models/tree/master/research/object_detection
https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_features_meaning/py_features_meaning.html