CS312 Module 4
• Any other application that involves understanding pixels through software can safely
be labeled as computer vision.
Image classification
Given a group of images, the task is to classify each of them into one of a set of
predefined classes, using only a set of sample images that have already been labeled.
Unlike more complex tasks such as object detection and image segmentation, which
must localize (give positions for) the features they detect, image classification
processes the entire image as a whole and assigns a single label to it.
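To make this concrete, here is a minimal sketch of whole-image classification with a pretrained ResNet-18 from torchvision; it assumes torchvision 0.13 or newer and a hypothetical image file named photo.jpg.

import torch
from PIL import Image
from torchvision import models

# Load a pretrained ImageNet classifier and its matching preprocessing.
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()

# The whole image gets one label; nothing is localized.
img = Image.open("photo.jpg").convert("RGB")   # hypothetical input file
batch = preprocess(img).unsqueeze(0)           # shape: (1, 3, H, W)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

top = probs[0].argmax().item()
print(weights.meta["categories"][top], round(probs[0, top].item(), 3))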
Object detection
Object detection, as the name suggests, refers to detection and
localization of objects using bounding boxes.
Object detection looks for class-specific details in an image or a video and
identifies them whenever they appear. These classes can be cars, animals,
humans, or anything on which the detection model has been trained.
Done with classical, hand-engineered pipelines, this process is time-consuming and
largely inaccurate, and it places severe limits on the number of objects that can
be detected.
As such, deep learning models like YOLO, R-CNN, and SSD, which use millions of
parameters to break through these limitations, are popularly employed for
this task.
Object detection is often accompanied by Object Recognition, also known
as Object Classification.
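As a rough illustration, the sketch below runs a pretrained Faster R-CNN detector from torchvision (a deep learning detector in the same family as the models named above); it assumes torchvision 0.13 or newer and a hypothetical image file street.jpg.

import torch
from PIL import Image
from torchvision import models
from torchvision.transforms.functional import to_tensor

# Pretrained detector that returns bounding boxes, class labels, and scores.
weights = models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

img = to_tensor(Image.open("street.jpg").convert("RGB"))   # hypothetical input

with torch.no_grad():
    out = model([img])[0]   # predictions for the single input image

# Keep only confident detections and print their class names and boxes.
for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
    if score > 0.8:
        name = weights.meta["categories"][label.item()]
        print(name, [round(v.item(), 1) for v in box], round(score.item(), 2))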
Image segmentation
Image segmentation is the division of an image into subparts or sub-objects,
demonstrating that the machine can distinguish an object from the background
and/or from other objects in the same image.
A “segment” of the image corresponds to a particular class of object that the
neural network has identified, and it is represented by a pixel mask that can
be used to extract that object.
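As a sketch of how such a pixel mask can be obtained in practice, the example below uses a pretrained DeepLabV3 semantic segmentation model from torchvision; it assumes torchvision 0.13 or newer and a hypothetical image file scene.jpg.

import torch
from PIL import Image
from torchvision import models

weights = models.segmentation.DeepLabV3_ResNet50_Weights.DEFAULT
model = models.segmentation.deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = Image.open("scene.jpg").convert("RGB")   # hypothetical input file
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)["out"]     # shape: (1, num_classes, H, W)

# Per-pixel class indices: each "segment" is the set of pixels sharing one index.
mask = logits.argmax(dim=1)[0]
person_idx = weights.meta["categories"].index("person")
print("pixels labelled person:", (mask == person_idx).sum().item())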
Face and person recognition
• Facial Recognition is a subpart of object detection where the primary object
being detected is the human face.
• While facial recognition resembles object detection in that features are detected
and localized, it goes further: the detected face is not only found but also
recognized, i.e. matched to a specific identity.
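The sketch below covers only the detection half of this pipeline, locating faces with OpenCV's bundled Haar cascade; recognizing whose face it is would require an additional embedding and matching step. It assumes the opencv-python package and a hypothetical image file group.jpg.

import cv2

# Haar cascade shipped with OpenCV for frontal face detection.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group.jpg")                   # hypothetical input file
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is a bounding box (x, y, width, height) around a face.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("group_faces.jpg", img)
print("detected", len(faces), "face(s)")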
ReLU
This futuristic-sounding acronym stands for Rectified Linear Unit, a simple function used to introduce
non-linearity into the feature map. All negative values are simply changed to zero, while positive values
pass through unchanged. The formal function is y = max(0, x).
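A tiny NumPy example of the same function applied element-wise to a toy feature map:

import numpy as np

feature_map = np.array([[-3.0, 1.5],
                        [ 0.0, -0.5]])

# y = max(0, x): negatives become zero, positives are unchanged.
activated = np.maximum(0, feature_map)
print(activated)
# [[0.  1.5]
#  [0.  0. ]]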
Pooling
In pooling, a window of fixed size slides over the feature map, and either the max, sum, or average of
the pixels under that window is taken as the representation of that region. This further reduces the size
of the feature map(s) by a factor equal to the pooling window size; for example, 2x2 max pooling halves
each spatial dimension.
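A small NumPy sketch of 2x2 max pooling on a 4x4 feature map, keeping the largest value in each non-overlapping 2x2 window:

import numpy as np

fm = np.arange(16, dtype=float).reshape(4, 4)

# Group the pixels into non-overlapping 2x2 windows, then take each window's max.
pooled = fm.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(fm.shape, "->", pooled.shape)   # (4, 4) -> (2, 2)
print(pooled)
# [[ 5.  7.]
#  [13. 15.]]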
All of these operations (convolution, ReLU, and pooling) are often applied twice
in a row before feature extraction concludes. The outputs of this whole process
are then passed into a fully connected neural network for classification, giving
an overall architecture of repeated convolution, ReLU, and pooling stages followed
by the classifier.
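A hedged sketch of that architecture in PyTorch, with layer sizes chosen purely for illustration (the input size, channel counts, and 10-class output are assumptions, not values from the module):

import torch
from torch import nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution
    nn.ReLU(),                                    # ReLU
    nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # second round of the same
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # classifier over 10 classes
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(cnn(x).shape)             # torch.Size([1, 10])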
Overall, these challenges highlight the fact that computer vision is a difficult and complex field, and that there is still much work to be done in
order to build machines that can see and understand the world in the same way humans do.