
COMPUTER VISION

RECOGNITION WITH LOCAL FEATURES

Ngo Quoc Viet-2024


CONTENTS
1. Detector and Descriptor for local features
• Invariant Local Features
• Keypoint Detector
• Local Feature Detector and Descriptor
2. Recognition using local features
• Local Feature Matching
• Affine and Homography Estimation
3. Panorama

LECTURE OUTCOMES
• Understanding Detector and Descriptor for
local features.
• SIFT
• Affine and Homography Estimation
• Local Feature Matching
• Implementing panorama project using SIFT
and homography estimation

LOCAL FEATURES
• Obtain a representation that allows us to find a particular object we've
encountered before
• Local features based on the appearance of the object at particular
interest points.
• Features should be reasonably invariant to illumination changes, and
ideally, also to scaling, rotation, and minor changes in viewing direction
• In addition, local features can be used for matching, which is useful for
tracking and 3D scene reconstruction.

LOCAL FEATURES
Key properties of a good local feature:
• Must be highly distinctive: a good feature should allow for correct object
identification with a low probability of mismatch.
• Should be easy to extract.
• Invariance: a good local feature should be tolerant to
• Image noise
• Changes in illumination
• Uniform scaling
• Rotation
• Should be easy to match against a (large) database of local features.

SCALE INVARIANT FEATURE TRANSFORM
• Scale Invariant Feature Transform (SIFT) is an approach for detecting and extracting
local feature descriptors that are reasonably invariant to changes in illumination,
image noise, rotation, scaling, and small changes in viewpoint.
• It detects distinctive key points or features in an image that are robust to changes in
scale, rotation, and affine transformations.
• SIFT works by identifying keypoints based on their local intensity extrema and
computing descriptors that capture the local image information around those
keypoints.
• These descriptors can then be used for tasks like image matching, object
recognition, and image retrieval.
• The major advantage of SIFT features over edge features or HOG features is that
they are not affected by the size or orientation of the image.
David G.Lowe (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision.
SIFT STAGES
• Detection stages for SIFT
features:
• Scale-space extrema
detection.
• Keypoint localization.
• Orientation assignment.
• Generation of keypoint
descriptors.

SCALE-SPACE EXTREMA DETECTION
• A corner may not be a corner if the image is scaled. A
corner in a small image within a small window appears flat
when it is zoomed in within the same window. We cannot use the
same window to detect keypoints at different scales.
That is why we need to build a scale space. (Pic from:
https://docs.opencv.org/4.x/da/df5/tutorial_py_sift_intro.html)
• We need to identify the most distinct features in a given
input image while ignoring any noise, as well as ensure
that the features are not scale-dependent.
• To create a scale space, take the original image and
generate progressively blurred-out images. (Pic from:
https://www.astesj.com/v06/i01/p143/)

SCALE-SPACE EXTREMA DETECTION
• Gaussian kernel is used to create scale space
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$
where
$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$$
• σ is the “scale” parameter. Consider it as the amount of blurring.
• Amount of blurring
• Assume the amount of blur in a particular image is σ. Then, the amount of blur
in the next image will be kσ.
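As an illustration, the sketch below (my own example, not from the slides) builds one set of progressively blurred images with OpenCV; the base σ, the step k, and the number of levels are assumed values.

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

sigma0 = 1.6          # base blur (assumed, in the spirit of Lowe's suggestion)
k = 2 ** (1 / 3)      # multiplicative scale step for 3 intervals per octave
levels = 5            # number of blurred images to generate

# Each level is the original image blurred with sigma0 * k**i
octave = [cv2.GaussianBlur(img, (0, 0), sigmaX=sigma0 * k**i) for i in range(levels)]
```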

9
SCALE-SPACE EXTREMA DETECTION
• Laplacian of Gaussian (LoG = σ²∇²G) filtering is used to build the scale space. LoG acts as a
blob detector which detects blobs of various sizes due to the change in σ.
• In short, σ acts as a scaling parameter. A Gaussian filter with low σ gives a high value for a
small corner, while a Gaussian kernel with high σ fits well for a larger corner.
• But LoG is a little costly, so the SIFT algorithm uses the Difference of Gaussians (DoG),
which is an approximation of LoG.
• The Gaussian filter with scale σ is determined by
$$G_\sigma = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
• The LoG filter with scale σ is determined by
$$LoG_\sigma = -\frac{1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right) e^{-\frac{x^2 + y^2}{2\sigma^2}}$$
• The DoG filter with two different σ₁ and σ₂ (σ₂ > σ₁) is determined by
$$DoG_{\sigma_1,\sigma_2} = \frac{1}{2\pi}\left(\frac{1}{\sigma_1^2} e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{\sigma_2^2} e^{-\frac{x^2 + y^2}{2\sigma_2^2}}\right)$$
• In SIFT, σ₂ = kσ₁.

img = cv2.GaussianBlur(img, (5, 5), sigma)
img = cv2.Laplacian(img, cv2.CV_64F)
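Building on the snippet above, here is a hedged sketch (my addition; the σ, k, and file name are assumptions) that compares the DoG image with the scale-normalized LoG it approximates:

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
sigma, k = 1.6, 2 ** 0.5   # assumed scale and multiplicative step

# Difference of Gaussians between scales k*sigma and sigma
g1 = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
g2 = cv2.GaussianBlur(img, (0, 0), sigmaX=k * sigma)
dog = g2 - g1

# Scale-normalized Laplacian of the sigma-blurred image, times (k - 1)
log_approx = (k - 1) * sigma**2 * cv2.Laplacian(g1, cv2.CV_32F)

# The two images should be roughly similar (discrete Laplacian makes this approximate)
print("mean absolute difference:", np.abs(dog - log_approx).mean())
```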
SCALE-SPACE EXTREMA DETECTION
• Approximation of the Laplacian of Gaussians:
$$\sigma\nabla^2 G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}$$
$$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\,\sigma^2\nabla^2 G$$
$$D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y)$$
$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$
• D is the difference of the Gaussian-blurred images at scales σ and kσ.
For each octave of scale space, the initial image is repeatedly
convolved with Gaussians to produce the set of scale-space
images. There are a fixed number of levels in each octave.
After each octave, the Gaussian image is down-sampled
by a factor of 2, and the process is repeated.
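A rough sketch (my own, with assumed parameters) of building the Gaussian and DoG pyramids described here: each octave holds several blur levels, and the image is downsampled by 2 between octaves.

```python
import cv2
import numpy as np

def build_dog_pyramid(img, num_octaves=4, levels=5, sigma0=1.6):
    """Return per-octave lists of DoG images (differences of adjacent blur levels)."""
    k = 2 ** (1.0 / (levels - 1))   # assumed step so the last level reaches 2 * sigma0
    dog_pyramid = []
    base = img.astype(np.float32)
    for _ in range(num_octaves):
        # Gaussian images for this octave
        gaussians = [cv2.GaussianBlur(base, (0, 0), sigmaX=sigma0 * k**i)
                     for i in range(levels)]
        # DoG images are differences of adjacent Gaussian levels
        dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
        dog_pyramid.append(dogs)
        # Down-sample by a factor of 2 for the next octave
        base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2),
                          interpolation=cv2.INTER_NEAREST)
    return dog_pyramid
```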
SCALE-SPACE EXTREMA DETECTION
• The Laplacian of Gaussian images alone are not great: they are not scale
invariant. See the σ² in the denominator of G(x, y, σ).
• The scale-invariant Laplacian of Gaussian would look like this: σ²∇²G.
• The resultant images after the DoG operation are already multiplied by
the σ² → scale invariant.
• However, we still have the (k − 1) factor in the DoG pyramid.
• But we will just be looking for the locations of the maxima and
minima in the images, so this constant factor does not matter.

SCALE-SPACE EXTREMA DETECTION
• The first step toward the detection of interest points is the convolution of the image
with Gaussian filters at different scales, and the generation of difference of
Gaussian from the difference of adjacent blurred images
• The convolved images are grouped by octave (an octave corresponds to doubling the
value of σ).
• The value of k is selected so that we obtain a fixed number of blurred images
per octave. If k = 1, each octave has only one DoG image.

SCALE-SPACE EXTREMA DETECTION
• The domain of the variable σ is discretized in logarithmic steps arranged
in O octaves.
• Each octave is further subdivided into S sublevels.
• The octave index o and sublevel index s are mapped to the corresponding
scale σ by the formula
$$\sigma(o, s) = \sigma_0 \, 2^{\,o + s/S}, \quad o \in o_{min} + \{0, \dots, O - 1\}, \quad s \in \{0, \dots, S - 1\}$$
• Scale multiplicative step: $k = 2^{1/S}$
• If no σ₀ is given, then σ₀ = 1.6·k
• Scale step factor: $d\sigma_0 = \sigma_0\sqrt{1 - 1/k^2}$
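A tiny worked example (mine) of the scale mapping above, assuming o_min = 0, O = 4, S = 3, and the default σ₀ = 1.6·k:

```python
# Compute the scale sigma(o, s) for every octave/sublevel pair
O, S, o_min = 4, 3, 0          # assumed numbers of octaves and sublevels
k = 2 ** (1.0 / S)             # multiplicative scale step
sigma0 = 1.6 * k               # base scale when none is given

for o in range(o_min, o_min + O):
    for s in range(S):
        sigma = sigma0 * 2 ** (o + s / S)
        print(f"octave {o}, sublevel {s}: sigma = {sigma:.3f}")
```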

SCALE-SPACE EXTREMA DETECTION
• Octaves and Scales
• The number of octaves and scales depends on the size of the original image. The
creator of SIFT suggests that 4 octaves and 5 blur levels are ideal for the
algorithm.
• The first octave
• If the original image is doubled in size and antialiased a bit (by blurring it), then
the algorithm produces about four times more keypoints. The more
keypoints, the better.

KEYPOINT LOCALIZATION
• Keypoint localization by locating DoG
extrema
• Keypoints are maxima (minima) of the DoG that
occur at multiple scales.
• Locate keypoints via DoG extrema across all
scales.
• Choose all extrema within a 3x3x3
neighborhood. To locate the local maxima
and minima, we go through every pixel in
the DoG image and compare it with its 8
neighbors at the same scale and the 9
neighbors in each of the 2 adjacent scales
(26 comparisons) to find minima and
maxima → candidate keypoints.
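A small sketch (my own) of the 26-neighbor extremum test for one interior pixel; the dogs list is assumed to be one octave of DoG images, e.g. from the pyramid sketch earlier.

```python
import numpy as np

def is_scale_space_extremum(dogs, s, y, x):
    """Check whether dogs[s][y, x] is a max or min over its 3x3x3 neighborhood.

    dogs: list of DoG images of one octave; s indexes the middle image.
    """
    value = dogs[s][y, x]
    # Stack the 3x3 patches from the current and the two adjacent DoG images
    cube = np.stack([dogs[s + ds][y - 1:y + 2, x - 1:x + 2] for ds in (-1, 0, 1)])
    return value >= cube.max() or value <= cube.min()
```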

KEYPOINT LOCALIZATION
• Once potential keypoint locations are found, they have to be refined to get more
accurate results.
• Hence, we will eliminate the keypoints that have low contrast or lie very close to the
edge.
• To deal with the low contrast keypoints, a second-order Taylor expansion is
computed for each keypoint. If the resulting value is less than 0.03 (in magnitude),
this keypoint is rejected.
• These are the keypoints that are close to the edge and have a high edge response
but may not be robust to a small amount of noise (edges also need to be removed).
A second-order Hessian matrix is used to identify such keypoints.
• We will now assign an orientation value to each keypoint to make it rotation
invariant.

KEYPOINT LOCALIZATION
• Taylor series expansion of degree two of the scale-space function, D(x, y, σ), shifted so that the
origin is at the candidate keypoint. D represents the Difference of Gaussian.
$$D(\vec{x}) = D + \frac{\partial D}{\partial \vec{x}}^T \vec{x} + \frac{1}{2}\,\vec{x}^T \frac{\partial^2 D}{\partial \vec{x}^2}\,\vec{x}, \qquad \vec{x} = (x, y, \sigma)^T$$
where (x, y, σ)ᵀ is the offset from the candidate keypoint.
• The location of the extrema is determined by taking derivative of this function with respect
to 𝑥Ԧ and setting it to zero (differentiate and set to 0).
• Solving for the offset of the true extremum gives
$$\hat{x} = -\left(\frac{\partial^2 D}{\partial \vec{x}^2}\right)^{-1} \frac{\partial D}{\partial \vec{x}}$$
• To remove the unstable key points, the value of 𝑥ො is calculated and if the function value at 𝑥ො
is below a threshold value then the point is excluded
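A hedged sketch (mine) of this refinement for one candidate: the gradient and Hessian of D with respect to (x, y, σ) are estimated by finite differences and the offset is obtained with a linear solve. The dogs list and indices are assumed to come from the pyramid sketches above.

```python
import numpy as np

def refine_keypoint(dogs, s, y, x):
    """Return the sub-pixel offset (dx, dy, dsigma) and the interpolated contrast D(x_hat)."""
    d = lambda si, yi, xi: float(dogs[si][yi, xi])

    # First derivatives by central differences
    grad = np.array([
        (d(s, y, x + 1) - d(s, y, x - 1)) / 2.0,
        (d(s, y + 1, x) - d(s, y - 1, x)) / 2.0,
        (d(s + 1, y, x) - d(s - 1, y, x)) / 2.0,
    ])

    # Second derivatives (Hessian) by finite differences
    dxx = d(s, y, x + 1) - 2 * d(s, y, x) + d(s, y, x - 1)
    dyy = d(s, y + 1, x) - 2 * d(s, y, x) + d(s, y - 1, x)
    dss = d(s + 1, y, x) - 2 * d(s, y, x) + d(s - 1, y, x)
    dxy = (d(s, y + 1, x + 1) - d(s, y + 1, x - 1) - d(s, y - 1, x + 1) + d(s, y - 1, x - 1)) / 4.0
    dxs = (d(s + 1, y, x + 1) - d(s + 1, y, x - 1) - d(s - 1, y, x + 1) + d(s - 1, y, x - 1)) / 4.0
    dys = (d(s + 1, y + 1, x) - d(s + 1, y - 1, x) - d(s - 1, y + 1, x) + d(s - 1, y - 1, x)) / 4.0
    H = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])

    offset = -np.linalg.solve(H, grad)                  # x_hat = -(H^-1) * gradient
    contrast = d(s, y, x) + 0.5 * grad.dot(offset)      # D(x_hat), used for the contrast test
    return offset, contrast
```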

KEYPOINT LOCALIZATION-CONTRAST THRESHOLD
• Let a be a candidate keypoint and let x⃗ be a neighbor of a. The Taylor expansion of degree two is
determined by
$$D(\vec{x}) = D(a) + D'(a)(\vec{x} - a) + \frac{1}{2} D''(a)(\vec{x} - a)^2$$
• The extremum is the location at which D′(x⃗) = 0. In this case, D′(a) + D″(a)(x⃗ − a) = 0, so
$$\hat{x} = -\frac{D'(a)}{D''(a)} + a$$
• Low-contrast points are filtered using the scale-space value at the candidate keypoint. Substituting x̂
from the formula above into D(x⃗), we have
$$D(\hat{x}) = D + \frac{1}{2}\frac{\partial D}{\partial \vec{x}}^T \hat{x}$$
• Remove keypoints for which |D(x̂)| is smaller than 0.03 (image values in [0, 1]).

KEYPOINT LOCALIZATION-EDGE THRESHOLD
• The idea is to calculate two gradients at the keypoint. Both
perpendicular to each other. Based on the image around the keypoint,
three possibilities exist. The image around the keypoint can be
• A flat region: both gradients will be small
• An edge: one gradient will be big (perpendicular to the edge) and the other will
be small (along the edge)
• A “corner”: both gradients will be big. Corners are great keypoints. If both
gradients are big enough, we let it pass as a key point. Otherwise, it is rejected.
• Mathematically, this is achieved by the Hessian Matrix. More detail:
http://www.aishack.in/2010/04/harris-corner-detector/
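As a concrete illustration (my own sketch; the r = 10 threshold is the value suggested in Lowe's paper), the edge test can be written with the 2x2 Hessian of D in the image plane:

```python
def passes_edge_test(dog, y, x, r=10.0):
    """Reject keypoints whose principal-curvature ratio exceeds r (edge-like points)."""
    d = lambda yi, xi: float(dog[yi, xi])
    dxx = d(y, x + 1) - 2 * d(y, x) + d(y, x - 1)
    dyy = d(y + 1, x) - 2 * d(y, x) + d(y - 1, x)
    dxy = (d(y + 1, x + 1) - d(y + 1, x - 1) - d(y - 1, x + 1) + d(y - 1, x - 1)) / 4.0

    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False                     # curvatures have different signs: not a stable keypoint
    # Keep the point only if tr(H)^2 / det(H) < (r + 1)^2 / r
    return trace * trace / det < (r + 1) ** 2 / r
```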

ORIENTATION ASSIGNMENT
• Use the scale of the keypoint (the scale at the keypoint is already determined)
to choose the correct smoothed image:
$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$
• Compute the gradient magnitude and orientation:
$$m(x, y) = \sqrt{\big(L(x + 1, y) - L(x - 1, y)\big)^2 + \big(L(x, y + 1) - L(x, y - 1)\big)^2}$$
$$\theta(x, y) = \tan^{-1}\frac{L(x, y + 1) - L(x, y - 1)}{L(x + 1, y) - L(x - 1, y)}$$
• Create a gradient histogram (36 bins – 10 degrees
each).
• Weighted by magnitude and a Gaussian window (σ is 1.5
times the scale of the keypoint).
• Calculate the orientation for every point in the Gaussian window,
then create a histogram for them.
To achieve detection which is invariant with
respect to the rotation of the image,
orientation needs to be calculated for the
keypoints. This is done by considering the
neighborhood of the keypoint and
calculating the magnitude and direction of
gradients of the neighborhood. Based on the
values obtained, a histogram is constructed
with 36 bins to represent 360 degrees of
orientation (10 degrees per bin).
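A simplified sketch (mine; the patch radius is an assumption, the 36 bins and 1.5σ Gaussian weighting follow the description above) of building the orientation histogram around one keypoint:

```python
import numpy as np

def orientation_histogram(L, y, x, sigma, radius=8):
    """36-bin gradient-orientation histogram around (y, x) in the smoothed image L."""
    hist = np.zeros(36)
    weight_sigma = 1.5 * sigma
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            magnitude = np.hypot(gx, gy)
            angle = np.degrees(np.arctan2(gy, gx)) % 360.0
            weight = np.exp(-(dx * dx + dy * dy) / (2.0 * weight_sigma ** 2))
            hist[int(angle // 10) % 36] += weight * magnitude
    # The keypoint orientation is the direction of the highest histogram peak
    return hist, np.argmax(hist) * 10.0 + 5.0
```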
KEYPOINT DESCRIPTOR
• Each keypoint has (x, y, σ, m, θ).
• A 16×16 neighborhood of the keypoint is used
for defining the descriptor of that key-point.
This 16×16 neighborhood is divided into sub-
block. Each such sub-block is a non-
overlapping, contiguous, 4×4 neighborhood.
• 4x4 Gradient window (4x4 pixel sub-regions)
• Histogram of 4x4 samples per window in 8
directions
• 4x4x8 = 128-dimensional feature vector (4x4
histograms, 8 bins each). Other configurations, such as 2x2
histograms, can also be used.

SIFT DESCRIPTOR
• In order to obtain orientation invariance, the coordinates of the descriptor
and the gradient orientations are rotated relative to the keypoint
orientation.
• The purpose of “σ equal to 1.5 times the σ of the keypoint scale” is to avoid
sudden changes in the descriptor with small changes in the position of
the window, and to give less emphasis to gradients that are far from the
center of the descriptor.
• Finally, the feature vector is normalized to unit length to reduce the effect of linear changes in
illumination.

SIFT IN OPENCV
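The original slide shows a code screenshot that is not reproduced here; as a substitute, a minimal sketch (mine, using the cv2.SIFT_create API available in OpenCV 4.4 and later) of detecting and drawing SIFT keypoints:

```python
import cv2

img = cv2.imread("box.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(len(keypoints), "keypoints, descriptor matrix shape:", descriptors.shape)

out = cv2.drawKeypoints(gray, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", out)
```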

MATCHING WITH SIFT
• Brute force Matcher with L1 or L2 or kNN to match two SIFT descriptors.
• Using bf = cv2.BFMatcher → bf.match with L1 or L2 distances.
• Using bf.knnMatch with Ratio Test
• FLANN based Matcher: FLANN stands for Fast Library for Approximate
Nearest Neighbors. It contains a collection of algorithms optimized for
fast nearest neighbor search in large datasets and for high dimensional
features. It works faster than BFMatcher for large datasets.
• For the FLANN-based matcher, we need to pass two dictionaries which
specify the algorithm to be used, its related parameters, etc.
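A sketch (mine) of both options described above; the FLANN index parameters shown are values commonly used with SIFT, and the 0.7 ratio is an assumption in the spirit of Lowe's ratio test.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Option 1: brute-force matcher with L2 distance and cross-check
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# Option 2: FLANN-based matcher with the ratio test
index_params = dict(algorithm=1, trees=5)       # 1 = FLANN_INDEX_KDTREE
search_params = dict(checks=50)
flann = cv2.FlannBasedMatcher(index_params, search_params)
good = [m for m, n in flann.knnMatch(des1, des2, k=2) if m.distance < 0.7 * n.distance]

result = cv2.drawMatches(img1, kp1, img2, kp2, good, None,
                         flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imwrite("matches.png", result)
```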

FACE DETECTION WITH SIFT
• Extract SIFT descriptors of
source image (face being
detected) and target (video
frames) as usual.
• Matching two SIFT
descriptors with
cv2.BFMatcher or
cv2.FlannBasedMatcher.
• Check if the number of
matched descriptors is
greater than a threshold.
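Continuing the matcher sketch above, a hedged sketch (mine; MIN_MATCHES = 30 is an arbitrary, assumed threshold) of the per-frame decision:

```python
MIN_MATCHES = 30   # assumed threshold on the number of good matches

def face_present(face_des, frame_des, matcher, ratio=0.7):
    """Return True if enough SIFT matches survive the ratio test."""
    if frame_des is None or len(frame_des) < 2:
        return False
    good = [m for m, n in matcher.knnMatch(face_des, frame_des, k=2)
            if m.distance < ratio * n.distance]
    return len(good) >= MIN_MATCHES
```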

AFFINE TRANSFORMATION ESTIMATION
• Affine estimation, similar to homography estimation, is a concept in computer
vision and image processing.
• Affine transformation estimation is the process of estimating the parameters of an
affine transformation based on point data.
• An affine transformation includes operations such as translation, rotation, scaling,
and shearing.
• Affine transformation estimation is a crucial tool in computer vision. It is used in:
Image Registration (aligning images from different sources or capturing conditions),
Object Recognition (Align and normalize object instances for recognition. When
objects undergo translation, rotation, scaling, or shearing, estimating the affine
transformation allows for invariant representation); Panorama Stitching (creating
panoramic images from multiple photos, each photo needs to be aligned properly to
form a seamless panorama)

AFFINE TRANSFORMATION ESTIMATION
• Prepare Corresponding Points: Identify corresponding points between the source
space and the target space
• Construct Data Matrix: Create a data matrix that holds all the corresponding points
from both spaces. Each row of the matrix represents a pair of corresponding points.
• Set up the system of equations. The affine transformation matrix for a 2D space (the 3D case is
analogous, with a 4x4 matrix) gives
$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$
• Solve the system of equations: use methods like least-squares fitting or
RANSAC to solve the system and estimate the parameters of the affine
transformation. The solution should minimize the difference between the actual and
transformed coordinates. These parameters include values for performing
translation, rotation, scaling, and shearing.

AFFINE TRANSFORMATION ESTIMATION
• Estimating an affine transformation typically involves finding the coefficients
of the transformation matrix that maps points from one coordinate system to
another.
• Establish point correspondences (at least 3 points in the input image and 3
corresponding points in the target image).
• Various techniques can be used to estimate affine transformations, including
direct linear transformation (DLT), least squares methods, and robust
estimation techniques like RANSAC (Random Sample Consensus).

• cv2.estimateAffine2D(srcPts, dstPts, method=cv2.RANSAC, ransacReprojThreshold=5)


ransacReprojThreshold: Maximum allowed reprojection error to treat a point pair as an inlier (used in the RANSAC and
RHO methods only).
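A small sketch (mine) of robust affine estimation from matched keypoints; the variable names (kp1, kp2, good, img1, img2) continue from the matcher sketch above and are assumptions:

```python
import cv2
import numpy as np

# Matched keypoint coordinates, shape (N, 1, 2), with N >= 3
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Robustly estimate the 2x3 affine matrix with RANSAC
A, inliers = cv2.estimateAffine2D(src_pts, dst_pts,
                                  method=cv2.RANSAC, ransacReprojThreshold=5.0)
print("affine matrix:\n", A)
print("inlier ratio:", float(inliers.mean()))

# Warp the source image into the target frame with the estimated affine transform
h, w = img2.shape[:2]
aligned = cv2.warpAffine(img1, A, (w, h))
```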
HOMOGRAPHY ESTIMATION
• Homography estimation is a technique used in computer vision and image processing to find the
relationship between two images of the same scene, but captured from different viewpoints.
• A homography is a transformation that maps points from one plane to another. In the context of
computer vision, it is often used to represent the relationship between two images of the same
scene taken from different viewpoints or under different camera poses.
• A homography matrix (H) is a 3x3 matrix that defines a projective transformation between two
planes. Given a set of corresponding points in the two images, the goal of homography
estimation is to compute the matrix H that best describes the transformation between the
images. This is particularly useful in applications such as image stitching, object recognition,
and augmented reality.
• The homography estimation process involves solving a system of linear equations formed by the
correspondences between points in the two images. Methods like the Direct Linear Transform
(DLT) or RANSAC (Random Sample Consensus) are commonly used for this purpose.

HOMOGRAPHIES INDUCED BY CENTRAL PROJECTION
• Projection along rays through a common point (center
of projection) defines a mapping from one plane to
another.
• Central projection maps points on one plane to points
on another plane.
• Projection also maps lines to lines: consider a plane
through projection center that intersects the two
planes → lines mapped onto lines.
• Central projection may be expressed by
$$s\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$
in which s is a scale factor.

ESTIMATING THE HOMOGRAPHY BETWEEN OVERLAPPING IMAGES

• Establish point correspondences (at least 4 points in the input image and 4 corresponding
points in the target image). The problem is that there are a lot of corresponding points
between the 2 images → how to select the best-fit 4 point pairs? RANSAC (Random Sample
Consensus).
• RANSAC is a technique to estimate the parameters of a model by random sampling of
observed data. In this case, RANSAC samples any 4 pairs of random points, calculates the
matrix H, and calculates the loss between the target points and the input points transformed with
H. The loss is determined by
$$loss = \sum_i distance(H \cdot x_i,\; x_i')$$
in which (x_i, x_i') is a point pair. Repeat the sampling and loss calculation a large
enough number of times, then choose the H with the minimum loss.
• More RANSAC at: https://www.cse.psu.edu/~rtc12/CSE486/lecture15.pdf,
https://www.baeldung.com/cs/ransac.
• cv2.findHomography(srcPts, dstPts, cv2.RANSAC, ransacReprojThreshold=5)
ransacReprojThreshold: Maximum allowed reprojection error to treat a point pair as an inlier (used in the RANSAC and
RHO methods only).
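A sketch (mine) of the call above on matched SIFT points; the variable names continue from the matcher sketch and are assumptions:

```python
import cv2
import numpy as np

src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# H maps points of img1 onto the plane of img2; mask marks the RANSAC inliers
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, ransacReprojThreshold=5.0)
print("homography:\n", H)
print("inliers:", int(mask.sum()), "of", len(good), "matches")
```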
AFFINE ESTIMATION VS. HOMOGRAPHY ESTIMATION
• Parameters: Affine – 6 for 2D or 12 for 3D; Homography – 8 for 2D or 16 for 3D.
• Use cases: Affine transformations are suitable for applications where perspective changes are
negligible, such as image resizing, object recognition, and texture mapping in computer graphics.
Homography is often used when dealing with images taken from different viewpoints or
perspectives; it is suitable for image stitching, panorama creation, or camera calibration.
• Perspective distortions: Affine transformations cannot model perspective distortions accurately,
although they can handle scaling, rotation, and translation. Homography can handle perspective
distortions, which occur when objects are viewed from different angles or distances.
• Matrix size: Affine – 2x3 for 2D and 3x4 for 3D; Homography – 3x3 for 2D and 4x4 for 3D.
• Correspondence points: Affine transformation estimation requires three or more corresponding
points between two images or scenes; homography estimation typically requires four or more.

PANORAMA IMAGE
Based on Image Stitching
1. Detect keypoints and descriptors (SIFT,
Speeded Up Robust Features-SURF,
Oriented FAST and Rotated BRIEF-ORB)
2. Detect a set of matching points that is
present in the images
3. Apply the RANSAC method to improve
the matching process detection
4. Apply perspective transformation on
one image using the other image as a
reference frame
5. Stitch images together.

From: https://pyimagesearch.com/2018/12/17/image-stitching-with-
opencv-and-python/

PANORAMA WITH TWO IMAGES
The pipeline:
1. Input images
2. Feature descriptors (SIFT or ORB)
3. Feature matching (FLANN or BF)
4. Homography estimation with RANSAC
5. Alignment, blending, and color correction
6. Stitching
7. Panorama image
STITCHING TWO IMAGES
• Estimate the homography matrix between the two
images (RANSAC on the SIFT/ORB keypoints of
both).
• Apply a perspective transformation (transform
an image from one projection angle to a
completely different one).
$$s\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = H\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$$
• cv2.findHomography(srcPts, dstPts,
cv2.RANSAC, ransacReprojThreshold=5)
• cv2.warpPerspective(img, H, (maxWid,
maxHei), flags=cv2.INTER_CUBIC)
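Putting the steps together, a rough two-image stitching sketch (mine; it warps img1 into img2's frame and pastes img2 on top, with no blending, and the file names are assumptions):

```python
import cv2
import numpy as np

img1 = cv2.imread("left.jpg")    # image to be warped
img2 = cv2.imread("right.jpg")   # reference image

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = sift.detectAndCompute(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY), None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.7 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp img1 onto a canvas wide enough to hold both images, then paste img2
h2, w2 = img2.shape[:2]
canvas = cv2.warpPerspective(img1, H, (w2 + img1.shape[1], h2))
canvas[0:h2, 0:w2] = img2
cv2.imwrite("panorama.jpg", canvas)
```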
STITCHING MULTIPLE IMAGES
Multiple-image stitching when the images are available in random order:
1. Perform some preprocessing steps, such as resizing and grayscale conversion, to enhance
the results.
2. Compute all pairwise homographies H01, H02, H03, H12, H13, …, where H01 warps I0 into I1.
3. Select one anchor image, e.g. I1, whose position will remain fixed.
4. Find the image that best aligns with I1 based on the maximum number of consistent matches,
e.g. I3.
5. Now we have a new stitched image; we can detect features in it and match them against the
remaining frames, and so on.

Multiple-image stitching when the images are available in left-right order:
1. Perform some preprocessing steps, such as resizing and grayscale conversion, to enhance
the results.
2. Left stitching for the images in the left half of the image list to get the leftPano.
3. Right stitching for the images in the right half of the image list to get the rightPano.
4. Stitch leftPano to rightPano if rightPano's width is greater than leftPano's width, and vice
versa.

Read more: Chapter 8, Image Alignment and Stitching, in Richard Szeliski (2022). Computer Vision: Algorithms and
Applications, 2nd Edition. Springer.
TYPES OF WARPING
• Planar: wherein every image is an
element of a plane surface, subject
to translations and rotations …
(cv2.warpPerspective)
• Cylindrical: wherein every image is
represented as if the coordinate
system were cylindrical, and the
image were plotted on the curved
surface of the cylinder.
• Spherical: same as above, except that
instead of a cylinder, the reference
model is a sphere.

PANORAMA WITH TWO IMAGES USING OPENCV’S STITCHER
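The original slide shows code that is not reproduced here; a minimal sketch (mine, with assumed file names) using OpenCV's high-level Stitcher class would look like this:

```python
import cv2

images = [cv2.imread("left.jpg"), cv2.imread("right.jpg")]

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, pano = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("stitcher_panorama.jpg", pano)
else:
    print("Stitching failed with status", status)
```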

