Lecture 09: Descriptors
ECE/CSE 576
Linda Shapiro
Review
• Harris Operator: at each pixel
– Compute smoothed Ix², Iy², IxIy
– Construct the Harris matrix
– Use it to compute the response function R
– Threshold and find peaks.
• SIFT Key point finder:
– Perform difference of Gaussians for multiple parameter σ's at multiple octaves (different resolutions).
– Find points that are maxima and minima and call them the key points; record location and scale.
How can we find corresponding points?
How do we describe an image patch?
Invariance vs. discriminability
• Invariance:
– Descriptor shouldn't change even if the image is transformed
• Discriminability:
– Descriptor should be highly distinctive for each point
Invariance
• Most feature descriptors are designed to be
invariant to
– Translation, 2D rotation, scale
• They can usually also handle
– Limited 3D rotations (SIFT works up to about 60
degrees)
– Limited affine transformations (some are fully
affine invariant)
– Limited illumination/contrast changes
Scale Invariant Feature Transform
Basic idea:
• Take 16x16 square window around detected feature
• Compute edge orientation (angle of the gradient minus 90°) for each pixel
• Throw out weak edges (threshold gradient magnitude)
• Create histogram of surviving edge orientations (a sketch follows below)
Adapted from slide by David Lowe
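A minimal NumPy sketch of these steps, assuming a 16x16 grayscale array patch; the magnitude threshold here is illustrative, not a value from the slides:

import numpy as np

def orientation_histogram(patch, mag_thresh=0.1, n_bins=8):
    # Central-difference gradients at interior pixels
    dx = patch[1:-1, 2:] - patch[1:-1, :-2]   # L(x+1,y) - L(x-1,y)
    dy = patch[2:, 1:-1] - patch[:-2, 1:-1]   # L(x,y+1) - L(x,y-1)
    mag = np.sqrt(dx**2 + dy**2)
    theta = np.arctan2(dy, dx)                # gradient angle in [-pi, pi]
                                              # (edge orientation differs by 90°)
    keep = mag > mag_thresh                   # throw out weak edges
    bins = ((theta[keep] + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    return np.bincount(bins, minlength=n_bins)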
SIFT descriptor
Full version
• Divide the 16x16 window into a 4x4 grid of cells
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128-dimensional descriptor
Adapted from slide by David Lowe
Numeric Example
by Yao Lu
[Figure: 3x3 neighborhood of smoothed image values L around pixel (x, y)]
magnitude(x,y) = sqrt( (L(x+1,y) − L(x−1,y))² + (L(x,y+1) − L(x,y−1))² )
θ(x,y) = atan( (L(x,y+1) − L(x,y−1)) / (L(x+1,y) − L(x−1,y)) )
by Yao Lu
[Figure: orientations in each of the 16 pixels of the cell]
The orientations all ended up in two bins: 11 in one bin, 5 in the other (rough count).
Resulting 8-bin histogram: 5 11 0 0 0 0 0 0
SIFT descriptor
Full version
• Start with a 16x16 window (256 pixels)
• Divide the 16x16 window into a 4x4 grid of cells (16 cells)
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor
• Threshold normalize the descriptor: normalize to unit length, clamp each entry at 0.2, then renormalize to unit length (continued in the sketch below)
Adapted from slide by David Lowe
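Continuing the sketch, a simplified version of the full 128-D descriptor; it reuses orientation_histogram from the earlier sketch and omits the Gaussian weighting and soft binning of Lowe's full implementation:

import numpy as np

def sift_descriptor(window):
    # window: 16x16 patch; 4x4 grid of 4x4-pixel cells, 8 bins each
    desc = np.concatenate([
        orientation_histogram(window[r:r+4, c:c+4])
        for r in range(0, 16, 4) for c in range(0, 16, 4)
    ]).astype(float)                          # 16 cells * 8 bins = 128-D
    desc /= np.linalg.norm(desc) + 1e-12      # normalize to unit length
    desc = np.minimum(desc, 0.2)              # clamp entries at 0.2
    desc /= np.linalg.norm(desc) + 1e-12      # renormalize
    return desc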
Properties of SIFT
Extraordinarily robust matching technique
• Can handle changes in viewpoint
– Up to about 30 degrees of out-of-plane rotation
• Can handle significant changes in illumination
– Sometimes even day vs. night
• Fast and efficient—can run in real time
• Various code available
– http://www.cs.ubc.ca/~lowe/keypoints/
Example
Lowe, IJCV 2004
Example: Google Goggles
How do we build a panorama?
• We need to match (align) images
Matching with Features
• Detect feature points in both images
• Find corresponding pairs
• Use these matching pairs to align images - the required mapping is called a homography (see the OpenCV sketch below).
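A hedged OpenCV sketch of this pipeline, assuming two grayscale images img1 and img2 and an OpenCV build that includes SIFT (cv2.SIFT_create):

import cv2
import numpy as np

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors, then estimate the homography robustly with RANSAC
matches = cv2.BFMatcher().match(des1, des2)
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp img1 into img2's frame to align them
aligned = cv2.warpPerspective(img1, H, (img2.shape[1], img2.shape[0]))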
Automatic mosaicing
Recognition of specific objects, scenes
When does the SIFT descriptor fail?
Patches SIFT thought were the same but aren’t:
Other methods: Daisy
Circular gradient binning
[Figure: SIFT's square grid binning vs. Daisy's circular binning]
Picking the Best DAISY, S. Winder, G. Hua, M. Brown, CVPR 2009
Other methods: SURF
For computational efficiency, only compute the gradient histogram with 4 bins:
Other methods: ORB
import numpy as np
import cv2
from matplotlib import pyplot as plt

img = cv2.imread('simple.jpg', 0)            # load as grayscale
# Completing the elided part: detect ORB keypoints and draw them
orb = cv2.ORB_create()                        # ORB detector/descriptor
kp, des = orb.detectAndCompute(img, None)     # keypoints + binary descriptors
img2 = cv2.drawKeypoints(img, kp, None, color=(0, 255, 0))
plt.imshow(img2), plt.show()
Feature distance
How to define the difference between two features f1, f2?
• Simple approach is SSD(f1, f2)
– sum of square differences between entries of the two descriptors
SSD(f1, f2) = Σi (f1i − f2i)²
– But it can give good scores to very ambiguous (bad) matches
[Figure: patch f1 in image I1 matched ambiguously to patch f2 in image I2]
Feature distance in practice
How to define the difference between two features f1, f2?
• Better approach: ratio distance = SSD(f1, f2) / SSD(f1, f2’)
– f2 is best SSD match to f1 in I2
– f2’ is 2nd best SSD match to f1 in I2
– gives large values (~1) for ambiguous matches: the best and second-best SSDs are then nearly equal, so the ratio approaches 1 (see the sketch below)
[Figure: f1 in image I1 with best match f2 and second-best match f2' in image I2]
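A small NumPy sketch of both distances; descriptors are 1-D arrays and descriptors2 is assumed to hold the candidate features in I2:

import numpy as np

def ssd(f1, f2):
    # Sum of squared differences between descriptor entries
    return np.sum((f1 - f2) ** 2)

def ratio_distance(f1, descriptors2):
    # SSD to the best match divided by SSD to the second-best match;
    # ambiguous matches have nearly equal best/second-best, so the ratio ~ 1
    d = np.sort([ssd(f1, f2) for f2 in descriptors2])
    return d[0] / (d[1] + 1e-12)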
Eliminating more bad matches
[Figure: example matches labeled with feature distances 50, 75, and 200; small distances indicate true matches, large ones false matches]
True/false positives
[Figure: the same example matches (distances 50, 75, 200) labeled as true or false matches; thresholding the distance trades off true vs. false positives]
Local Descriptors: Shape Context
[Figure: log-polar histogram bins centered on a point; the descriptor counts the edge points falling in each bin]
Belongie & Malik, ICCV 2001
K. Grauman, B. Leibe
Texture
• The texture features of a patch can be considered a descriptor.
• E.g., the LBP histogram is a texture descriptor for a patch (a sketch follows below).
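A sketch of an LBP histogram descriptor with scikit-image, assuming a grayscale patch; P=8 neighbors and radius R=1 are common choices:

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(patch, P=8, R=1):
    lbp = local_binary_pattern(patch, P, R, method='uniform')
    n_bins = P + 2                          # 'uniform' LBP yields P+2 labels
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins))
    return hist / (hist.sum() + 1e-12)      # normalized histogram descriptor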
Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary [Salton & McGill, 1983]
Bags of features for image classification
1. Extract features
2. Learn "visual vocabulary"
3. Quantize features using visual vocabulary
4. Represent images by frequencies of "visual words"
(a k-means sketch of steps 2-4 follows below)
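A minimal sketch with scikit-learn's k-means, assuming all_descriptors stacks SIFT descriptors from the training images and image_descriptors holds one image's descriptors; the vocabulary size of 100 is illustrative:

import numpy as np
from sklearn.cluster import KMeans

# 2. Learn the visual vocabulary: cluster centers become the visual words
kmeans = KMeans(n_clusters=100, n_init=10).fit(all_descriptors)

# 3.-4. Quantize one image's descriptors and build a word-frequency histogram
words = kmeans.predict(image_descriptors)
bof = np.bincount(words, minlength=100).astype(float)
bof /= bof.sum()   # normalize counts to frequencies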
A possible texture representation
[Figure: texture patches and their histogram representation]
Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
1. Feature extraction
1. Detect patches [Mikolajczyk and Schmid '02] [Matas, Chum, Urban & Pajdla '02] [Sivic & Zisserman '03]
2. Normalize patch
3. Compute SIFT descriptor [Lowe '99]
Slide credit: Josef Sivic
2. Discovering the visual vocabulary
Clustering the descriptors; the cluster centers become the visual vocabulary
Slide credit: Josef Sivic
Viewpoint invariant description (Sivic)
• Two types of viewpoint covariant regions computed
for each frame
• Shape Adapted (SA) Mikolajczyk & Schmid
• Maximally Stable (MSER) Matas et al.
• Detect different kinds of image areas
• Provide complementary representations of the frame
• Computed at twice the originally detected region size to be more discriminating (an MSER sketch follows below)
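For reference, a hedged sketch of MSER detection in OpenCV; plain OpenCV does not ship Mikolajczyk & Schmid's Shape Adapted affine detector, so only the MSER half is shown, and gray is an assumed grayscale frame:

import cv2

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)   # maximally stable extremal regions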
Examples of Harris-Affine Operator
Examples of Maximally Stable Regions
Feature Descriptor
Noise Removal
• Tracking regions over 70 frames (a region must be tracked over at least 3 frames)
Visual Vocabulary for Sivic’s Work
Visual Vocabulary
[Figure: visual words from the Shape-Adapted and Maximally Stable vocabularies]
Sivic’s Experiments on Video Shot Retrieval
• Goal: match scene
locations within closed
world of shots
• Data: 164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location
Experiments - Results
Clustering and vector quantization
• Clustering is a common method for learning a visual
vocabulary or codebook
• Each cluster center produced by k-means becomes a
codevector
• Codebook can be learned on separate training set
• The codebook is used for quantizing features
• A vector quantizer takes a feature vector and maps it to the
index of the nearest code vector in a codebook
• Codebook = visual vocabulary
• Code vector = visual word
[Figure: a feature vector maps to the index of the nearest of the N code vectors in the codebook; a sketch follows below]
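A sketch of the quantizer itself, assuming a codebook array of shape (N, d) and a feature vector of length d:

import numpy as np

def quantize(vector, codebook):
    # Map a feature vector to the index of the nearest code vector
    dists = np.linalg.norm(codebook - vector, axis=1)
    return int(np.argmin(dists))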
Another example visual vocabulary
Fei-Fei et al. 2005
Example codebook
Appearance codebook
Source: B. Leibe
Another codebook
Appearance codebook
Source: B. Leibe
Visual vocabularies: Issues
3. Image representation: histogram of codewords
[Figure: histogram - frequency of each codeword in the image]
Image classification
• Given the bag-of-features representations of images from different classes, learn a classifier using machine learning (a sketch follows below)
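For example, a linear SVM over bag-of-features histograms with scikit-learn; X_train, y_train, and X_test are assumed to hold the histograms and class labels:

from sklearn.svm import LinearSVC

clf = LinearSVC()             # a common baseline classifier for BoF histograms
clf.fit(X_train, y_train)     # learn from bag-of-features representations
predictions = clf.predict(X_test)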
But what about layout?
[Figure: level 0 - a single bag of features over the whole image]
Lazebnik, Schmid & Ponce (CVPR 2006)
Spatial pyramid representation
• Extension of a bag of features
• Locally orderless representation at several levels of resolution (a sketch follows below)
[Figure: grids at level 0, level 1, ...]
Lazebnik, Schmid & Ponce (CVPR 2006)
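A sketch of the pyramid histograms, assuming positions holds each feature's (x, y), words its integer visual-word indices, an image of size W x H, and vocabulary size K; the per-level weighting from Lazebnik et al. is omitted for brevity:

import numpy as np

def spatial_pyramid(positions, words, W, H, K, levels=2):
    # Concatenate visual-word histograms over grids of 1x1, 2x2, 4x4 cells
    hists = []
    for level in range(levels + 1):
        n = 2 ** level
        cell_x = (positions[:, 0] * n // W).astype(int)   # cell column per feature
        cell_y = (positions[:, 1] * n // H).astype(int)   # cell row per feature
        for i in range(n):
            for j in range(n):
                in_cell = (cell_x == i) & (cell_y == j)
                hists.append(np.bincount(words[in_cell], minlength=K))
    return np.concatenate(hists)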
Hands-on Practice
2. Feature Matching
• Brute-Force Matching with BRIEF/ORB Descriptors
• FLANN with BRIEF/ORB Descriptors
• Example:
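A sketch of both matchers on ORB descriptors, assuming grayscale images img1 and img2; the LSH index parameters are typical values from the OpenCV documentation:

import cv2

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force: Hamming distance suits binary BRIEF/ORB descriptors
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# FLANN with an LSH index for binary descriptors (algorithm 6 = FLANN_INDEX_LSH)
index_params = dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1)
flann = cv2.FlannBasedMatcher(index_params, dict(checks=50))
knn_matches = flann.knnMatch(des1, des2, k=2)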