
Patch Descriptors

ECE/CSE 576
Linda Shapiro

1
Review
• Harris Operator: at each pixel
– Compute smoothed Ix², Iy², IxIy
– Construct the Harris matrix
– Use it to compute the response function R
– Threshold and find peaks.
• SIFT keypoint finder:
– Perform difference of Gaussians for multiple parameter values σ, in multiple epochs at different resolutions.
– Find points that are maxima and minima and call them the keypoints; record location and scale.

2
How can we find corresponding points?
How do we describe an image patch?

Patches with similar content should have similar descriptors.


Raw patches as local descriptors

The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector.

But this is very sensitive to even small shifts and rotations.

7
Invariance vs. discriminability
• Invariance:
– Descriptor shouldn’t change even if image is
transformed
• Discriminability:
– Descriptor should be highly unique for each point

8
Invariance
• Most feature descriptors are designed to be
invariant to
– Translation, 2D rotation, scale
• They can usually also handle
– Limited 3D rotations (SIFT works up to about 60
degrees)
– Limited affine transformations (some are fully
affine invariant)
– Limited illumination/contrast changes

9
Scale Invariant Feature Transform
Basic idea:
• Take 16x16 square window around detected feature
• Compute edge orientation (angle of the gradient − 90°) for each pixel
• Throw out weak edges (threshold gradient magnitude)
• Create histogram of surviving edge orientations

10
Adapted from slide by David Lowe
SIFT descriptor
Full version
• Divide the 16x16 window into a 4x4 grid of cells (2x2 case shown below)
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor

11
Adapted from slide by David Lowe
SIFT descriptor
Full version
• Divide the 16x16 window into a 4x4 grid of cells
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor

[Figure: sixteen 8-bin orientation histograms concatenated into the 128-D vector]

12
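A minimal NumPy sketch of this construction (an illustrative assumption-laden version, not Lowe's implementation: it takes precomputed 16x16 arrays of gradient magnitude and orientation, and omits the Gaussian weighting and trilinear interpolation of the real descriptor):

import numpy as np

def sift_like_descriptor(mag, theta):
    # mag, theta: 16x16 arrays of gradient magnitude and orientation (radians)
    hists = []
    for i in range(0, 16, 4):              # 4x4 grid of 4x4-pixel cells
        for j in range(0, 16, 4):
            cell_mag = mag[i:i+4, j:j+4].ravel()
            cell_ori = theta[i:i+4, j:j+4].ravel() % (2 * np.pi)
            # 8 orientation bins per cell, weighted by gradient magnitude
            h, _ = np.histogram(cell_ori, bins=8, range=(0, 2 * np.pi),
                                weights=cell_mag)
            hists.append(h)
    return np.concatenate(hists)           # 16 cells * 8 bins = 128 values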
Numeric Example

0.37 0.79 0.97 0.98

0.08 0.45 0.79 0.97

0.04 0.31 0.73 0.91

0.45 0.75 0.90 0.98

by Yao Lu
13
[Figure: the 3×3 neighborhood of smoothed intensities, L(x−1,y−1) … L(x+1,y+1), centered on the pixel (x,y) at which magnitude(x,y) and θ(x,y) are computed]

$\text{magnitude}(x,y) = \sqrt{\bigl(L(x{+}1,y) - L(x{-}1,y)\bigr)^2 + \bigl(L(x,y{+}1) - L(x,y{-}1)\bigr)^2}$

$\theta(x,y) = \arctan\!\left(\dfrac{L(x,y{+}1) - L(x,y{-}1)}{L(x{+}1,y) - L(x{-}1,y)}\right)$

by Yao Lu 14
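These two formulas translate directly to NumPy. A sketch, assuming L is the Gaussian-smoothed image as a 2-D float array (arctan2 is used in place of atan so the quadrant is resolved):

import numpy as np

def gradient_mag_ori(L):
    dx = np.zeros_like(L); dy = np.zeros_like(L)
    dx[:, 1:-1] = L[:, 2:] - L[:, :-2]     # L(x+1,y) - L(x-1,y)
    dy[1:-1, :] = L[2:, :] - L[:-2, :]     # L(x,y+1) - L(x,y-1)
    magnitude = np.sqrt(dx**2 + dy**2)
    theta = np.arctan2(dy, dx)
    return magnitude, theta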
Orientations in each of the 16 pixels of the cell: the orientations all ended up in two bins, 11 in one bin and 5 in the other (rough count):

5 11 0 0 0 0 0 0
15
SIFT descriptor
Full version
• Start with a 16x16 window (256 pixels)
• Divide the 16x16 window into a 4x4 grid of cells (16 cells)
• Compute an orientation histogram for each cell
• 16 cells * 8 orientations = 128 dimensional descriptor
• Threshold-normalize the descriptor: scale it to unit length,

$d_i \leftarrow d_i \big/ \sqrt{\textstyle\sum_j d_j^2}$

such that:

$d_i \le 0.2$

(entries are clamped at 0.2 and the vector is renormalized)

16
Adapted from slide by David Lowe
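As a sketch, the normalize-clamp-renormalize step on a raw 128-D vector d (the 0.2 clamp reduces the influence of a few very large gradient magnitudes, e.g. from illumination edges):

import numpy as np

def threshold_normalize(d, t=0.2):
    d = d / (np.linalg.norm(d) + 1e-12)     # scale to unit length
    d = np.minimum(d, t)                    # clamp entries at t = 0.2
    return d / (np.linalg.norm(d) + 1e-12)  # renormalize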
Properties of SIFT
Extraordinarily robust matching technique
• Can handle changes in viewpoint
– Up to about 30 degrees of out-of-plane rotation
• Can handle significant changes in illumination
– Sometimes even day vs. night (below)
• Fast and efficient—can run in real time
• Various code available
– http://www.cs.ubc.ca/~lowe/keypoints/

17
Example

NASA Mars Rover images with SIFT feature matches

18
Figure by Noah Snavely
Example: Object Recognition

SIFT is extremely powerful for object instance recognition, especially for well-textured objects.

Lowe, IJCV 2004
19
Example: Google Goggles

20
How do we build a panorama?
• We need to match (align) images

21
Matching with Features
•Detect feature points in both images

22
Matching with Features
•Detect feature points in both images
•Find corresponding pairs

23
Matching with Features
•Detect feature points in both images
•Find corresponding pairs
•Use these matching pairs to align images - the required mapping is called a homography (see the sketch below).

24
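A minimal sketch of estimating a homography with OpenCV (the point pairs here are synthetic stand-ins for real matched features; cv2.findHomography with RANSAC also rejects outlier matches):

import numpy as np
import cv2

# Four corresponding point pairs (a real pipeline gets these from
# feature matching); a homography needs at least 4 pairs
src = np.float32([[0, 0], [1, 0], [1, 1], [0, 1]]).reshape(-1, 1, 2)
dst = np.float32([[10, 10], [52, 12], [50, 50], [8, 48]]).reshape(-1, 1, 2)

H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print(H)   # 3x3 matrix mapping image-1 points into image 2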
Automatic mosaicing

25
Recognition of specific objects, scenes

Schmid and Mohr, 1997; Sivic and Zisserman, 2003; Rothganger et al., 2003; Lowe, 2002

26


Kristen Grauman
Example: 3D Reconstructions
• Photosynth (also called Photo Tourism) developed at
UW by Noah Snavely, Steve Seitz, Rick Szeliski and
others
http://www.youtube.com/watch?v=p16frKJLVi0

• Building Rome in a Day, developed at UW by Sameer Agarwal, Noah Snavely, Steve Seitz and others
http://www.youtube.com/watch?v=kxtQqYLRaSQ&feature=player_embedded

27
When does the SIFT descriptor fail?
Patches SIFT thought were the same but aren’t:

28
Other methods: Daisy
Circular gradient binning

SIFT

Daisy

Picking the best DAISY, S. Winder, G. Hua, M. Brown, CVPR 2009
29
Other methods: SURF
For computational efficiency, compute the gradient histogram with only 4 bins:

SURF: Speeded Up Robust Features


Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, ECCV 2006
30
Other methods: BRIEF
Randomly sample pairs of pixels (a, b); store 1 if intensity(a) > intensity(b), else 0. The result is a binary vector like 011000111000...

BRIEF: Binary Robust Independent Elementary Features, M. Calonder, V. Lepetit, C. Strecha, P. Fua, ECCV 2010
31
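A toy sketch of the BRIEF idea (the sampling pattern here is random per seed; the published descriptor smooths the patch first and fixes one learned pattern for all patches):

import numpy as np

def brief_descriptor(patch, n_bits=256, seed=0):
    # Fixed seed => the same sampling pattern is reused for every patch
    rng = np.random.default_rng(seed)
    h, w = patch.shape
    ya, xa = rng.integers(0, h, n_bits), rng.integers(0, w, n_bits)
    yb, xb = rng.integers(0, h, n_bits), rng.integers(0, w, n_bits)
    # 1 if intensity at a > intensity at b, else 0
    return (patch[ya, xa] > patch[yb, xb]).astype(np.uint8)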
Other methods: ORB
ORB (Oriented FAST and Rotated BRIEF)
• It is a good alternative to SIFT and SURF in computation cost and matching performance, and, importantly, in licensing.
• SIFT and SURF are patented and you are supposed to pay for their use. But ORB is not!
• ORB is basically a fusion of the FAST keypoint detector and the BRIEF descriptor, with many modifications to enhance performance.

32
Other methods: ORB
import numpy as np
import cv2
from matplotlib import pyplot as plt

# Read the image as grayscale
img = cv2.imread('simple.jpg', 0)

# Initiate ORB detector (cv2.ORB_create in OpenCV 3+; older versions used cv2.ORB)
orb = cv2.ORB_create()

# find the keypoints with ORB
kp = orb.detect(img, None)

# compute the descriptors with ORB
kp, des = orb.compute(img, kp)

# draw only keypoint locations, not size and orientation
img2 = cv2.drawKeypoints(img, kp, None, color=(0, 255, 0), flags=0)
plt.imshow(img2), plt.show()
33
Descriptors and Matching
• The SIFT descriptor and the various variants are
used to describe an image patch, so that we can
match two image patches.

• In addition to the descriptors, we need a distance measure to calculate how different two patches are.
34
Feature distance
How to define the difference between two features f1, f2?
• Simple approach is SSD(f1, f2)
– sum of square differences between entries of the two descriptors

$\mathrm{SSD}(f_1, f_2) = \sum_i (f_{1i} - f_{2i})^2$
– But it can give good scores to very ambiguous (bad) matches

f1 f2

35
I1 I2
Feature distance in practice
How to define the difference between two features f1, f2?
• Better approach: ratio distance = SSD(f1, f2) / SSD(f1, f2’)
– f2 is best SSD match to f1 in I2
– f2’ is 2nd best SSD match to f1 in I2
– gives large values (~1) for ambiguous matches WHY?

f1 f2' f2

I1 I2

36
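A sketch of the ratio test (descs1 and descs2 are assumed to be arrays of descriptors, one per row; 0.8 is the threshold Lowe suggested):

import numpy as np

def ratio_match(descs1, descs2, thresh=0.8):
    matches = []
    for i, f1 in enumerate(descs1):
        ssd = np.sum((descs2 - f1) ** 2, axis=1)   # SSD to every f2
        best, second = np.argsort(ssd)[:2]
        # ratio ~1 => ambiguous (two near-equal candidates); small => distinctive
        if ssd[best] / (ssd[second] + 1e-12) < thresh:
            matches.append((i, best))
    return matches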
Eliminating more bad matches

[Figure: feature distances for example matches: 50 (true match), 75, and 200 (false match)]

Throw out features with distance > threshold


• How to choose the threshold?

37
True/false positives

[Figure: feature distances for example matches: 50 (true match), 75, and 200 (false match)]

The distance threshold affects performance


• True positives = # of detected matches that are correct
– Suppose we want to maximize these—how to choose threshold?
• False positives = # of detected matches that are incorrect
– Suppose we want to minimize these—how to choose threshold?
38
Other kinds of descriptors

• There are descriptors for other purposes


• Describing shapes
• Describing textures
• Describing features for image classification
• Describing features for a code book

39
Local Descriptors: Shape Context

Count the number of points inside each bin, e.g.:

Count = ?
...
Count = ?

Log-polar binning: more precision for nearby points, more flexibility for farther points.

40
Belongie & Malik, ICCV 2001
K. Grauman, B. Leibe
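A sketch of the log-polar counting for one reference point (pts is an N×2 array of edge points, p the reference point; the bin counts and ranges are illustrative assumptions):

import numpy as np

def shape_context(p, pts, n_r=5, n_theta=12):
    d = pts - p
    r = np.log1p(np.hypot(d[:, 0], d[:, 1]))        # log-radial distance
    t = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)  # angle in [0, 2*pi)
    # count points falling in each log-polar bin
    hist, _, _ = np.histogram2d(r, t, bins=[n_r, n_theta],
                                range=[[0, r.max() + 1e-9], [0, 2 * np.pi]])
    return hist.ravel()                             # n_r * n_theta counts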
Texture
• The texture features of a patch can be considered a
descriptor.
• E.g. the LBP histogram is a texture descriptor for a
patch.

Varma & Zisserman, 2002, 2003
41


Bag-of-words models
• Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

42
Bag-of-words models
• Orderless document representation: frequencies of words
from a dictionary Salton & McGill (1983)

US Presidential Speeches Tag Cloud

43


http://chir.ag/phernalia/preztags/
What is a bag-of-words representation?
• For a text document
• Have a dictionary of non-common words
• Count the occurrence of each word in that document
• Make a histogram of the counts
• Normalize the histogram by dividing each count by
the sum of all the counts
• The histogram is the representation.

apple worm tree dog joint leaf grass bush fence


46
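The same recipe in a few lines of Python (a toy sketch; the dictionary and document are made up):

from collections import Counter

dictionary = ['apple', 'worm', 'tree', 'dog', 'joint',
              'leaf', 'grass', 'bush', 'fence']
doc = "the worm ate the apple under the apple tree".split()

counts = Counter(w for w in doc if w in dictionary)   # occurrences
total = sum(counts.values())
histogram = [counts[w] / total for w in dictionary]   # normalized
print(histogram)   # apple 0.5, worm 0.25, tree 0.25, the rest 0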
Bags of features for image classification
1. Extract features

47
Bags of features for image classification
1. Extract features
2. Learn “visual vocabulary”

48
Bags of features for image classification
1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary

49
Bags of features for image classification

1. Extract features
2. Learn “visual vocabulary”
3. Quantize features using visual vocabulary
4. Represent images by frequencies of
“visual words”

50
A possible texture representation

histogram

Universal texton dictionary

Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

51
1. Feature extraction

• Regular grid: every grid square is a feature


• Vogel & Schiele, 2003
• Fei-Fei & Perona, 2005
• Interest point detector: the
region around each point
• Csurka et al. 2004
• Fei-Fei & Perona, 2005
• Sivic et al. 2005

52
1. Feature extraction
1. Detect patches
[Mikolajczyk and Schmid ’02]
[Matas, Chum, Urban & Pajdla ’02]
[Sivic & Zisserman ’03]
2. Normalize patch
3. Compute SIFT descriptor [Lowe ’99]

53
Slide credit: Josef Sivic
1. Feature extraction

Lots of feature descriptors for the whole image or set of images.

54
2. Discovering the visual vocabulary

feature vector space

What is the dimensionality?

128D for SIFT

55
2. Discovering the visual vocabulary

Clustering

56
Slide credit: Josef Sivic
2. Discovering the visual vocabulary
Visual vocabulary

Clustering

57
Slide credit: Josef Sivic
Viewpoint invariant description (Sivic)
• Two types of viewpoint covariant regions computed
for each frame
• Shape Adapted (SA) Mikolajczyk & Schmid
• Maximally Stable (MSER) Matas et al.
• Detect different kinds of image areas
• Provide complementary representations of a frame
• Computed at twice the originally detected region size to be more discriminating

58
Examples of Harris-Affine Operator

(Shape Adapted Regions)

59
Examples of Maximally Stable Regions

60
Feature Descriptor

• Each region represented by a 128-dimensional vector using the SIFT descriptor

61
Noise Removal
•Tracking regions over 70 frames (must track over at least 3)

62
Visual Vocabulary for Sivic’s Work

• Implementation: K-means clustering

• Regions tracked through contiguous frames and average description computed

• 10% of tracks with highest variance eliminated, leaving about 1000 regions per frame

• Subset of 48 shots (~10%) selected for clustering

• Distance function: Mahalanobis

• 6000 SA clusters and 10000 MS clusters

63
Visual Vocabulary

Shape-Adapted

Maximally Stable

64
Sivic’s Experiments on Video Shot Retrieval
• Goal: match scene locations within a closed world of shots
• Data: 164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location

65
Experiments - Results

Precision = # relevant frames retrieved / total # of frames retrieved

Recall = # correctly retrieved frames / total # of relevant frames

66
More Pictorial Results

67
Clustering and vector quantization
• Clustering is a common method for learning a visual
vocabulary or codebook
• Each cluster center produced by k-means becomes a
codevector
• Codebook can be learned on separate training set
• The codebook is used for quantizing features
• A vector quantizer takes a feature vector and maps it to the
index of the nearest code vector in a codebook
• Codebook = visual vocabulary
• Code vector = visual word
[Figure: a feature vector is mapped to the index (1 … N) of the nearest code vector in the codebook]

68
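A sketch of the whole codebook/quantization loop (assuming scikit-learn is available; random vectors stand in for real 128-D SIFT descriptors):

import numpy as np
from sklearn.cluster import KMeans

train_feats = np.random.rand(2000, 128)   # stand-in training descriptors

# Learn the codebook: each k-means center becomes a code vector
# (50 words keeps the sketch fast; real systems use thousands)
kmeans = KMeans(n_clusters=50, n_init=4, random_state=0).fit(train_feats)

# Vector quantization: index of the nearest code vector for each feature
img_feats = np.random.rand(300, 128)      # stand-in features from one image
words = kmeans.predict(img_feats)

# Bag-of-words representation: normalized histogram of visual words
hist = np.bincount(words, minlength=50).astype(float)
hist /= hist.sum()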
Another example visual vocabulary

69
Fei-Fei et al. 2005
Example codebook


Appearance codebook
70
Source: B. Leibe
Another codebook





Appearance codebook

71
Source: B. Leibe
Visual vocabularies: Issues

• How to choose vocabulary size?


• Too small: visual words not representative of all patches
• Too large: quantization artifacts,
overfitting
• Computational efficiency
• Vocabulary trees
(Nister & Stewenius, 2006)

72
3. Image representation: histogram of codewords
[Figure: histogram with codewords on the x-axis and frequency on the y-axis]
73
Image classification
• Given the bag-of-features representations of images from
different classes, learn a classifier using machine learning

74
But what about layout?

All of these images have the same color histogram

75


Spatial pyramid representation
• Extension of a bag of features
• Locally orderless representation at several levels of resolution

level 0
76
Lazebnik, Schmid & Ponce (CVPR 2006)
Spatial pyramid representation
• Extension of a bag of features
• Locally orderless representation at several levels of resolution

level 0 level 1
77
Lazebnik, Schmid & Ponce (CVPR 2006)
Spatial pyramid representation
• Extension of a bag of features
• Locally orderless representation at several levels of resolution

level 0 level 1 level 2


78
Lazebnik, Schmid & Ponce (CVPR 2006)
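A sketch of the construction (assuming words is an H×W map of visual-word indices for the image and K the vocabulary size; the per-level weights of Lazebnik et al. are omitted):

import numpy as np

def spatial_pyramid(words, K, levels=3):
    H, W = words.shape
    feats = []
    for lvl in range(levels):                  # level 0, 1, 2
        n = 2 ** lvl                           # n x n grid of blocks
        for i in range(n):
            for j in range(n):
                block = words[i*H//n:(i+1)*H//n, j*W//n:(j+1)*W//n]
                h = np.bincount(block.ravel(), minlength=K).astype(float)
                feats.append(h / max(h.sum(), 1.0))  # per-block histogram
    return np.concatenate(feats)               # concatenated histograms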
Finale
• Describing images or image patches is very
important for matching and recognition
• The SIFT descriptor was invented in 1999 and is still
very heavily used.
• Other descriptors are also available, some much
simpler, but less powerful.
• Texture and shape descriptors are also useful.
• Bag-of-words is a handy technique borrowed from
text retrieval. Lots of people use it to compare images
or regions.
• Sivic developed a video frame retrieval system using
this method, called it Video Google.
• The spatial pyramid allows us to describe an image as a whole and over its parts at multiple levels.

79
Practice
1. Draw figures illustrating the detected keypoints and descriptors for the images:
butterfly.jpg, home.jpg, simple.jpg
• Using the methods:
BRIEF and ORB

80
Practice
2. Feature Matching
• Brute-Force Matching with BRIEF/ORB Descriptors
• FLANN with BRIEF/ORB Descriptors

Given images: left.jpg and right.jpg

• Example (a sketch follows):

81
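As a starting point for exercise 2, a hedged sketch of brute-force ORB matching (assuming left.jpg and right.jpg exist; Hamming distance is the appropriate norm for binary descriptors):

import cv2

img1 = cv2.imread('left.jpg', 0)
img2 = cv2.imread('right.jpg', 0)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Hamming distance for binary descriptors; crossCheck keeps 1-to-1 matches
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

# Draw the 20 best matches side by side
img3 = cv2.drawMatches(img1, kp1, img2, kp2, matches[:20], None, flags=2)
cv2.imwrite('matches.jpg', img3)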
