The Three R's of Computer Vision: Jitendra Malik, UC Berkeley


The Three R’s of Computer Vision: Recognition, Reconstruction & Reorganization

Jitendra Malik

UC Berkeley

The Three R’s of Vision
Recognition

Reconstruction Reorganization

Each of the 6 directed arcs in this diagram is a useful direction of information flow.
What we would like to infer…

Will Person B put some money into Person C’s tip bag?

Different aspects of vision
• Perception: study the “laws of seeing”; predict what a human would perceive in an image
• Neuroscience: understand the mechanisms in the retina and the brain
• Function: how laws of optics, and the statistics of the world we live in, make certain interpretations of an image more likely to be valid

The match between human and computer vision is strongest at the level of function, but since the results of computer vision are typically meant to be conveyed to humans, it is useful to be consistent with human perception. Neuroscience is a source of ideas, but being bio-mimetic is not a requirement.
Facts about the Visual World
• The world consists of rigid, or piecewise rigid, objects
• Object surfaces have piecewise constant color and texture
• Objects in a category share parts
• Projection from the 3D world to 2D image is uniquely defined
• Objects are opaque & nearer objects occlude farther objects
• Objects occur at varying distances and locations in the image
• Objects occur in context, in stereotypical relations to each other
• Actions are performed by agents with goals and intentions
• …
We can incorporate these into the design of visual processing architectures. Parameters should be learnt.
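
One of the facts listed above, that projection from the 3D world to the 2D image is uniquely defined, can be illustrated with a minimal sketch. It assumes an ideal pinhole camera with points given in the camera frame; the function name `project_points` and the `focal_length` parameter are illustrative choices, not something taken from the slides.

```python
import numpy as np

def project_points(points_3d, focal_length=1.0):
    """Project N x 3 camera-frame points with an ideal pinhole model (an assumption)."""
    points_3d = np.asarray(points_3d, dtype=float)
    z = points_3d[:, 2]
    if np.any(z <= 0):
        raise ValueError("all points must lie in front of the camera (z > 0)")
    # Perspective division: (X, Y, Z) -> (f * X / Z, f * Y / Z)
    return focal_length * points_3d[:, :2] / z[:, None]

# Two different 3D points on the same viewing ray land on the same pixel.
near = project_points([[1.0, 2.0, 4.0]])
far = project_points([[2.0, 4.0, 8.0]])
```

The forward map is a function, so each 3D point has exactly one image. The inverse is not: `near` and `far` coincide, which is why reconstruction needs extra cues such as a second view or prior knowledge.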

Binocular Stereopsis
Optical flow is a basic cue for all animals
Epipolar geometry for cameras in general position
State of the Art in Reconstruction
• Multiple photographs: Agarwal et al. (2010); Frahm et al. (2010)
• Range sensors: Kinect (PrimeSense); Velodyne Lidar

Semantic segmentation is needed to make this more useful…
Some Pictorial Cues
Shading
Cast Shadows
The Visual Pathway
Hubel and Wiesel (1962) discovered orientation-sensitive neurons in V1
Block Diagram of the Primate Visual System

D. Van Essen Lab



Feed-forward model of the ventral stream

Kravitz et al., Trends in Cognitive Sciences, 2013



Convolutional Neural Networks (LeCun et al.)
Used backpropagation to train the weights in this architecture
• First demonstrated by LeCun et al. for handwritten digit recognition (1989)
• Krizhevsky, Sutskever & Hinton showed effectiveness for full image classification on the ImageNet Challenge (2012)
• Girshick, Donahue, Darrell & Malik (arXiv 2013; CVPR 2014) showed that these features were also effective for object detection
• And many others…
The Three R’s of Vision
Recognition

Superpixel assemblies as candidates

Reconstruction Reorganization
Bottom-up grouping as input to recognition

We produce superpixels of coherent color and texture first, then combine neighboring ones to generate object candidates.
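
The grouping step above can be sketched roughly as follows. This is a deliberately simplified illustration, not the grouping procedure used in the work presented: it assumes a superpixel label map is already available and proposes every single superpixel plus every union of two 4-connected neighbors as a candidate. The names `adjacent_pairs` and `candidate_masks` are hypothetical.

```python
import numpy as np

def adjacent_pairs(labels):
    """Return the set of label pairs (a, b), a < b, that share a 4-connected boundary."""
    pairs = set()
    # Compare each pixel with its right neighbor, then with its lower neighbor.
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        differs = a != b
        for p, q in zip(a[differs].tolist(), b[differs].tolist()):
            pairs.add((min(p, q), max(p, q)))
    return pairs

def candidate_masks(labels):
    """Yield (label_set, mask) for each superpixel and each union of two neighbors."""
    for lab in np.unique(labels).tolist():
        yield frozenset([lab]), labels == lab
    for a, b in sorted(adjacent_pairs(labels)):
        yield frozenset([a, b]), (labels == a) | (labels == b)

# Toy label map with three superpixels (0, 1, 2).
labels = np.array([[0, 0, 1],
                   [0, 2, 1],
                   [2, 2, 1]])
candidates = list(candidate_masks(labels))
```

A practical system would merge neighbors selectively, guided by color and texture similarity, and would continue merging beyond pairs; the sketch only shows the data flow from superpixels to candidate regions.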
R-CNN: Regions with CNN features
Girshick, Donahue, Darrell & Malik (CVPR 2014)

Input image → Extract region proposals (~2k / image) → Compute CNN features → Classify regions (linear SVM)
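
The four stages above can be laid out as a skeleton. This is only a sketch of the data flow under strong assumptions: `toy_features` is a hypothetical stand-in for the CNN (it just mean-centers the pixels), the boxes are hand-written rather than produced by a proposal method, and the weights are random rather than trained. None of these names come from the R-CNN code.

```python
import numpy as np

def crop_and_resize(image, box, size=8):
    """Crop box = (x0, y0, x1, y1) and resample to size x size (nearest neighbor)."""
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, size).round().astype(int)
    return image[np.ix_(ys, xs)]

def toy_features(patch):
    """Hypothetical stand-in for CNN features: the flattened, mean-centered patch."""
    v = patch.astype(float).ravel()
    return v - v.mean()

def score_regions(image, boxes, weights, bias):
    """Score each proposal with one linear classifier, w . f + b."""
    feats = np.stack([toy_features(crop_and_resize(image, b)) for b in boxes])
    return feats @ weights + bias

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16))   # toy grayscale image
boxes = [(0, 0, 8, 8), (4, 4, 16, 16)]        # the slide quotes ~2k proposals per image
scores = score_regions(image, boxes, weights=rng.normal(size=64), bias=0.0)
best_box = boxes[int(np.argmax(scores))]
```

Warping every proposal to a fixed size is what lets a single feature extractor and a single linear classifier handle regions of any shape.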

CNN features are inspired by the architecture of the visual system
PASCAL Visual Object Classes Challenge (Everingham et al.)
State of the Art in Recognition

(Slide from D. Hoiem)


How about the other direction…
Recognition

Reconstruction Reorganization
Recognition Helps Reorganization
Results of Simultaneous Detection and Segmentation
Hariharan, Arbelaez, Girshick & Malik (2014)

We mark the pixels corresponding to an object instance, not just its bounding box.
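
The difference between a pixel-level instance mask and a bounding box can be made concrete with a small sketch. The helpers `mask_to_box` and `mask_iou` are illustrative names, and the masks are toy data; the example only assumes boolean masks of equal shape.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest box (x0, y0, x1, y1), upper bounds exclusive, around the True pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def mask_iou(a, b):
    """Intersection over union of two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

# Two different shapes (an L and its mirror image) share one bounding box.
l_shape = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 1]], dtype=bool)
mirrored = l_shape[:, ::-1]
same_box = mask_to_box(l_shape) == mask_to_box(mirrored)
overlap = mask_iou(l_shape, mirrored)
```

A box can always be derived from a mask, but not the other way round: here the two boxes are identical while the masks overlap only partially, so a pixel-level comparison distinguishes outputs that a box-level one would treat as the same.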
More results
We train classifiers to predict, top-down, the pixels belonging to the object.

[Diagram: original detection; search nearby; regress boxes; segment; score]
Actions and Attributes from Wholes and Parts
G. Gkioxari, R. Girshick & J. Malik
Finding Human Body Joints
Viewpoint Prediction for Objects
Tulsiani & Malik (2014)

The columns show 15th, 30th, 45th, 60th, 75th and 90th percentile instances, respectively, in terms of the error.

Keypoint Prediction for Objects

Visualization of keypoints predicted in the detection setting. We sort the keypoint detections by their prediction score and visualize every 15th detection for ‘Nosetip’ of aeroplanes, ‘Left Headlight’ of cars and ‘Crankcentre’ of bicycles.

The Three R’s of Vision
Recognition

Reconstruction Reorganization

We have explored category-specific 3D reconstruction.

Category-Specific Object Reconstruction
Kar, Tulsiani, Carreira & Malik
Deformable 3D Model Learning

• Viewpoint estimation: NRSfM on keypoint correspondences

• Idea: deform a mesh to satisfy silhouettes from different viewpoints

• Energies: Consistency, Coverage, Smoothness, Keypoint

• Intra-class variation: linear deformation modes
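
The last item above, linear deformation modes, can be read as expressing each instance as a mean shape plus a weighted sum of basis deformations. The sketch below assumes that reading; the array layout, the function name `deform` and the toy "stretch" mode are illustrative and not taken from the slides.

```python
import numpy as np

def deform(mean_shape, modes, alphas):
    """Return mean_shape + sum_k alphas[k] * modes[k].

    mean_shape: (V, 3) vertex positions
    modes:      (K, V, 3) per-vertex displacement for each deformation mode
    alphas:     (K,) instance-specific coefficients
    """
    return mean_shape + np.tensordot(alphas, modes, axes=1)

# Toy mean shape: a unit square in the z = 0 plane.
mean_shape = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [1.0, 1.0, 0.0],
                       [0.0, 1.0, 0.0]])
# One hypothetical mode that stretches the shape along x.
stretch_x = mean_shape * np.array([1.0, 0.0, 0.0])
modes = stretch_x[None, :, :]
instance = deform(mean_shape, modes, np.array([0.5]))
```

With all coefficients at zero the model returns the mean shape; a handful of coefficients per instance is then enough to describe variation within the category.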


Basis Shape Models
Results
The Three R’s of Vision
Recognition

Reconstruction Reorganization

These ideas apply equally well in a video setting.

Images
• Image classification: “Is there a dog in the image?”
• Object detection: “Is there a dog and where is it in the image?”

Video
• Action classification: “Is there a person diving in the video?”
• Action detection: “Is there a person diving and where is it in the video?”
Results on UCF Sports (Gkioxari & Malik, 2014)
Tracking error

Action prediction error


Results on J-HMDB
The Three R’s of Vision
Recognition

Reconstruction Reorganization
Scene Understanding using RGB-D data
Gupta, Girshick, Arbelaez, Malik (ECCV 2014)

Color and Depth Image Pair → Complete 3D Understanding
Overview
• Input: Color and Depth Image Pair
• Re-organization: Contour Detection, Region Proposal Generation
• Recognition: Semantic Segmentation, Object Detection, Instance Segmentation
• Detailed 3D Understanding: Pose Estimation

Instance Segmentation
I hope you enjoy the course!
