Computer Vision, Feature Extraction and Deep Learning
by
Ankit Jha
Resources courtesy: Google, Medium Blogs, Courses from IIT Bombay, MIT, etc.
Computer Vision
Make computers understand images and video.
What kind of scene?
Where are the cars?
How far is the building?
…
Vision is really hard
• Vision is an amazing feat of natural intelligence
– Visual cortex occupies about 50% of the Macaque brain
– More of the human brain is devoted to vision than to anything else
Is that a queen or a bishop?
Why computer vision matters
Safety Health Security
Comfort Fun Access
Ridiculously brief history of computer vision
• 1966: Minsky assigns computer vision as an undergrad summer project
• 1960s: interpretation of synthetic worlds (Guzman ‘68)
• 1970s: some progress on interpreting selected images (Ohta & Kanade ‘78)
• 1980s: ANNs come and go; shift toward geometry and increased mathematical rigor
• 1990s: face recognition (Turk and Pentland ‘91); statistical analysis in vogue
• 2000s: broader recognition; large annotated datasets available; video processing starts
How vision is used now
• Examples of state-of-the-art
Some of the following slides by Steve Seitz
Optical character recognition (OCR)
Technology to convert scanned docs to text
• If you have a scanner, it probably came with OCR software
Digit recognition (AT&T Labs) [Link]
License plate readers [Link]
Face detection
• Many new digital cameras now detect faces
– Canon, Sony, Fuji, …
Smile detection
Sony Cyber-shot® T70 Digital Still Camera
3D from thousands of images
Building Rome in a Day: Agarwal et al. 2009
Object recognition (in supermarkets)
LaneHawk by EvolutionRobotics
“A smart camera is flush-mounted in the checkout lane, continuously watching for items. When an item is detected and recognized, the cashier verifies the quantity of items that were found under the basket, and continues to close the transaction. The item can remain under the basket, and with LaneHawk, you are assured to get paid for it…”
Vision-based biometrics
“How the Afghan Girl was Identified by Her Iris Patterns” Read the story
wikipedia
Login without a password…
Face recognition systems now beginning to appear more widely [Link]
Fingerprint scanners on many new laptops and other devices
Object recognition (in mobile phones)
Point & Find, Nokia
Google Goggles
Special effects: shape capture
The Matrix movies, ESC Entertainment, XYZRGB, NRC
Special effects: motion capture
Pirates of the Caribbean, Industrial Light and Magic
Sports
Sportvision first down line
Nice explanation on [Link]
[Link]
Smart cars Slide content courtesy of Amnon Shashua
• Mobileye
– Vision systems currently in high-end BMW, GM,
Volvo models
– By 2010: 70% of car manufacturers.
Google cars
[Link]
Interactive Games: Kinect
• Object Recognition:
[Link]
• Mario: [Link]
• 3D: [Link]
• Robot: [Link]
Vision in space
NASA's Mars Exploration Rover Spirit captured this westward view from atop a low plateau where Spirit spent the closing months of 2007.
Vision systems (JPL) used for several tasks
• Panorama stitching
• 3D terrain modeling
• Obstacle detection, position tracking
• For more, read “Computer Vision on Mars” by Matthies et al.
Industrial robots
Vision-guided robots position nut runners on wheels
Mobile robots
NASA’s Mars Spirit Rover
[Link] [Link]
Saxena et al. 2008
STAIR at Stanford
Medical imaging
Image guided surgery
3D imaging
Grimson et al., MIT
MRI, CT
Deep learning for visual inference
Human Vision / Human Brain
Geometry
Machine Learning
Computer Vision
Deep Learning
Optics /
Cameras
Robotics
This course
Relationship with Other Fields
• Image Processing: Image → Image
Relationship with Other Fields
• Computer Vision: Image → Knowledge
(e.g., “cat”, “deer”)
Relationship with Other Fields
• Computer Graphics: Knowledge → Image
Vertices, Locations, Objects,
Shapes, Colors, Material properties,
Lighting settings, Camera settings, etc.
Visual Recognition?
• What does it mean to “see”?
• “What” is “where”, Marr 1982
• Get computers to “see”
Verification
Is this a car?
Classification:
Is there a car in this picture?
Detection:
Where is the car in this picture?
Pose Estimation:
Activity Recognition:
What is he doing?
Object Categorization:
Sky
Person
Tree
Horse
Car
Person
Bicycle
Road
Segmentation
Sky
Tree
Car
Person
Describing Images with Language
Text-to-Image Synthesis: Text2Scene
Object recognition
Is it really so hard?
Find the chair in this image
Output of normalized correlation
This is a chair
Object recognition
Is it really so hard?
Find the chair in this image
Pretty much garbage
Simple template matching is not going to make it
This is an image to us
This is an image to a computer
Challenges 1: view point variation
Michelangelo 1475-1564 slide by Fei Fei, Fergus & Torralba
Challenges 2: illumination
slide credit: S. Ullman
Challenges 3: occlusion
Magritte, 1957 slide by Fei Fei, Fergus & Torralba
Challenges 4: scale
slide by Fei Fei, Fergus & Torralba
Challenges 5: deformation
slide by Fei Fei, Fergus & Torralba Xu, Beihong 1943
Challenges 6: background clutter
Klimt, 1913 slide by Fei Fei, Fergus & Torralba
Challenges 7: object intra-class variation
slide by Fei-Fei, Fergus & Torralba
Challenges 8: local ambiguity
slide by Fei-Fei, Fergus & Torralba
Challenge 9 – labels are ambiguous
Challenge 10 – different spatial resolution
Challenge 11 – medical images
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Recognition as an alignment problem:
Block world
L. G. Roberts, Machine Perception of Three-Dimensional Solids, Ph.D. thesis, MIT Department of Electrical Engineering, 1963.
J. Mundy, Object Recognition in the Geometric Era: a Retrospective, 2006
Recognition by components
Biederman (1987)
Primitives (geons) Objects
[Link]
Svetlana Lazebnik
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Eigenfaces (Turk & Pentland, 1991)
Svetlana Lazebnik
Color Histograms
Swain and Ballard, Color Indexing, IJCV 1991. Svetlana Lazebnik
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Sliding window approaches
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Local features for object instance
recognition
D. Lowe (1999, 2004)
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Representing people
History of ideas in recognition
• 1960s – early 1990s: the geometric era
• 1990s: appearance-based models
• Mid-1990s: sliding window approaches
• Late 1990s: local features
• Early 2000s: parts-and-shape models
• Mid-2000s: bags of features
• Present trends: combination of local and global methods, data-driven methods, context
Svetlana Lazebnik
Bag-of-words models
Object → Bag of ‘words’
Svetlana Lazebnik
Deep Learning and Vision
• Deep Learning has been a great disruption in the field of Computer Vision and has made many new things work!
• Many deep learning methods are being applied to vision these days.
• This is a deep learning course for visual inference. We will briefly review some pre-deep-learning methods, and then focus mostly on deep learning.
Objectives
▪ Understand foundational concepts for representation learning using neural networks
▪ Become familiar with state-of-the-art models for tasks such as image classification, object detection, image segmentation, scene recognition, etc.
▪ Obtain practical experience in implementing visual recognition models using deep learning.
Resources
• Online lectures (Stanford, MIT, NYU, IITM)
• Deep Learning book by Goodfellow
• [Link]
• Deep Learning Methods and Applications by Deng & Yu
• [Link]
• Papers
Pytorch, Tensorflow, Keras
• Pytorch – zero to GAN
• [Link]
• Keras by deepLizard
• [Link]
• Tensorflow tutorial
• [Link]
Traditional ML vs Deep Learning (Recap)
Image acquisition
An image is a function z = f(x, y) mapping spatial position to intensity.
Electromagnetic Spectrum – characterizing the energy
Human Luminance Sensitivity Function
[Link]
Beyond visible spectra – satellite images
The infrared region is better than the visible region at characterizing the classes
Multi-spectral and hyper-spectral
Spatial resolution
Different rep. of 3D information in images
Video data – appearance + motion (optical flow)
Optical flow - a quick review
Color coding the optical flow
What are Image Features?
An image is mapped to a feature vector, e.g. [0.22, 0.30, 0.13, 0.24, 0.31, 0.15, 0.35, 0.48]. What should the features be?
• A color histogram?
• The maximum color on sub-areas of the image?
• Any statistics on the input image?
• The output of some image processing on the input image?
Why are they useful?
Image → feature vector (e.g. [0.22, 0.30, 0.13, 0.24, 0.31, 0.15, 0.35, 0.48]) → Machine Learning Model → Predictions
As inputs to a machine learning model
Why are they useful?
Two feature vectors, e.g. [0.22, 0.30, 0.13, 0.24, 0.31, 0.15, 0.35, 0.48] and [0.24, 0.34, 0.23, 0.27, 0.63, 0.15, 0.25, 0.48], are compared with a distance function (e.g. Euclidean distance).
To compare images (i.e. retrieve similar images)
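As a concrete sketch of this idea (illustrative, not from the slides): compute a simple per-channel color histogram as the feature vector, then compare two images with Euclidean distance. The bin count and image sizes below are arbitrary choices.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Concatenate per-channel histograms (normalized) into one feature vector."""
    feats = []
    for c in range(image.shape[2]):          # one histogram per channel (R, G, B)
        h, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        feats.append(h / h.sum())            # normalize so image size doesn't matter
    return np.concatenate(feats)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)
img_b = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)

fa, fb = color_histogram(img_a), color_histogram(img_b)
print(len(fa))              # 3 channels x 4 bins = 12-dimensional feature
print(euclidean(fa, fa))    # identical images -> distance 0.0
```

Because the histograms are normalized, images of different sizes map to comparable vectors; retrieval is then just "return the images with smallest distance".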
Feature detection and description
Visual features
• Color histogram
• Edge & boundary feature
• Shape feature Invariant to transformations
• Texture feature - Scale
- Rotation
• Interests points based feature - Translation
- Affine transformation
• Deep features - Illumination change
- Some more
Image Features: Color
Image Features: Color
Color is often not a powerful feature on its own: these are all images of people, but the colors in each image are very different.
Texture feature
But textures are not always easy to quantify
Image filtering: Convolution operator
g(x, y) = Σ_u Σ_v k(u, v) f(x − u, y − v)
where k(x, y) is the filter kernel
Image Credit: [Link]
[Link]
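The convolution sum above can be implemented directly. The following is an illustrative brute-force NumPy sketch ("valid" output only, i.e. positions where the kernel fits entirely), not an optimized library routine:

```python
import numpy as np

def convolve2d(f, k):
    """Direct implementation of g(x,y) = sum_u sum_v k(u,v) f(x-u, y-v).
    'valid' output: only positions where the flipped kernel fits entirely."""
    kh, kw = k.shape
    out = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    kf = k[::-1, ::-1]                     # convolution flips the kernel
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(kf * f[y:y+kh, x:x+kw])
    return out

f = np.arange(16, dtype=float).reshape(4, 4)
mean_k = np.ones((3, 3)) / 9.0             # 3x3 mean filter (symmetric, flip is a no-op)
g = convolve2d(f, mean_k)
print(g)   # each output pixel is the mean of its 3x3 neighbourhood
```

With the mean filter, each output value is the average of the 3×3 neighbourhood, which is exactly the mean-filter slide that follows.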
Image filtering: e.g. Mean Filter
Image filtering: Convolution operator
Important filter: Gaussian filter (Gaussian blur)
k(x, y) =
1/16 1/8 1/16
1/8  1/4 1/8
1/16 1/8 1/16
Important filter: Gaussian
• Weight contributions of neighboring pixels by nearness
0.003 0.013 0.022 0.013 0.003
0.013 0.059 0.097 0.059 0.013
0.022 0.097 0.159 0.097 0.022
0.013 0.059 0.097 0.059 0.013
0.003 0.013 0.022 0.013 0.003
5 × 5 kernel; weights sum to 1
Slide credit: Christopher Rasmussen
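A Gaussian kernel like the 5×5 table above can be built by sampling the 2D Gaussian and normalizing so the weights sum to 1; the size and sigma below are illustrative choices:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sampled 2D Gaussian, normalized so the weights sum to 1."""
    ax = np.arange(size) - size // 2           # e.g. [-2, -1, 0, 1, 2]
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

k = gaussian_kernel(5, sigma=1.0)
print(k.shape)             # (5, 5)
print(round(k.sum(), 6))   # 1.0 -- blurring preserves overall brightness
print(k[2, 2] > k[0, 0])   # True: centre weight is largest, nearer pixels count more
```

Normalizing to unit sum is what makes the blur preserve the average image brightness.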
Image filtering: Convolution operator
e.g. Gaussian filter (Gaussian blur)
Image Credit: [Link]
Practice with linear filters
?
0 0 0
0 1 0
0 0 0
Original
Source: D. Lowe
Practice with linear filters
0 0 0
0 1 0
0 0 0
Original Filtered
(no change)
Source: D. Lowe
Practice with linear filters
?
0 0 0
0 0 1
0 0 0
Original
Source: D. Lowe
Practice with linear filters
0 0 0
0 0 1
0 0 0
Original Shifted left
By 1 pixel
Source: D. Lowe
Practice with linear filters
0 0 0          1 1 1
0 2 0  −  (1/9) × 1 1 1
0 0 0          1 1 1
?
(Note that the filter sums to 1)
Original
Source: D. Lowe
Practice with linear filters
0 0 0          1 1 1
0 2 0  −  (1/9) × 1 1 1
0 0 0          1 1 1
Original → Sharpening filter
- Accentuates differences with the local average
Source: D. Lowe
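One way to see why this sharpens: the filter is 2× the identity minus the local average, i.e. original + (original − local mean). Its weights sum to 1, so constant regions pass through unchanged. A small illustrative check:

```python
import numpy as np

identity = np.zeros((3, 3))
identity[1, 1] = 1.0
box = np.ones((3, 3)) / 9.0          # local average (mean filter)
sharpen = 2 * identity - box         # 2*I - mean = I + (I - mean)

print(round(sharpen.sum(), 6))       # 1.0: overall brightness is preserved
flat = np.full((3, 3), 7.0)          # a constant 3x3 patch
print(np.sum(sharpen * flat))        # 7.0: flat regions are left unchanged
```

Anywhere a pixel differs from its local average, that difference is doubled and added back, which accentuates edges.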
Key properties of linear filters
Linearity:
imfilter(I, f1 + f2) =
imfilter(I,f1) + imfilter(I,f2)
Shift invariance: same behavior regardless of pixel
location
imfilter(I,shift(f)) = shift(imfilter(I,f))
Any linear, shift-invariant operator can be represented
as a convolution
Source: S. Lazebnik
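Both properties can be verified numerically. The sketch below uses 1D circular convolution via the FFT (an illustrative choice: with circular boundaries the identities hold exactly, with no zero-padding edge effects):

```python
import numpy as np

def circ_conv(x, k):
    """Circular convolution via the FFT (kernel zero-padded to len(x))."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, len(x))))

I  = np.array([0., 1., 3., 2., 5., 4.])
f1 = np.array([1., 0., -1.])
f2 = np.array([0.25, 0.5, 0.25])
shift = lambda x: np.roll(x, 1)      # circular shift by one sample

# Linearity: filtering with f1+f2 equals the sum of the two filtered signals.
lin_ok = np.allclose(circ_conv(I, f1 + f2), circ_conv(I, f1) + circ_conv(I, f2))

# Shift invariance: shift-then-filter equals filter-then-shift.
shift_ok = np.allclose(circ_conv(shift(I), f1), shift(circ_conv(I, f1)))

print(lin_ok, shift_ok)   # True True
```

The same checks work in 2D; 1D just keeps the demonstration small.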
Image filtering: Convolution operator
Important Filter: Sobel operator
k(x, y) =
1 0 -1
2 0 -2
1 0 -1
Image Credit: [Link]
1 0 -1
2 0 -2
1 0 -1
Sobel
Vertical Edge
Slide by James Hays (absolute value)
1 2 1
0 0 0
-1 -2 -1
Sobel
Horizontal Edge
Slide by James Hays (absolute value)
Sobel operators are equivalent to 2D partial
derivatives of the image
• Vertical Sobel operator – partial derivative in X (width)
• Horizontal Sobel operator – partial derivative in Y (height)
• Can compute magnitude and phase at each location
• Useful for detecting edges
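A small synthetic check (illustrative): on an image with a single vertical edge, the vertical Sobel responds strongly at the edge while the horizontal Sobel gives zero everywhere:

```python
import numpy as np

sobel_x = np.array([[1, 0, -1],
                    [2, 0, -2],
                    [1, 0, -1]], dtype=float)   # responds to vertical edges
sobel_y = sobel_x.T                              # responds to horizontal edges

def filter2d(f, k):
    """Brute-force cross-correlation, 'valid' region (enough for this demo)."""
    kh, kw = k.shape
    out = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(k * f[y:y+kh, x:x+kw])
    return out

# Synthetic image: dark left half, bright right half -> one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

gx = filter2d(img, sobel_x)
gy = filter2d(img, sobel_y)
mag = np.sqrt(gx**2 + gy**2)           # gradient magnitude at each location

print(np.abs(gy).max())   # 0.0: this image has no horizontal edges
print(mag.max())          # strongest response sits on the vertical edge
```

The phase atan2(gy, gx) would give the edge orientation at each pixel, which is exactly what HoG (later in these slides) histograms.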
Sobel filters are (approximate) partial
derivatives of the image
Let f(x, y) be your input image; then the partial derivative is:
∂f(x, y)/∂x = lim_{h→0} [f(x + h, y) − f(x, y)] / h
Also: ∂f(x, y)/∂x = lim_{h→0} [f(x + h, y) − f(x − h, y)] / (2h)
But digital images are not continuous, they
are discrete
Let 𝑓[𝑥, 𝑦] be your input image, then the partial derivative is:
Δ𝑥 𝑓[𝑥, 𝑦] = 𝑓[𝑥 + 1, 𝑦] − 𝑓[𝑥, 𝑦]
Also: Δ𝑥 𝑓[𝑥, 𝑦] = 𝑓[𝑥 + 1, 𝑦] − 𝑓[𝑥 − 1, 𝑦]
But digital images are not continuous, they
are discrete
Let 𝑓[𝑥, 𝑦] be your input image, then the partial derivative is:
Δx f[x, y] = f[x + 1, y] − f[x, y]  →  k(x, y) = [-1 1]
Also: Δx f[x, y] = f[x + 1, y] − f[x − 1, y]  →  k(x, y) = [-1 0 1]
Sobel Operators Smooth in Y and then
Differentiate in X
k(x, y) = [1, 2, 1]ᵀ × [1, 0, -1] =
1 0 -1
2 0 -2
1 0 -1
Similarly to differentiate in Y
From Wikipedia
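The separability claim is easy to verify: the 3×3 Sobel kernel is the outer product of a 1D smoothing filter (in y) and a 1D derivative filter (in x):

```python
import numpy as np

smooth = np.array([[1], [2], [1]])    # column vector [1, 2, 1]^T: smooth in y
diff   = np.array([[1, 0, -1]])       # row vector [1, 0, -1]: differentiate in x

sobel_x = smooth @ diff               # outer product rebuilds the 3x3 Sobel
print(sobel_x)
# [[ 1  0 -1]
#  [ 2  0 -2]
#  [ 1  0 -1]]
```

Separability matters for speed: a K×K separable filter can be applied as two 1D passes, costing about 2K multiplies per pixel instead of K².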
Gabor filters
Image gradient in a nutshell
Gradient histogram
Image Features: HoG
Paper by Navneet Dalal & Bill Triggs presented at CVPR 2005 for detecting people.
Image Features: HoG
Image Features: HoG
Compute gradients I_x, I_y and the gradient magnitude √(I_x² + I_y²)
Image Features: HoG
We will aggregate
gradient magnitude
and directions
in 8x8 pixel regions
Image Features: HoG
Compute a histogram with 9 bins for angles from 0 to 180 degrees
Image Features: HoG
Normalize histograms
with respect to
histograms of adjacent
neighbors.
Image Features: HoG
Image (or image region)
represented by a vector
containing all the
histograms.
In this case how long is
that vector?
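For the standard Dalal-Triggs person-detection setup (64×128 pixel window, 8×8-pixel cells, 2×2-cell blocks with one-cell stride, 9 orientation bins; assumed here, since the slide does not fix the window size), the length of that vector works out as follows:

```python
# HoG descriptor length under the standard Dalal-Triggs configuration (assumed).
win_w, win_h = 64, 128
cell = 8

cells_x, cells_y = win_w // cell, win_h // cell     # 8 x 16 cells of 8x8 pixels
blocks_x, blocks_y = cells_x - 1, cells_y - 1       # 7 x 15 overlapping 2x2-cell blocks
values_per_block = 2 * 2 * 9                        # 4 cells x 9 orientation bins

length = blocks_x * blocks_y * values_per_block
print(length)   # 3780
```

Note the block overlap: each cell's 9-bin histogram appears in up to four blocks, each time normalized against different neighbors, which is why the count is per block rather than per cell.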
Interest points - motivation
Interest points - motivation
Panorama stitching
Corners
Cornerness
The state-of-the-art interest point detector - SIFT
Won’t be covered in class – please refer to David Lowe’s paper for details
Visual words
Image Features: Bag of (Visual) Words
Representation
Bag of Features (read more in Gabriela Csurka’s ECCV 2004 paper) slide by Fei-fei Li
Extract SIFT
Feature
Descriptors
Intuition: sample a bunch of pieces (“words”) from various parts of the image. Images of the same object are more likely to share similar pieces.
slide by Fei-fei Li
Extract SIFT
Feature
Descriptors
Create Dictionary
of Distinctive SIFT
Features
slide by Fei-fei Li
Extract SIFT
Feature
Descriptors
Compute
Histograms of
Features
slide by Fei-fei Li
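The three steps above (extract descriptors, build a dictionary, compute histograms) can be sketched end-to-end. The descriptors below are random stand-ins for SIFT (real SIFT descriptors are 128-D), and the tiny k-means is illustrative rather than a production clusterer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for SIFT descriptors; each row is one local descriptor.
train_desc = rng.normal(size=(200, 16))

# 1) Build the visual dictionary: cluster training descriptors (tiny k-means).
k = 8
centers = train_desc[rng.choice(len(train_desc), k, replace=False)]
for _ in range(10):
    # assign each descriptor to its nearest "visual word"
    d = np.linalg.norm(train_desc[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    for j in range(k):                       # recompute each cluster centre
        if np.any(labels == j):
            centers[j] = train_desc[labels == j].mean(axis=0)

# 2) Represent a new image: histogram of its descriptors over the dictionary.
def bow_histogram(desc, centers):
    d = np.linalg.norm(desc[:, None] - centers[None], axis=2)
    words = d.argmin(axis=1)                 # quantize each descriptor to a word
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()                 # normalized word frequencies

img_desc = rng.normal(size=(50, 16))         # descriptors from one "image"
h = bow_histogram(img_desc, centers)
print(h.shape)             # (8,) -- fixed length regardless of descriptor count
print(round(h.sum(), 6))   # 1.0
```

The payoff is that every image, however many descriptors it yields, becomes a fixed-length histogram that a standard classifier can consume.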
GIST
For more, see [Link]
Idea of low, mid and high level features
• Edges, points, lines – low level
• Visual words – mid level
• Object features – high level (more semantic meaning)
✓ However, before deep learning we had a different set of techniques for extracting each kind of feature
✓ Those techniques are not jointly optimized
✓ No idea which feature is good for which data/task
But this gives the notion of hierarchical features!
In deep learning
Suggested readings