Image Sequence (or Motion) Analysis
also referred to as:
Dynamic Scene Analysis
Input: A Sequence of Images (or frames)
Output: Motion parameters (3-D)
of the object(s) in motion
and/or
3-D Structure information
of the moving objects
Assumption: The time interval ΔT between any two
successive frames is small.
Typical steps in Dynamic Scene Analysis
• Image filtering and enhancement
• Tracking
• Feature detection
• Establish feature correspondence
• Motion parameter estimation (local and global)
• Structure estimation (optional)
• Predict occurrence in next frame
TRACKING in a Dynamic Scene
Since ΔT is small, the amount of motion/displacement
of objects between two successive frames is also small.
At 25 frames per sec (fps): ΔT = 40 msec.
At 50 frames per sec (fps): ΔT = 20 msec.
Image Motion
Image changes by the difference equation:
f_d(x1, x2, ti, tj) = f(x1, x2, ti) − f(x1, x2, tj) = f(ti) − f(tj) = fi − fj
Accumulated difference image:
fT(X, tn) = f_d(X, tn−1, tn) − fT(X, tn−1); n ≥ 3,
where fT(X, t2) = f_d(X, t2, t1)
Moving edge (or feature) detector:
F_mov_feat(X, t1, t2) = (∂f/∂X) . f_d(X, t1, t2)
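As a concrete sketch of the difference-image idea above (a minimal illustration with synthetic grayscale frames; the noise threshold value is an assumption for the demo):

```python
import numpy as np

def difference_image(f_i, f_j, threshold=15):
    """Difference image f_d between two grayscale frames f(t_i) and f(t_j)."""
    fd = f_i.astype(np.int32) - f_j.astype(np.int32)
    # Threshold |f_d| to suppress sensor noise and keep genuine motion.
    return (np.abs(fd) > threshold).astype(np.uint8)

# Two synthetic 8x8 frames: a bright 2x2 block moves one pixel to the right.
frame1 = np.zeros((8, 8), dtype=np.uint8)
frame2 = np.zeros((8, 8), dtype=np.uint8)
frame1[3:5, 2:4] = 200
frame2[3:5, 3:5] = 200

mask = difference_image(frame2, frame1)
print(mask.sum())  # 4 changed pixels: trailing and leading columns of the block
```

Only the uncovered and newly covered pixels fire; the overlap of the block between the two frames cancels, which is why pure frame differencing finds motion boundaries rather than whole objects.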
Recent methods include background and foreground modeling.
[Figure: two frames F1 and F2; their difference F1 − F2 and abs(F1 − F2); an input video frame, the segmented video frame, and the moving object extracted using an alpha matte.]
Categories of video tracking:
- Region based
- Contour based
- Feature point based
- Template based

Region based:
Color features; SSD using histogram bins for matching;
GMM-based distance; ARMA models;
background modeling and subtraction;
optical flow and SVM-based classification of moving parts.

Contour based:
Use snakes (active contours) to detect and track the object boundary;
model the boundary using B-splines (initialized by the user);
graph cuts are used to segment the object.

Feature point based:
Hessian-Affine, SIFT, SURF etc.;
use cross-correlation or a weighted optimization function
to detect correspondences (motion vectors);
KLT; Gabor wavelets; deformable surface models.

Template based:
Face, hand, feet, torso; object salient parts;
SAD, Hamming distance, SSD, NCC, joint entropy, mutual
information, maximum likelihood;
AAM, ADM, Bayesian object tracking etc.
R. Urtasun, D. Fleet and P. Fua, "3D People Tracking with Gaussian Process
Dynamical Models," IEEE Conference on Computer Vision and Pattern Recognition,
June 2006.
R. Urtasun, D. Fleet and P. Fua, "Temporal Motion Models for Monocular and
Multiview 3-D Human Body Tracking," Computer Vision and Image
Understanding, Vol. 104, No. 2, pp. 157-177, December 2006.
A. Fossati, M. Dimitrijevic, V. Lepetit and P. Fua, "From Canonical Poses to 3-D
Motion Capture using a Single Camera," IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 32, No. 7, pp. 1165-1181, July 2010.
M. Vondrak, L. Sigal and O. C. Jenkins, "Physical Simulation for Probabilistic
Motion Tracking," IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2008.
Two categories of Visual Tracking Algorithms:
Target Representation and Localization; Filtering and Data Association.
A. Some common Target Representation and Localization algorithms:
Blob tracking: Segmentation of the object interior (for example blob detection,
block-based correlation or optical flow)
Kernel-based tracking (Mean-shift tracking): An iterative localization procedure
based on the maximization of a similarity measure (Bhattacharyya coefficient).
Contour tracking: Detection of object boundary (e.g. active contours or
Condensation algorithm)
Visual feature matching: Registration
B. Some common Filtering and Data Association algorithms:
Kalman filter: An optimal recursive Bayesian filter for linear functions and Gaussian
noise.
Particle filter: Useful for sampling the underlying state-space distribution of non-linear
and non-Gaussian processes. (Monte Carlo + Bayesian; Conditional Density Propagation.)
Also see: Match moving; Motion capture; SwisTrack; Occlusion handling
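As an illustration of the filtering category above, here is a minimal 1-D constant-velocity Kalman filter sketch; the matrices F, H, Q, Rm and the noise levels are assumed values chosen for the demo, not taken from any particular tracker:

```python
import numpy as np

# State x = [position, velocity]; constant-velocity model, position measured.
F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (dt = 1)
H = np.array([[1.0, 0.0]])               # measurement matrix
Q = 1e-4 * np.eye(2)                     # process noise covariance (assumed)
Rm = np.array([[0.25]])                  # measurement noise covariance (assumed)

def kalman_step(x, P, z):
    x = F @ x                            # predict state
    P = F @ P @ F.T + Q                  # predict covariance
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + Rm                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y                        # update state
    P = (np.eye(2) - K @ H) @ P          # update covariance
    return x, P

# Track a target moving at 1 unit/frame with noisy position measurements.
rng = np.random.default_rng(0)
x, P = np.zeros(2), np.eye(2)
for t in range(1, 21):
    z = np.array([t + rng.normal(0.0, 0.5)])
    x, P = kalman_step(x, P, z)
print(x)  # estimated [position, velocity], close to [20, 1]
```

In a tracker the "position" would be the target's image coordinates and the measurement would come from a detector; the predict/update cycle is what makes the filter robust to momentary detection noise.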
BLOB Detection – Approaches used:
• Corner detectors
(Harris, Shi & Tomasi, SUSAN, level-curve curvature, FAST etc.)
• Ridge detection, scale-space pyramids
• LoG, DoG, DoH (Determinant of Hessian)
• Hessian-Affine, SIFT (Scale-Invariant Feature Transform)
• SURF (Speeded-Up Robust Features)
• GLOH (Gradient Location and Orientation Histogram)
• LESH (Local Energy based Shape Histogram).
Complexities and issues in tracking:
Need to overcome difficulties that arise from noise, occlusion, clutter,
moving cameras, multiple moving objects and changes in the foreground
objects or in the background environment.
DoH - the scale-normalized determinant of the Hessian, also
referred to as the Monge–Ampère operator, where HL denotes the
Hessian matrix of the scale-space representation L. By detecting the
scale-space maxima/minima of this operator, one obtains another
straightforward differential blob detector with automatic scale
selection, which also responds to saddles.
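A minimal numerical sketch of a (single-scale, un-normalized) DoH response, using central finite differences on a synthetic Gaussian blob; a real detector adds scale normalization and a search over scales:

```python
import numpy as np

def det_hessian(img):
    """Determinant-of-Hessian response via central finite differences."""
    Lxx = np.zeros_like(img)
    Lyy = np.zeros_like(img)
    Lxy = np.zeros_like(img)
    Lxx[:, 1:-1] = img[:, 2:] - 2 * img[:, 1:-1] + img[:, :-2]
    Lyy[1:-1, :] = img[2:, :] - 2 * img[1:-1, :] + img[:-2, :]
    Lxy[1:-1, 1:-1] = (img[2:, 2:] - img[2:, :-2]
                       - img[:-2, 2:] + img[:-2, :-2]) / 4.0
    return Lxx * Lyy - Lxy ** 2

# Synthetic image: a Gaussian blob centred at (10, 10).
y, x = np.mgrid[0:21, 0:21]
img = np.exp(-((x - 10.0) ** 2 + (y - 10.0) ** 2) / (2 * 3.0 ** 2))

resp = det_hessian(img)
peak = np.unravel_index(np.argmax(resp), resp.shape)
print(peak)  # (10, 10): the DoH response peaks at the blob centre
```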
Hessian-Affine:
SURF is based on a set of 2-D Haar wavelets; it implements DoH.
SIFT: Four major steps:
1. Scale-space extrema detection (detect extrema at various scales)
2. Keypoint localization
3. Orientation assignment
4. Keypoint descriptor
Only three steps (1, 3 & 4) are shown below.
Histograms contain 8 bins each, and each descriptor contains an
array of 4 x 4 histograms around the keypoint. This leads to a SIFT feature vector
with 4 x 4 x 8 = 128 elements.
[Figure: e.g. a reduced 2 x 2 x 8 descriptor layout.]
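A toy sketch of the descriptor layout just described: a 4 x 4 grid of 8-bin gradient-orientation histograms over a 16 x 16 patch. Real SIFT also applies Gaussian weighting, trilinear interpolation and clipping, all omitted here for brevity:

```python
import numpy as np

def sift_like_descriptor(patch):
    """Toy SIFT-style descriptor: a 4x4 grid of 8-bin gradient-orientation
    histograms over a 16x16 patch -> 4*4*8 = 128-element vector."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)                 # [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)   # 8 bins
    desc = np.zeros((4, 4, 8))
    for cy in range(4):                        # 4x4 spatial cells of 4x4 pixels
        for cx in range(4):
            cell_b = bins[4*cy:4*cy+4, 4*cx:4*cx+4]
            cell_m = mag[4*cy:4*cy+4, 4*cx:4*cx+4]
            for b in range(8):                 # magnitude-weighted histogram
                desc[cy, cx, b] = cell_m[cell_b == b].sum()
    v = desc.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v               # normalize for illumination

patch = np.add.outer(np.arange(16), np.arange(16)).astype(float)  # diagonal ramp
d = sift_like_descriptor(patch)
print(d.shape)  # (128,)
```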
GLOH (Gradient Location and Orientation Histogram)
Gradient location-orientation histogram (GLOH) is an
extension of the SIFT descriptor designed to increase its
robustness and distinctiveness. The SIFT descriptor is
computed for a log-polar location grid with 3 bins in radial
direction (the radius set to 6, 11 and 15) and 8 in angular
direction, which results in 17 location bins.
Note that the central bin is not divided in angular
directions. The gradient orientations are quantized in 16 bins.
This gives a 272 bin histogram.
The size of this descriptor is reduced with PCA. The
covariance matrix for PCA is estimated on 47000 image
patches collected from various images. The 128 largest
eigenvectors are used for description.
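The bin bookkeeping above can be checked with a few lines:

```python
# Bin bookkeeping for the GLOH descriptor described above.
radial_bins = 3             # radii set to 6, 11 and 15
angular_bins = 8
# The central (innermost) bin is not divided in the angular direction:
location_bins = 1 + (radial_bins - 1) * angular_bins   # = 17
orientation_bins = 16
raw_size = location_bins * orientation_bins            # = 272
pca_size = 128                                         # largest eigenvectors kept
print(location_bins, raw_size, pca_size)  # 17 272 128
```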
Motion Equations:
Models derived from mechanics (Euler's or Newton's equations):
P' = R.P + T
[Figure: a point P(x, y, z) at t1 moves to P'(x', y', z') at t2; image-plane coordinates (X, Y) and (X', Y'); origin o, axes x, y and optical axis z.]
where R = [mij]3x3, T = [δx δy δz]^T, and:
m11 = n1² + (1 − n1²) cos θ
m12 = n1 n2 (1 − cos θ) − n3 sin θ
m13 = n1 n3 (1 − cos θ) + n2 sin θ
m21 = n1 n2 (1 − cos θ) + n3 sin θ
m22 = n2² + (1 − n2²) cos θ
m23 = n2 n3 (1 − cos θ) − n1 sin θ
m31 = n1 n3 (1 − cos θ) − n2 sin θ
m32 = n2 n3 (1 − cos θ) + n1 sin θ
m33 = n3² + (1 − n3²) cos θ
where n1² + n2² + n3² = 1 (rotation by angle θ about the unit axis (n1, n2, n3)).
Observation in the image plane:
U = X' − X; V = Y' − Y.
X = Fx / z; Y = Fy / z.
X' = Fx' / z'; Y' = Fy' / z'.
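The element formulas for R can be coded directly; a quick check on a familiar case (90° rotation about the z-axis) confirms they produce a proper rotation matrix:

```python
import numpy as np

def rotation_from_axis_angle(n, theta):
    """Build R = [m_ij] from the unit axis n = (n1, n2, n3) and angle theta,
    using the element formulas above (Rodrigues' rotation formula)."""
    n1, n2, n3 = n
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [n1*n1 + (1 - n1*n1)*c, n1*n2*(1 - c) - n3*s, n1*n3*(1 - c) + n2*s],
        [n1*n2*(1 - c) + n3*s,  n2*n2 + (1 - n2*n2)*c, n2*n3*(1 - c) - n1*s],
        [n1*n3*(1 - c) - n2*s,  n2*n3*(1 - c) + n1*s,  n3*n3 + (1 - n3*n3)*c],
    ])

# Rotate about the z-axis: reduces to the familiar 2-D rotation matrix.
R = rotation_from_axis_angle((0.0, 0.0, 1.0), np.pi / 2)
print(np.round(R, 6))                    # [[0,-1,0],[1,0,0],[0,0,1]]
print(np.allclose(R @ R.T, np.eye(3)))   # R is orthogonal: True
```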
Mathematically (for any two successive frames), the problem is:
Input: Given (X, Y), (X', Y');
Output: Estimate n1, n2, n3, θ, δx, δy, δz.
First look at the problem of estimating motion parameters
using 3-D knowledge only:
Given only three (3) non-linear equations, you have to
obtain seven (7) parameters.
Need a few more constraints, and maybe assumptions too.
Since ΔT is small, θ must also be small (in radians).
Thus R simplifies (reduces) to:

           | 1      −n3 θ   n2 θ |   | 1     −φ3   φ2 |
R(θ→0) =   | n3 θ    1     −n1 θ | = | φ3     1   −φ1 |
           | −n2 θ   n1 θ    1   |   | −φ2    φ1   1  |

where φ1² + φ2² + φ3² = θ².
Now (problem linearized), given three (3) linear equations,
you have to obtain six (6) parameters.
Evaluate R. - Solution?
Take two point correspondences:
P1' = R.P1 + T;  P2' = R.P2 + T.
Subtracting one from the other (eliminates the translation component):
(P1' − P2') = R.(P1 − P2)

| Δx12' |   | 1     −φ3   φ2 | | Δx12 |                    | φ1 |
| Δy12' | = | φ3     1   −φ1 | | Δy12 | ,  solve for:  Φ = | φ2 |
| Δz12' |   | −φ2    φ1   1  | | Δz12 |                    | φ3 |

where Δx12 = x1 − x2, Δx12' = x1' − x2', and so on for y and z.

Re-arrange to form:  [∇12] Φ = Δ²12,  where

        |  0      Δz12   −Δy12 |               | Δx12' − Δx12 |
∇12 =   | −Δz12   0       Δx12 |   and  Δ²12 = | Δy12' − Δy12 |
        |  Δy12  −Δx12    0    |               | Δz12' − Δz12 |

∇12 is a skew-symmetric matrix.
Interpret: why is it so?
|∇12| = 0. So what to do? Contact a mathematician?

Take two (one pair) more point correspondences:  [∇34] Φ = Δ²34,
with the same three equations written with the rows in cyclic (y, z, x) order:

        | −Δz34   0      Δx34 |               | Δy34' − Δy34 |
∇34 =   |  Δy34  −Δx34   0    |   and  Δ²34 = | Δz34' − Δz34 |
        |  0      Δz34  −Δy34 |               | Δx34' − Δx34 |
Using two pairs (4 points) of correspondences:
[∇12] Φ = Δ²12;  [∇34] Φ = Δ²34.
Adding:
[∇12 + ∇34] Φ = [Δ²12 + Δ²34],  i.e.  [∇1234][Φ] = [Δ²1234],

where (since the rows of ∇34 are cyclically permuted, the sum is no longer skew-symmetric):

          | −Δz34          Δz12          Δx34 − Δy12 |
∇1234 =   |  Δy34 − Δz12  −Δx34          Δx12        |
          |  Δy12          Δz34 − Δx12  −Δy34        |

           | Δx12' − Δx12 + Δy34' − Δy34 |
Δ²1234 =   | Δy12' − Δy12 + Δz34' − Δz34 |
           | Δz12' − Δz12 + Δx34' − Δx34 |
The condition for the existence of a unique solution is based on a
geometrical relationship of the coordinates of the four points in space
at time t1: a unique solution exists iff |∇1234| ≠ 0.
This solution is often used as an initial guess for the final
estimate of the motion parameters.
Find the geometric condition under which: |∇1234| = 0.
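A minimal numerical sketch of the linearized four-point solution: synthetic points, an assumed small rotation Φ and translation (values chosen only for the demo), and a solve of [∇1234][Φ] = [Δ²1234]:

```python
import numpy as np

# Assumed ground truth for the demo: small rotation Phi and a translation.
phi = np.array([0.01, -0.02, 0.015])
trans = np.array([0.5, -0.3, 0.2])
R = np.array([[1.0, -phi[2], phi[1]],
              [phi[2], 1.0, -phi[0]],
              [-phi[1], phi[0], 1.0]])        # linearized R (theta -> 0)

P = np.array([[1.0, 2.0, 3.0],
              [4.0, -1.0, 2.0],
              [0.0, 3.0, -2.0],
              [2.0, 2.0, 5.0]])               # four 3-D points at t1
Pp = P @ R.T + trans                          # the same points at t2

d12, dp12 = P[0] - P[1], Pp[0] - Pp[1]        # translation cancels in differences
d34, dp34 = P[2] - P[3], Pp[2] - Pp[3]

def nabla12(d):
    # Skew-symmetric coefficient matrix, rows in (x, y, z) order.
    return np.array([[0.0, d[2], -d[1]],
                     [-d[2], 0.0, d[0]],
                     [d[1], -d[0], 0.0]])

def nabla34(d):
    # The same three equations, rows taken in cyclic (y, z, x) order.
    return nabla12(d)[[1, 2, 0]]

lhs = nabla12(d12) + nabla34(d34)             # [nabla_1234]
rhs = (dp12 - d12) + (dp34 - d34)[[1, 2, 0]]  # [Delta^2_1234]
phi_est = np.linalg.solve(lhs, rhs)
print(phi_est)  # recovers phi = (0.01, -0.02, 0.015)
```

Because the second pair's rows are cyclically permuted, the summed matrix is not skew-symmetric and the 3 x 3 system becomes solvable.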
OPTICAL FLOW
A point in 3-D space (homogeneous coordinates):
X0 = [k.xo  k.yo  k.zo  k]; k ≠ 0, an arbitrary constant.
Image point: Xi = [w.xi  w.yi  w], where Xi = P.X0.
Assuming normalized focal length, f = 1:
xi = xo / zo;  yi = yo / zo    ....(1)
Assuming a linear motion model (no acceleration) between
successive frames:
xo(t) = xo + u.t;  yo(t) = yo + v.t;  zo(t) = zo + w.t    ....(2)
Combining equations (1) and (2):
xi(t) = (xo + u.t) / (zo + w.t);  yi(t) = (yo + v.t) / (zo + w.t)    ....(3)
Assume in equation (3) that w (= dz/dt) < 0.
In that case, the object (or points on the object) will
appear to come closer to you. Now visualize: where do these
points appear to come out from?
Taking the limit:
Lt (t → −∞) [xi(t), yi(t)] = [u/w, v/w] = e    ....(4)
This point "e" in the image plane is known as the
FOE (Focus of Expansion).
The motion of the object points appears to emanate from this
fixed point in the image plane.
Approaches to calculate the FOE exploit the fact that, for
constant-velocity object motion, all image-plane flow vectors
intersect at the FOE.
Plot the vectors and extrapolate them to obtain the FOE.
[Figure: flow vectors in the image plane, extrapolated back to their common intersection at the FOE.]
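A sketch of least-squares FOE estimation from synthetic radial flow vectors (the velocity and the image points are assumed values for the demo):

```python
import numpy as np

# Assumed translational velocity (u, v, w); the true FOE is e = (u/w, v/w).
u, v, w = 2.0, 1.0, -1.0
foe = np.array([u / w, v / w])                # (-2, -1)

# Synthetic image points; for pure translation every flow vector lies on the
# line joining the point to the FOE (taken exactly radial here).
pts = np.array([[0.0, 0.0], [1.0, 2.0], [-3.0, 1.0], [2.0, -2.0], [4.0, 3.0]])
flow = pts - foe

# Each vector gives one linear constraint on e: dy*ex - dx*ey = dy*x - dx*y.
A = np.stack([flow[:, 1], -flow[:, 0]], axis=1)
b = flow[:, 1] * pts[:, 0] - flow[:, 0] * pts[:, 1]
foe_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print(foe_est)  # recovers the FOE (-2, -1)
```

With noisy real flow vectors the same stacked system is solved in the least-squares sense, which is exactly the "extrapolate the vectors to their intersection" idea.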
The FOE may not exist for all types of motion - say, pure
ROTATION.
Multiple FOEs may exist for multiple object motion and
occlusion.
Depth from Motion
The time-varying distance D(t), in the image plane, is the
distance of an image point from the FOE:
D(t) = |Xi − e| = sqrt( [xi(t) − u/w]² + [yi(t) − v/w]² )
Lt (t → −∞) D(t) = 0
The rate of change of the distance D(t) is: V(t) = d[D(t)]/dt.
Derive this to obtain (home assignment):
V(t) = d[D(t)]/dt = − w.D(t) / zo(t)
This helps to define the TIME-TO-ADJACENCY equation:
TA = D(t) / V(t) = − z(t) / w
assuming z is +ve and w is −ve. D(t) is different for different
pixels. This equation holds for each corresponding object and
image point pair.
Consider two object points, at depths z1(t) and z2(t). Then:
D1(t) / V1(t) = − z1(t) / w;   D2(t) / V2(t) = − z2(t) / w;
z2(t) = z1(t) . [D2(t).V1(t)] / [D1(t).V2(t)]
Di(t) and Vi(t) values for any object point can be obtained
from the image plane, once e (the FOE) is obtained.
Hence it is possible to determine the relative 3-D depth
z2(t)/z1(t) of any two object points solely from the
image-plane motion data.
This is the key idea of the Structure from Motion (SFM)
problem: you are able to extract the shape (structure)
information of the object in motion, up to a certain scale factor,
from a single perspective view only.
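A numerical check of the relative-depth relation, with assumed depths, FOE distances and approach velocity (all values invented for the demo):

```python
# Synthetic setup (assumed numbers): two points at depths z1 = 10, z2 = 20,
# camera-approach velocity w = -1.
w = -1.0
z1, z2 = 10.0, 20.0
D1, D2 = 3.0, 5.0                 # distances from the FOE in the image plane
V1 = -w * D1 / z1                 # V(t) = -w.D(t)/z(t)
V2 = -w * D2 / z2

# Relative depth purely from image-plane measurements:
z2_over_z1 = (D2 * V1) / (D1 * V2)
print(z2_over_z1)  # 2.0, i.e. z2 = 2 * z1 -- matches the true depths
```

Note that w itself cancels: only the image-plane quantities D and V are needed, which is why the depth is recoverable only up to scale.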
Another important idea in optical flow is based on the
Horn–Schunck (1981) equations. A global energy function
is sought to be minimized, whose functional form is:
E = ∬ [ (Ix.u + Iy.v + It)² + α² ( |∇u|² + |∇v|² ) ] dx dy
where (u, v) is the flow field, Ix, Iy, It are the image derivatives,
and α weights the smoothness term.
KLT tracker:
• Lucas, B. D. and Kanade, T., "An Iterative Image Registration Technique
with an Application to Stereo Vision," Proceedings of the Imaging
Understanding Workshop, pp. 121-130, 1981.
• Horn, B. K. P. and Schunck, B. G., "Determining Optical Flow," Artificial
Intelligence, Vol. 17, pp. 185-203, 1981.
Concepts from above are left for self-study.
Motion Analysis using the rigid body assumption
Rigid Body Assumption:
|xi − xj|² = cij, ∀t, ∀(i, j), where the cij are constants.
Motion Equation:
X(t2) = M.X(t1),  where

      | m11  m12  m13  l1 |
M =   | m21  m22  m23  l2 | ,   mij = f(n1, n2, n3, θ),
      | m31  m32  m33  l3 |
      |  0    0    0    1 |

X(ti) = [x(ti)  y(ti)  z(ti)  1]^T
Form the matrix A(ti), using four points, as:
A(ti) = [X1(ti)  X2(ti)  X3(ti)  X4(ti)]
Thus obtain the matrix M, using:  M = A(tj).A(ti)^(−1)
Points must be selected in such a fashion that A(ti) is a
non-singular matrix.
Can you guess when A(ti) will be singular?
After the mij's are obtained:
cos(θ) = (m11 + m22 + m33 − 1) / 2;   sin(θ) = (m32 − m23) / (2.n1);
n1 = sqrt[ (m11 − cos(θ)) / (1 − cos(θ)) ];
n2 = (m21 + m12) / (2.n1.(1 − cos(θ)));
n3 = (m31 + m13) / (2.n1.(1 − cos(θ))).
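A sketch of the four-point recovery of M and then (θ, n1), with an assumed ground-truth motion (axis, angle and translation chosen only for the demo):

```python
import numpy as np

# Assumed ground-truth motion: rotation by theta about the unit axis n,
# plus a translation, packed into the 4x4 homogeneous matrix M.
n1, n2, n3 = 1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0   # n1^2 + n2^2 + n3^2 = 1
theta = 0.4
c, s = np.cos(theta), np.sin(theta)
Rp = np.array([
    [n1*n1 + (1 - n1*n1)*c, n1*n2*(1 - c) - n3*s, n1*n3*(1 - c) + n2*s],
    [n1*n2*(1 - c) + n3*s,  n2*n2 + (1 - n2*n2)*c, n2*n3*(1 - c) - n1*s],
    [n1*n3*(1 - c) - n2*s,  n2*n3*(1 - c) + n1*s,  n3*n3 + (1 - n3*n3)*c]])
M = np.eye(4)
M[:3, :3] = Rp
M[:3, 3] = [0.5, -1.0, 2.0]

# Four non-coplanar points at t1 (homogeneous coordinates, one per column).
A1 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])
A2 = M @ A1                                     # the same points at t2

M_est = A2 @ np.linalg.inv(A1)                  # M = A(t2).A(t1)^(-1)

# Recover theta and n1 from the mij's, as above.
m = M_est[:3, :3]
cos_t = (m[0, 0] + m[1, 1] + m[2, 2] - 1) / 2
n1_est = np.sqrt((m[0, 0] - cos_t) / (1 - cos_t))
sin_t = (m[2, 1] - m[1, 2]) / (2 * n1_est)
theta_est = np.arctan2(sin_t, cos_t)
print(theta_est)  # recovers theta = 0.4
```

If the four chosen points were coplanar, A(t1) would be singular and the inverse would not exist, which is the answer to the question above.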
This is fine in an ideal case. In noisy situations (or even
with numerical errors):
A(tj) = M.A(ti) + Nij
Need to formulate an optimization problem, minimizing a
cost function so that the equations are satisfied in the
least-squares sense:
Xk(tj) = M.Xk(ti), k = 1, 2, ..., K
For example, minimize:
Σ (k=1..K) [ Xk(tj) − M.Xk(ti) ]², along with the rigidity constraint.
Use the linearized solution as your initial estimate.
Heard about SA (simulated annealing) or GA (genetic algorithms)?
Work out the following (2nd method):
Xk(tj) = R.Xk(ti), k = 1, 2, ..., K
where,

      | Rp  T |
R =   | 0   1 | ,    Rp = Rα.Rβ.Rλ

Again, we have 12 unknown elements in R,
which are functions of six unknown parameters.
First obtain the 12 unknown elements, and
then get the six parameters.
Motion Analysis using image-plane coordinates
of the features of the moving object.