
Non rigid structure from motion slides in cv

The lecture discusses Non-Rigid Structure from Motion (NRSfM), focusing on recovering 3D shapes and camera motion from monocular videos of non-rigid objects. It highlights the challenges of non-rigid motion compared to rigid motion, including the use of various priors and optimization techniques to solve the problem. Applications span multiple industries, such as film, sports, and robotics, emphasizing the importance of accurately modeling non-rigid shapes.

Lecture:

Non-Rigid Structure from Motion


-------------------------------------------------------------------------------------------------------------------------------

3D Vision
Universitat Pompeu Fabra
Discussion

Non-Rigid Shapes?

§ Can we obtain non-rigid 3D information from images?


Structure from Motion
(Rigid) Structure from Motion

Given: a monocular video (or a collection of pictures)


We want: to simultaneously recover the 3D shape and the camera motion
Epipolar geometry can be used

The assumption of rigidity is enough to make the problem well-posed


What about non-rigid motion?
Our world is Non Rigid!

No external markers!
One or Many
Why is this important?

The world is non-rigid! There are many everyday applications in many different domains:

§ Movie industry, augmented reality
§ Experimental industry
§ Sport industry: sailing
§ Endoscopy


The movie industry
Even more details
[Figure label: 1200 frames per second]
Animal Reconstruction
[Figure label: 1200 frames per second]
…produces better robots?
Can Epipolar geometry be used?

Considering only one image, we obtain the same 3D constraint


Epipolar geometry can be used

After acquiring a new image, we obtain a similar constraint, but triangulation is no longer available because the shape is non-rigid: it may have deformed between the two images
Non-Rigid Structure from Motion
Given: a monocular video (or a collection of pictures)
We want: to simultaneously recover the 3D shape of a time-varying object (4D estimation) and the camera motion
Some Results
Solving the problem

The problem can be solved by:

• Factorization: a closed-form solution can be achieved by using SVD factorization, enforcing a specific rank (which changes with the camera model or the type of scene). It is hard to enforce additional constraints accurately
• Non-linear optimization: the solution is achieved iteratively. The computational cost can be higher, but additional priors can be enforced accurately

In terms of processing, the problem can be solved:

• Offline: all the frames are processed at once, after video capture
• Online: the frames are processed as the data arrive, frame by frame. This suits more real applications, but can be less accurate
Non-Rigid Structure from Motion

Problem Statement
A Reminder of Camera Models

§ Perspective camera: All rays converge to the optical center


§ Orthographic camera: All rays are parallel. Z-coordinate is
irrelevant in the projection

Perspective camera Orthographic camera


3D-to-2D: Perspective Model

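The p-th 3D point $\mathbf{X}_p = [X_p, Y_p, Z_p]^T$, written in homogeneous coordinates, is related to its 2D projection $\mathbf{x}_p^i = [x_p^i, y_p^i]^T$ (also in homogeneous coordinates) by a matrix $\mathbf{P}_i$ for the i-th image, up to a scale factor $\lambda_p^i$:

$$\lambda_p^i \, \mathbf{x}_p^i = \mathbf{P}_i \, \mathbf{X}_p$$

where $\mathbf{P}_i$ is a 3x4 projection matrix built from the camera intrinsics and the camera pose (rotation and translation) of the i-th image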


3D-to-2D: Orthographic Model

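The p-th 3D point $\mathbf{X}_p = [X_p, Y_p, Z_p]^T$ is related to its 2D projection $\mathbf{x}_p^i = [x_p^i, y_p^i]^T$ by a matrix $\mathbf{R}_i$ for the i-th image:

$$\mathbf{x}_p^i = \mathbf{R}_i \, \mathbf{X}_p + \mathbf{t}_i$$

where $\mathbf{R}_i$ is a 2x3 matrix (the first two rows of a rotation matrix) and $\mathbf{t}_i$ is a 2x1 translation vector

In practice, we remove the translations by centering the observations: with the 3D points centered at the origin, $\mathbf{t}_i$ equals the mean of the $\mathbf{x}_p^i$ over the points. For later computations, we therefore replace $\mathbf{x}_p^i$ by $\mathbf{x}_p^i - \mathbf{t}_i$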
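The orthographic model and the centering step can be checked numerically. The snippet below is a minimal sketch on random toy data; the helper `random_rotation` and all variable names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(rng):
    """Draw a 3x3 rotation matrix (orthonormal, determinant +1) via QR."""
    Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
    Q = Q * np.sign(np.diag(R))       # fix the sign ambiguity of QR
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]            # flip one axis to get det = +1
    return Q

P = 5                                  # number of 3D points (toy value)
X = rng.standard_normal((3, P))        # 3D points, one per column
R_i = random_rotation(rng)[:2, :]      # 2x3 orthographic camera
t_i = np.array([[4.0], [-1.0]])        # 2x1 translation

x = R_i @ X + t_i                      # 2xP projections

# Centering: subtract the mean observation and the mean 3D point
x_centered = x - x.mean(axis=1, keepdims=True)
X_centered = X - X.mean(axis=1, keepdims=True)

# After centering, the translation disappears from the model
print(np.allclose(x_centered, R_i @ X_centered))
```

Subtracting the mean observation removes the translation only together with the mean 3D point, which is why the centered model describes the shape relative to its centroid.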
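Problem Statement
Orthographic camera

For the i-th image:

$$\underbrace{\mathbf{W}_i}_{2 \times P} = \underbrace{\mathbf{R}_i}_{2 \times 3} \, \underbrace{\mathbf{X}_i}_{3 \times P}$$

In the rigid case, the 3D shape $\mathbf{X}_i$ is the same in every image. In the non-rigid case, there is a different 3D configuration per image

Full Linear Relation

Stacking all I images, with $\mathbf{R}$ block-diagonal:

$$\underbrace{\mathbf{W}}_{2I \times P} = \underbrace{\mathbf{R}}_{2I \times 3I} \, \underbrace{\mathbf{X}}_{3I \times P}$$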
Measurement Matrix

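Considering P non-rigid 3D points observed in I RGB images, we can collect all observations to obtain a linear system such as:

Perspective camera:

$$\underbrace{\mathbf{W}}_{3I \times P} = \underbrace{\mathbf{P}}_{3I \times 4I} \, \underbrace{\mathbf{X}}_{4I \times P}$$

Orthographic camera:

$$\underbrace{\mathbf{W}}_{2I \times P} = \underbrace{\mathbf{R}}_{2I \times 3I} \, \underbrace{\mathbf{X}}_{3I \times P}$$

where W is a 3IxP matrix, the projection factor $\mathbf{P}$ is 3Ix4I, and X is 4IxP for the perspective case; and W is a 2IxP matrix, R is 2Ix3I, and X is 3IxP for the orthographic case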
What about the rank of W?

Without further assumptions on the non-rigid shape, the rank of the measurement matrix is bounded only by its size:

§ Perspective camera (W is 3IxP): rank(W) ≤ min(3I, P)
§ Orthographic camera (W is 2IxP): rank(W) ≤ min(2I, P)
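The difference between a rigid scene and an unconstrained non-rigid one shows up directly in the rank of W. The following is a small sketch with random orthographic cameras and random shapes; the sizes and names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
I, P = 10, 40                          # images and points (toy values)

def orthographic_camera(rng):
    """Return a 2x3 matrix with orthonormal rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q[:2, :]

cameras = [orthographic_camera(rng) for _ in range(I)]

# Rigid scene: the same 3xP shape is seen in every image
X_rigid = rng.standard_normal((3, P))
W_rigid = np.vstack([R @ X_rigid for R in cameras])

# Non-rigid scene without any prior: an unrelated 3xP shape per image
W_free = np.vstack([R @ rng.standard_normal((3, P)) for R in cameras])

rank_rigid = np.linalg.matrix_rank(W_rigid)
rank_free = np.linalg.matrix_rank(W_free)
print(rank_rigid, rank_free, min(2 * I, P))
```

For generic data the rigid matrix has rank 3, while the unconstrained non-rigid matrix reaches the bound min(2I, P), so the rank alone carries no usable structure unless a prior is added.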


A severely ill-posed problem
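Orthographic camera

$$\underbrace{\mathbf{W}}_{2I \times P} = \underbrace{\mathbf{R}}_{2I \times 3I} \, \underbrace{\mathbf{X}}_{3I \times P}$$

2IP entries << 6I variables + 3IP variables

This is an explosion of variables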
A Toy Comparison
Let us assume a 1-minute video with just 100 tracked points, and consider only the estimation of the 3D shape

Rigid case (well-posed problem):
§ Input data: 100 points x 60 sec x 30 Hz x 2 = 360,000 measurements
§ Unknowns: 100 points x 3 = 300 unknowns

Non-rigid case (ill-posed problem):
§ Input data: 100 points x 60 sec x 30 Hz x 2 = 360,000 measurements
§ Unknowns: 100 points x 60 sec x 30 Hz x 3 = 540,000 unknowns
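The counts of the toy comparison can be reproduced with a few lines of arithmetic. The variable names below are only illustrative.

```python
points, seconds, fps = 100, 60, 30
frames = seconds * fps                     # 1800 frames in one minute

measurements = points * frames * 2         # (x, y) per point and frame
unknowns_rigid = points * 3                # one 3D shape for the whole video
unknowns_nonrigid = points * frames * 3    # one 3D shape per frame

print(measurements, unknowns_rigid, unknowns_nonrigid)
# prints: 360000 300 540000
```

In the rigid case there are 1200 measurements per unknown, whereas in the non-rigid case there are fewer measurements than unknowns.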


How can I solve the problem?

The art of priors

Including deformation priors is substantially more difficult than using simple rigidity
Many possibilities were presented

A wide variety of priors in literature:


§ Physical priors: particle dynamics, elasticity, finite elements, and many others
§ Probabilistic priors: low-rank models on shape, trajectory, shape-trajectory or force domains; union of subspaces; Gaussian priors
§ Geometric priors: isometric, as rigid as possible, bone lengths, quadratic models
§ Temporal priors: temporally coherent deformations
§ Piecewise priors
§ Many others
Shape Linear Subspace
(a probabilistic prior)
A Low-Rank Shape Model

Basically, the non-rigid 3D shape can be obtained as a linear combination of fixed shape vectors. For every combination of weight coefficients, a different solution can be achieved:

Your estimation = Rotation x (linear combination of some shapes) + Translation
Including the low-rank shape model

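We approximate the 3D shape by a linear combination of K shape vectors b (normally, K << P or I). For every k-th component, a weight coefficient $l_k$ is needed. As the shape is non-rigid, by modifying the coefficients for every i-th image, we will change the 3D shape as:

$$\underbrace{\mathbf{X}}_{3I \times P} = \underbrace{\mathbf{L}}_{3I \times 3K} \, \underbrace{\mathbf{B}}_{3K \times P}$$

where $\mathbf{B}$ stacks the K basis shapes $\mathbf{B}_k$ (each 3xP) and the i-th block row of $\mathbf{L}$ is $[\, l_1^i \mathbf{I}_3, \dots, l_K^i \mathbf{I}_3 \,]$

Another type of expression for the i-th image:

$$\mathbf{X}_i = \sum_{k=1}^{K} l_k^i \, \mathbf{B}_k$$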
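The low-rank shape model can be sketched numerically: the per-image sum over basis shapes and the stacked matrix product give the same result. The basis and weights below are random toy values, and the Kronecker-product construction of the weight matrix is one possible way to build it, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
I, P, K = 6, 20, 3                     # images, points, basis shapes

B = rng.standard_normal((3 * K, P))    # K basis shapes stacked, each 3xP
l = rng.standard_normal((I, K))        # l[i, k]: weight of basis k in image i

# Per-image form: X_i = sum_k l[i, k] * B_k
X_per_image = [
    sum(l[i, k] * B[3 * k:3 * k + 3] for k in range(K)) for i in range(I)
]

# Stacked form: X (3IxP) = L (3Ix3K) B (3KxP), block (i, k) of L is l[i, k] * I_3
L = np.kron(l, np.eye(3))
X = L @ B

print(np.allclose(X, np.vstack(X_per_image)))
print(np.linalg.matrix_rank(X), "<=", 3 * K)
```

Only 3KP basis entries and IK weights describe all I shapes, instead of 3IP free coordinates.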


Shape Basis Estimation
In non-rigid structure from motion, we have some alternatives to
estimate the shape basis:
§ The most natural is to learn it on the fly, using only the input data
§ The input data can also be used to estimate a shape basis from
a shape at rest (like a mean shape) by applying:
- Modal analysis based on physical models
- Spectral analysis based on a distance matrix
§ If training data are available, the basis can be learned from them (PCA, deep-learning based, etc.). This approach is supervised
Non-Rigid Structure from Motion by Factorization
Including the low-rank shape model

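Thanks to the relation between the 3D shape and the shape basis:

$$\underbrace{\mathbf{X}}_{3I \times P} = \underbrace{\mathbf{L}}_{3I \times 3K} \, \underbrace{\mathbf{B}}_{3K \times P}$$

where $\mathbf{L}$ is the 3Ix3K matrix of weight coefficients, we obtain the projection equation for an orthographic camera by using the low-rank shape model as:

$$\underbrace{\mathbf{W}}_{2I \times P} = \mathbf{R} \, \mathbf{X} = \underbrace{\mathbf{R} \, \mathbf{L}}_{2I \times 3K} \, \underbrace{\mathbf{B}}_{3K \times P}$$

The i-th block row of the 2Ix3K factor is $[\, l_1^i \mathbf{R}_i, \dots, l_K^i \mathbf{R}_i \,]$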
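As a sanity check of the orthographic projection with the low-rank model, the sketch below builds the 2Ix3K motion factor block by block and compares the product with a frame-by-frame projection. Cameras, basis and weights are random toy values and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
I, P, K = 5, 15, 2

def orthographic_camera(rng):
    """Return a 2x3 matrix with orthonormal rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q[:2, :]

R = [orthographic_camera(rng) for _ in range(I)]
B = rng.standard_normal((3 * K, P))    # stacked basis shapes
l = rng.standard_normal((I, K))        # weight coefficients

# Block row i of the motion factor: [l_i1 R_i, ..., l_iK R_i] (2 x 3K)
M = np.vstack([np.hstack([l[i, k] * R[i] for k in range(K)]) for i in range(I)])
W = M @ B

# Reference: build each shape X_i first, then project it with R_i
W_ref = np.vstack([
    R[i] @ sum(l[i, k] * B[3 * k:3 * k + 3] for k in range(K))
    for i in range(I)
])

print(M.shape, np.allclose(W, W_ref))
```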


What about the perspective case?

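A similar analysis can be followed, but now, considering homogeneous coordinates. We can obtain:

$$\underbrace{\mathbf{W}}_{3I \times P} = \underbrace{\mathbf{M}}_{3I \times (3K+1)} \, \underbrace{\mathbf{B}}_{(3K+1) \times P}$$

where each 3x4 projection matrix splits into a 3x3 block, which multiplies the weighted basis shapes, and a 3x1 block, which multiplies the extra row of ones in the shape factor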
Factorization

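In both cases, the goal is to infer the motion factor (P or R) and the 3D coordinates X of the observed non-rigid object from the 2D point tracks W in a monocular video:

Orthographic camera:

$$\underbrace{\mathbf{W}}_{2I \times P} = \underbrace{\mathbf{M}}_{2I \times 3K} \, \underbrace{\mathbf{B}}_{3K \times P}$$

Perspective camera:

$$\underbrace{\mathbf{W}}_{3I \times P} = \underbrace{\mathbf{M}}_{3I \times (3K+1)} \, \underbrace{\mathbf{B}}_{(3K+1) \times P}$$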

The full linear system

W=MB
Two factors: the motion factor M (camera rotations and weight coefficients) and the shape basis B. The 3D shape is recovered as the product of B and the coefficients
More on factorization Orthographic camera

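Because M is a 2Ix3K matrix and B is a 3KxP matrix, the rank of W is at most 3K. If we apply SVD to a noise-free W, we will have only 3K non-zero singular values

However, measurements are normally noisy, and in practice the rank will not be 3K. We have to impose it, so we need to tune the rank K a priori

Applying SVD factorization and keeping the 3K largest singular values, we have:

$$\mathbf{W} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T = [\mathbf{U} \boldsymbol{\Sigma}^{1/2}] \, [\boldsymbol{\Sigma}^{1/2} \mathbf{V}^T] = [\mathbf{U} \boldsymbol{\Sigma}^{1/2} \mathbf{Q}] \, [\mathbf{Q}^{-1} \boldsymbol{\Sigma}^{1/2} \mathbf{V}^T]$$

i.e., $\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma}^{1/2} \mathbf{Q}$ and $\mathbf{B} = \mathbf{Q}^{-1} \boldsymbol{\Sigma}^{1/2} \mathbf{V}^T$ (the two factors we look for)

The factorization is not unique: every invertible 3Kx3K matrix Q gives a valid solution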
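The rank-3K factorization and its ambiguity can be illustrated with a truncated SVD. This is a minimal sketch on synthetic data: the factors are random rather than true camera and shape matrices, and the noise level and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
I, P, K = 12, 30, 2
r = 3 * K                                    # rank to impose

# Synthetic rank-3K measurements plus a little noise
W_clean = rng.standard_normal((2 * I, r)) @ rng.standard_normal((r, P))
W = W_clean + 1e-3 * rng.standard_normal(W_clean.shape)

# Truncated SVD: keep the 3K largest singular values
U, s, Vt = np.linalg.svd(W, full_matrices=False)
sqrt_S = np.diag(np.sqrt(s[:r]))
M_hat = U[:, :r] @ sqrt_S                    # 2I x 3K motion factor
B_hat = sqrt_S @ Vt[:r]                      # 3K x P shape factor
W_hat = M_hat @ B_hat                        # rank-3K approximation of W

# Ambiguity: any invertible Q yields another valid pair of factors
Q = np.eye(r) + 0.3 * rng.standard_normal((r, r))
W_hat_Q = (M_hat @ Q) @ (np.linalg.inv(Q) @ B_hat)

print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(W_hat))
print(np.allclose(W_hat, W_hat_Q))
```

The noisy W is full rank, the truncated product has rank 3K, and the product is unchanged by Q, which is why an additional step is needed to select a meaningful Q.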
Metric Upgrade

How is Q computed?

Enforcing orthogonality constraints on the camera rotation. A rotation matrix always has some properties (it is not a random matrix), since it lies on the SO(3) manifold

Be careful. Now, matrix M also includes the weight coefficients in addition to the camera rotations!
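To make the constraint concrete: every 2x3 block of the true motion matrix is a scaled orthographic camera, so the block times its transpose is a multiple of the 2x2 identity. The sketch below only checks this property before and after undoing a known mixing matrix; it does not estimate Q, and the function `orthonormality_residual` and the data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
I, K = 4, 2

def orthographic_camera(rng):
    """Return a 2x3 matrix with orthonormal rows."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return Q[:2, :]

R = [orthographic_camera(rng) for _ in range(I)]
l = rng.standard_normal((I, K))
M = np.vstack([np.hstack([l[i, k] * R[i] for k in range(K)]) for i in range(I)])

def orthonormality_residual(M, I, K):
    """Largest deviation of any block product M_ik M_ik^T from c * I_2."""
    worst = 0.0
    for i in range(I):
        for k in range(K):
            block = M[2 * i:2 * i + 2, 3 * k:3 * k + 3]
            G = block @ block.T                    # ideally l_ik^2 * I_2
            scale = 0.5 * np.trace(G)
            worst = max(worst, np.abs(G - scale * np.eye(2)).max())
    return worst

# A factorization recovers M only up to an unknown invertible matrix
Q_true = np.eye(3 * K) + 0.3 * rng.standard_normal((3 * K, 3 * K))
M_hat = M @ np.linalg.inv(Q_true)

res_before = orthonormality_residual(M_hat, I, K)
res_after = orthonormality_residual(M_hat @ Q_true, I, K)
print(res_before, res_after)
```

The residual is far from zero for the ambiguous factor and vanishes once the correct Q is applied, which is the property a metric upgrade tries to enforce.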
But in many cases, we cannot observe all the points in all the images: missing tracks
A toy example with missing tracks
Orthographic camera

[Figure: W (8x4) = R (8x12) X (12x4), i.e., I = 4 images and P = 4 points. Some entries of W are empty: the input data is not full]
Handling missing tracks
Two alternatives are possible:
§ Apply a matrix completion algorithm to infer the missing entries, and then run factorization over the full measurement matrix
§ Leave the missing entries out of the formulation by applying non-linear optimization. Once the 3D model and camera pose are computed, the missing 2D tracks can be inferred too
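The first alternative can be sketched with a very simple completion scheme: alternate between a rank-3K truncation and re-imposing the observed entries. This iterative SVD imputation is only one of many matrix completion algorithms and is chosen here as an assumption for illustration; the data are synthetic and the function name is hypothetical.

```python
import numpy as np

def complete_low_rank(W_obs, visible, rank, n_iter=200):
    """Fill missing entries by alternating rank truncation and re-imputation."""
    counts = np.maximum(visible.sum(axis=1, keepdims=True), 1)
    row_mean = np.where(visible, W_obs, 0.0).sum(axis=1, keepdims=True) / counts
    W = np.where(visible, W_obs, row_mean)          # initial guess: row means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        W = np.where(visible, W_obs, low_rank)      # keep observations, update the rest
    return W

rng = np.random.default_rng(6)
I, P, K = 20, 60, 2
r = 3 * K

W_full = rng.standard_normal((2 * I, r)) @ rng.standard_normal((r, P))
# A lost track misses both the x and the y row of its image (about 10% here)
visible = np.repeat(rng.random((I, P)) > 0.1, 2, axis=0)
W_obs = np.where(visible, W_full, np.nan)

W_completed = complete_low_rank(W_obs, visible, r)

missing = ~visible
rms_error = np.sqrt(np.mean((W_completed[missing] - W_full[missing]) ** 2))
baseline = np.sqrt(np.mean(W_full[missing] ** 2))   # error of filling with zeros
print(rms_error, baseline)
```

The completed matrix can then be passed to the factorization step as if it were fully observed.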

Non-Rigid Structure from Motion by Non-Linear Optimization
Problem Statement
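For an orthographic camera, we have:

$$\mathbf{x}_p^i = \mathbf{R}_i \sum_{k=1}^{K} l_k^i \, \mathbf{b}_k^p + \mathbf{t}_i$$

where $\mathbf{b}_k^p$ is the 3x1 column of the k-th basis shape for the p-th point. The problem (compacting over the points) can be formulated as:

$$\min_{\mathbf{R}_i, \, \mathbf{B}_k, \, l_k^i, \, \mathbf{t}_i} \; \sum_{i=1}^{I} \Big\| \mathbf{W}_i - \mathbf{R}_i \sum_{k=1}^{K} l_k^i \, \mathbf{B}_k - \mathbf{t}_i \mathbf{1}^T \Big\|_F^2$$

where $\mathbf{W}_i$ is the 2xP block of W for the i-th image and $\mathbf{B}_k$ is the k-th 3xP basis shape, and we perform non-linear optimization by minimizing this geometric error cost function. The translation $\mathbf{t}_i$ is optional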
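A geometric error of this kind can be written as a short function. The sketch below assumes the orthographic low-rank model, an optional translation, and an optional visibility mask so that missing tracks are simply left out of the sum; the function name, argument layout and toy data are assumptions.

```python
import numpy as np

def reprojection_cost(W, R, B, l, t=None, visible=None):
    """Sum of squared 2D residuals of the orthographic low-rank model.

    W: (2I, P) tracks, R: (I, 2, 3) cameras, B: (3K, P) basis shapes,
    l: (I, K) weights, t: optional (I, 2) translations,
    visible: optional (I, P) boolean mask of observed tracks.
    """
    I, K = l.shape
    cost = 0.0
    for i in range(I):
        X_i = sum(l[i, k] * B[3 * k:3 * k + 3] for k in range(K))   # 3xP shape
        pred = R[i] @ X_i                                           # 2xP projection
        if t is not None:
            pred = pred + t[i][:, None]
        res = W[2 * i:2 * i + 2] - pred
        if visible is not None:
            res = res * visible[i]          # unobserved points contribute zero
        cost += np.sum(res ** 2)
    return cost

rng = np.random.default_rng(7)
I, P, K = 4, 10, 2
R = np.stack([np.linalg.qr(rng.standard_normal((3, 3)))[0][:2] for _ in range(I)])
B = rng.standard_normal((3 * K, P))
l = rng.standard_normal((I, K))
W = np.vstack([
    R[i] @ sum(l[i, k] * B[3 * k:3 * k + 3] for k in range(K)) for i in range(I)
])

cost_at_truth = reprojection_cost(W, R, B, l)
cost_perturbed = reprojection_cost(W, R, B, l + 0.1)
visible = rng.random((I, P)) > 0.2
cost_masked = reprojection_cost(W, R, B, l + 0.1, visible=visible)
print(cost_at_truth, cost_perturbed, cost_masked)
```

A function of this form is what an iterative solver would minimize over the cameras, the basis and the weights.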
Bundle Adjustment
Normally, the Levenberg-Marquardt method is used to minimize the cost. It needs the Jacobian matrix J, i.e., the derivative of the residuals with respect to the unknowns (R, B and the set of weights $l_k$)

Again, there are many variants on how to proceed to reduce the computational complexity of the problem:
§ Alternate minimization of motion and shape parameters
§ Sparse methods. Computing J is expensive, but the cost can be reduced by considering its binary sparsity pattern

Initialization: the optimization can be initialized assuming a rigid shape, i.e., using rigid factorization or non-linear optimization for a rigid shape
Bundle Adjustment
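The bundle adjustment method:
§ Minimizes the cost function with Levenberg-Marquardt
§ Exploits the sparseness of the Jacobian matrix to decrease computation and memory requirements

The Levenberg-Marquardt algorithm:
§ Is a mixture of Gauss-Newton and gradient descent
§ Behaves like Gauss-Newton when close to the minimum (quadratic region)
§ Behaves like gradient descent when the prediction is poor
§ Depends on a parameter θ that controls the mixture of Gauss-Newton (small θ) and gradient descent (large θ) as:

$$(\mathbf{J}^T \mathbf{J} + \theta \mathbf{I}) \, \delta\mathbf{p} = -\mathbf{g}$$

where J has one row per residual and one column per parameter, δp is the update of the parameters p we want to estimate, and $\mathbf{g} = \mathbf{J}^T \mathbf{r}$ is the gradient of the cost for the residual vector r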
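The damped update can be tried on a problem much smaller than bundle adjustment. The sketch below fits a two-parameter exponential curve with a basic Levenberg-Marquardt loop; the model, the factor-of-ten update of θ and the function names are illustrative assumptions, and the Jacobian is dense rather than sparse.

```python
import numpy as np

def residuals(p, x, y):
    a, b = p
    return a * np.exp(b * x) - y

def jacobian(p, x):
    a, b = p
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])        # columns: d r / d a, d r / d b

def levenberg_marquardt(p, x, y, theta=1e-2, n_iter=50):
    for _ in range(n_iter):
        r = residuals(p, x, y)
        J = jacobian(p, x)
        g = J.T @ r                               # gradient of 0.5 * ||r||^2
        delta = np.linalg.solve(J.T @ J + theta * np.eye(p.size), -g)
        if np.sum(residuals(p + delta, x, y) ** 2) < np.sum(r ** 2):
            p, theta = p + delta, theta / 10      # accept: closer to Gauss-Newton
        else:
            theta *= 10                           # reject: closer to gradient descent
    return p

x = np.linspace(0.0, 1.0, 20)
p_true = np.array([2.0, -1.5])
y = residuals(p_true, x, 0.0)                     # noise-free samples of a * exp(b x)
p_est = levenberg_marquardt(np.array([1.0, 0.0]), x, y)
print(p_est)
```

In bundle adjustment the same update is used, but the linear system is solved by exploiting the sparsity of J instead of forming dense matrices.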
Exercise
Let us assume a monocular video of 3 images, where 6 points are
observed. Considering the map is non-rigid and the visibility is full,
define the corresponding Jacobian matrix. A low-rank shape model
of rank 2 can be considered

Number of unknowns
Number of equations

J = [sparse pattern!]
pa
Including priors
As in the rigid case, we can apply temporal smoothness priors, but now on both the camera motion and the shape deformation (be careful when the input data are a collection of pictures, since consecutive images need not be close in time). To this end, we may consider the expression:

where Li includes all K weight coefficients in the i-th image
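One possible form of such a prior is a first-order penalty on the differences between consecutive frames. The exact expression used in the lecture is not reproduced here; the function below, its weights `w_shape` and `w_motion`, and the toy sequences are assumptions for illustration.

```python
import numpy as np

def temporal_smoothness(L, R, w_shape=1.0, w_motion=1.0):
    """First-order smoothness penalty.

    L: (I, K) weight coefficients per image, R: (I, 2, 3) cameras per image.
    """
    shape_term = np.sum((L[1:] - L[:-1]) ** 2)    # changes in the weights
    motion_term = np.sum((R[1:] - R[:-1]) ** 2)   # changes in the cameras
    return w_shape * shape_term + w_motion * motion_term

I = 10
R_static = np.tile(np.eye(3)[:2], (I, 1, 1))      # same camera in every frame
L_smooth = np.linspace(0.0, 1.0, I)[:, None]      # slowly varying weight
order = [0, 9, 1, 8, 2, 7, 3, 6, 4, 5]            # same values, scrambled in time
L_jumpy = L_smooth[order]

e_smooth = temporal_smoothness(L_smooth, R_static)
e_jumpy = temporal_smoothness(L_jumpy, R_static)
print(e_smooth, e_jumpy)
```

The scrambled sequence contains the same weights but receives a much larger penalty, which is why such a term only makes sense when the frames are ordered in time.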


How can we obtain a sequential solution?

We solve the optimization sequentially, processing the information as the data arrive. Future frames are not available. Two options:
§ Pure sequential (frame by frame)
§ Sliding window (from 3 to 5 consecutive frames)

Initialization is performed by rigid estimation (using just the initial frames). The problem is very challenging
An Extension
Semantic 3D Reconstruction
3D Reconstruction of Categories

Unsupervised 3D Reconstruction and Grouping of Rigid and Non-Rigid Categories. Antonio Agudo. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI), 44(1): 519-532, 2022.
Input Data as Training Data
Shape Basis as a MLP

Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints. Vikramjit Sidhu, Edgar Tretschk, Vladislav
Golyanik, Antonio Agudo, and Christian Theobalt. European Conference on Computer Vision, 2020.
Shape Basis as a MLP
Priors and models can be included as loss terms in training. For example, the following energy includes both a data term and priors:

Some Results

Neural Radiance Fields in the non-rigid context
Dynamic Neural Radiance Fields

4DPV: 4D Pet from Videos by Coarse-to-fine Non-Rigid Radiance Fields. Sergio M. de Paco and Antonio Agudo. Asian
Conference on Computer Vision, 2024.
Coarse-to-fine Shapes from Videos
Demo

Things to remember

3D and 4D information can be obtained from a sequence of images

For rigid objects, the problem is well-posed. For non-rigid ones, it is inherently ill-posed (additional priors are necessary)

Model-based approaches can handle a wide variety of deformations. They are normally universal and generic. No supervision is needed

Data-based approaches require a lot of data to constrain the solution space. Obtaining *good* data can be hard. They only work for a particular object or deformation (depending on the training data)

The future must be unsupervised (or self-supervised) and will probably combine both model- and data-based approaches, performing the estimation of multiple scenarios with a hand-held camera
Acknowledgments

Thanks to Kris Kitani, Yaser Sheikh, Alessio del Bue, Lourdes Agapito, Sergio M. de Paco
