
Stereo Matching and Rectification

January 6, 2025

1 Camera Model
• World coordinates → (Coordinate transformation) → Camera coordinates → (Perspective Projection)
→ Image Coordinates

Calibrate camera intrinsic and extrinsic parameters to accurately transform and project the world coor-
dinates to image coordinates.
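
As a minimal illustration of this chain, here is a NumPy sketch; the intrinsics K and the pose (R, t) are hypothetical example values, not parameters from the text:

```python
import numpy as np

# Hypothetical calibrated parameters (illustrative values only).
K = np.array([[700.0,   0.0, 320.0],     # intrinsics: focal length, principal point
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                            # extrinsics: world-to-camera rotation
t = np.array([0.1, 0.0, 0.0])            # extrinsics: world-to-camera translation

def project(X_world):
    """World coordinates -> camera coordinates -> image (pixel) coordinates."""
    X_cam = R @ X_world + t              # coordinate transformation
    x_hom = K @ X_cam                    # perspective projection (homogeneous)
    return x_hom[:2] / x_hom[2]          # divide by depth to get pixel coordinates

print(project(np.array([0.5, 0.2, 3.0])))   # -> [460.0, 286.67]
```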

1.1 Epipolar Geometry


• Epipolar line: The intersection of the epipolar plane (the plane defined by the camera centers and a
3D point) with the image plane.
• Epipole: The projection of the other camera center (e.g., o′ in the left camera image and o in the
right camera image) onto the image plane.
• Epipolar constraint: If x is a point in the first image, its corresponding point x′ in the second image
must lie on the epipolar line l′ . This reduces the search for correspondences to a 1D problem.

1.2 Fundamental Matrix



x′⊤ Fx = 0 (1)

• Essential matrix E operates on points in the camera coordinate system (3D points projected to 2D
camera coordinates).
E = [t]× R (2)

• Fundamental matrix F operates on points in the image (pixel) coordinate system.

F = K′−⊤ EK−1 (3)

• Both E and F are rank 2 matrices, but they have different properties:
– E has two equal non-zero singular values.
– F does not have equal singular values.

• Degrees of Freedom (DoF):


– E has 5 DoF (3 for rotation, 2 for translation up to scale).
– F has 7 DoF: its 9 entries are defined only up to scale, and the rank-2 constraint det F = 0 removes one more degree of freedom. Unlike E, it additionally encodes the intrinsic parameters of the cameras.

1.3 Big Picture: 3 Key Components in 3D
1. 3D Points (Structure): The spatial arrangement and coordinates of points in the 3D scene.
2. Estimate Fundamental Matrix: Determine the relationship between two images of the same scene
to find the corresponding points.

3. Correspondences → Camera (Motion): Use the matched points to determine the relative motion
(rotation and translation) between the two cameras.

1.4 (Normalized) Eight-Point Algorithm


1. Normalize points: Transform the image points so that they are centered around the origin and have an average distance of √2 from the origin.

x̂ = Tx, x̂′ = T′ x′ (4)

2. Construct the M × 9 matrix A: Each row of A is constructed from a pair of corresponding normalized points (x̂ᵢ, ŷᵢ) and (x̂′ᵢ, ŷ′ᵢ), so that Af = 0 stacks the M epipolar constraints x̂′⊤F̂x̂ = 0, with f holding the nine entries of F̂ in row-major order. Row i of A is

( x̂′ᵢx̂ᵢ, x̂′ᵢŷᵢ, x̂′ᵢ, ŷ′ᵢx̂ᵢ, ŷ′ᵢŷᵢ, ŷ′ᵢ, x̂ᵢ, ŷᵢ, 1 ),  i = 1, …, M (5)

3. Find the SVD of A: Perform Singular Value Decomposition to find the matrix V .

A = UDV⊤ (6)

4. Extract F : The entries of F are the elements of the column of V corresponding to the smallest
singular value.
5. Enforce rank-2 constraint on F : Modify F to ensure it has rank 2 by setting its smallest singular
value to zero.

6. Un-normalize F : Transform F back using the normalization matrices.

F = T′⊤ F̂T (7)
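
A compact NumPy sketch of these six steps (a sketch, assuming at least eight correspondences and the convention x′⊤Fx = 0):

```python
import numpy as np

def normalize(pts):
    """Similarity transform T: center the points and scale mean distance to sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def eight_point(x, xp):
    """x, xp: (N, 2) arrays of corresponding pixel coordinates, N >= 8."""
    xh, T = normalize(x)
    xph, Tp = normalize(xp)
    # One row per correspondence, matching Eq. (5).
    A = np.column_stack([
        xph[:, 0] * xh[:, 0], xph[:, 0] * xh[:, 1], xph[:, 0],
        xph[:, 1] * xh[:, 0], xph[:, 1] * xh[:, 1], xph[:, 1],
        xh[:, 0], xh[:, 1], np.ones(len(x))])
    # f = right singular vector of A for the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint by zeroing the smallest singular value.
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    # Un-normalize: F = T'^T F_hat T, Eq. (7).
    return Tp.T @ F @ T
```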

1.5 Triangulation
• Correspondences → 3D Points (Structure): Given a set of corresponding points in two images,
triangulate to find their 3D coordinates.
• Camera (Motion) → 3D Points (Structure): Use the known camera positions and orientations
to reconstruct the 3D scene from the image points.
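
A minimal linear (DLT) triangulation sketch, assuming known 3×4 projection matrices P1 and P2 for the two cameras:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    Each image contributes two equations of the form x * (p3 . X) - (p1 . X) = 0,
    where p1, p2, p3 are the rows of that camera's projection matrix.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)      # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]              # de-homogenize to obtain the 3D point
```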

2 Basic Two-View Stereo Setup


• Camera (Motion) → Stereo Matching → Correspondences

2.1 Reconstruction of 3D Points


1. Select a point in one image with feature detection (e.g., SIFT):
• Use Scale-Invariant Feature Transform (SIFT) to detect and describe local features in the image.
2. Form the epipolar line for that point in the second image:

• Calibrate the cameras to find the Essential matrix E or the Fundamental matrix F .
• Use F to compute the epipolar line l′ in the second image.
3. Find the matching point along the epipolar line (Stereo matching):
• Search along the epipolar line for the corresponding point using a matching criterion (e.g., nor-
malized cross-correlation).
4. Perform triangulation:
• Use the corresponding points from both images to compute the 3D coordinates of the point via
triangulation.

Disadvantages of this procedure: in general, every point has its own, differently oriented epipolar line, so the search geometry must be recomputed for each point. A special camera configuration simplifies matters:


• Example: Parallel to Image Plane
– When the camera motion is parallel to the image plane, the epipoles are at infinity.
– In this case, the epipolar lines are parallel in both images.
• Epipoles at Infinity:
– When epipoles are infinitely far away, the epipolar lines are parallel and do not converge.
• Depth Map:
– The amount of horizontal movement of corresponding points is inversely proportional to the
distance from the camera, resulting in a depth map.
– Points closer to the camera move more, while points further away move less.

2.2 Depth from Disparity


Disparity is inversely proportional to depth. The relationship between disparity and depth is given by:
z = fB / (x − x′) (8)
where:
• z is the depth (distance from the camera).
• f is the focal length of the camera.
• B is the baseline (distance between the two camera centers).
• x and x′ are the x-coordinates of the corresponding points in the left and right images, respectively.
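For example (illustrative numbers): with f = 700 pixels, B = 0.1 m, and a measured disparity of x − x′ = 35 pixels, the depth is z = (700 · 0.1)/35 = 2 m; halving the disparity to 17.5 pixels doubles the depth to 4 m.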

2.3 Can I Compute Depth from Any Two Images of the Same Object?
To accurately compute depth from two images of the same object, the following conditions must be met:

1. Sufficient Baseline:
• There must be a sufficient distance between the two camera positions to observe a noticeable
disparity.
• A larger baseline improves depth accuracy but can also make matching points more challenging.
2. Rectified Images:
• The images need to be rectified, which means transforming them such that the epipolar lines are
horizontal and aligned.
• Rectification simplifies the search for correspondences to a 1D problem along horizontal lines.

2.4 Effect of Baseline on Stereo Results
• Large Baseline:
– Advantages: Smaller triangulation error, leading to more accurate depth estimates.
– Disadvantages: Matching points between images becomes more difficult due to greater variation
in perspective.
• Small Baseline:
– Advantages: Easier to match points between images due to less variation in perspective.
– Disadvantages: Higher triangulation error, leading to less accurate depth estimates.

2.5 Steps to Compute Depth from Disparity


1. Rectify Images:
• Transform the images such that the epipolar lines are horizontal, simplifying the correspondence
search.
2. For Each Pixel:
(a) Find the epipolar line in the rectified image.
(b) Scan along the epipolar line to find the best match for the pixel.
(c) Compute depth from the disparity using the formula:
Z = bf / d (9)
where Z is the depth, b is the baseline, f is the focal length, and d is the disparity.

2.6 How to Make the Epipolar Lines Horizontal


Epipolar lines are horizontal when the rotation matrix R = I and the translation vector t = (T, 0, 0), meaning
the cameras are aligned such that their optical axes are parallel and the translation is purely along the x-axis.
• In rectified images, corresponding points lie on the same row. This alignment ensures that the epipolar
lines are horizontal, simplifying the process of finding correspondences.
• The rectification process involves transforming the images using homographies that align the epipolar
lines.

2.7 Stereo Image Rectification


If the image planes are not parallel, we can find homographies to project each view onto a common plane
parallel to the baseline.

2.8 Image Rectification


To rectify an image, we calculate a rectifying rotation Rrect = (r1, r2, r3)⊤, with:

r1 = −R⊤t / ∥R⊤t∥₂
r2 = ((0, 0, 1)⊤ × r1) / ∥(0, 0, 1)⊤ × r1∥₂
r3 = r1 × r2

As the epipole in the first image is in the direction of r1, it is easy to see that the rotated epipole is ideal: Rrect r1 = (1, 0, 0)⊤. Thus, applying Rrect to the first camera leads to parallel and horizontal epipolar lines.
Rectification Algorithm:

1. Estimate Ẽ, decompose into t and R, and construct Rrect as above.
2. Warp pixels in the first image as follows:

x̃′1 = K Rrect K1⁻¹ x1 (10)

3. Warp pixels in the second image as follows:

x̃′2 = K Rrect R⊤ K2⁻¹ x2 (11)

Remarks:

• K is a shared intrinsic (calibration) matrix that can be chosen arbitrarily (e.g., K = K1 ).


• In practice, the inverse transformation is used for warping (i.e., query the source image using the
inverse of the computed transformation).
After stereo image rectification, correspondences are located on the same image row as the query point.
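
The construction above is short enough to sketch directly in NumPy (a sketch, assuming R, t come from the decomposed essential matrix and choosing the shared K = K1):

```python
import numpy as np

def rectifying_rotation(R, t):
    """Build Rrect = (r1, r2, r3)^T as defined in Section 2.8."""
    r1 = -R.T @ t
    r1 = r1 / np.linalg.norm(r1)                    # baseline direction
    r2 = np.cross(np.array([0.0, 0.0, 1.0]), r1)
    r2 = r2 / np.linalg.norm(r2)                    # orthogonal to r1 and the optical axis
    r3 = np.cross(r1, r2)
    return np.vstack([r1, r2, r3])

def rectifying_homographies(K1, K2, R, t):
    """Homographies implementing Eqs. (10) and (11), with K = K1."""
    Rrect = rectifying_rotation(R, t)
    H1 = K1 @ Rrect @ np.linalg.inv(K1)             # warps the first image
    H2 = K1 @ Rrect @ R.T @ np.linalg.inv(K2)       # warps the second image
    return H1, H2
```

As noted in the remarks, each destination pixel would then be filled by sampling the source image at the location given by the inverse homography (inverse warping).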

2.9 Depth Estimation via Stereo Matching


1. Rectify images: Transform the images such that the epipolar lines are horizontal.
2. For each pixel:

(a) Find the epipolar line in the rectified image.


(b) Scan along the epipolar line to find the best match for the pixel. This can be done using various
matching criteria, such as sum of absolute differences (SAD), sum of squared differences (SSD),
or normalized cross-correlation (NCC).
(c) Compute depth from the disparity using the formula:

z = bf / d (12)
where:
• z is the depth.
• b is the baseline (distance between the two cameras).
• f is the focal length of the camera.
• d is the disparity (difference in x-coordinates of the corresponding points in the rectified
images).

3 Local Stereo Matching Algorithm


3.1 Matching Using Epipolar Lines
For a patch in the left image:
• Compare it with patches along the same row in the right image.
• Select the patch with the highest match score.

• Repeat for all pixels in the left image.

3.2 Similarity Measure
Commonly used similarity measures for evaluating match scores include:
• Sum of Absolute Differences (SAD):

SAD = Σ_{(u,v)∈window} |IL(u, v) − IR(u + d, v)| (13)

• Sum of Squared Differences (SSD):

SSD = Σ_{(u,v)∈window} (IL(u, v) − IR(u + d, v))² (14)

• Zero-mean SAD:

Zero-mean SAD = Σ_{(u,v)∈window} |(IL(u, v) − ĪL) − (IR(u + d, v) − ĪR)| (15)

• Locally Scaled SAD:

Locally Scaled SAD = Σ_{(u,v)∈window} |IL(u, v)/ĪL − IR(u + d, v)/ĪR| (16)

• Normalized Cross-Correlation (NCC):

NCC = Σ_{(u,v)∈window} (IL(u, v) − ĪL)(IR(u + d, v) − ĪR) / sqrt( Σ_{(u,v)∈window} (IL(u, v) − ĪL)² · Σ_{(u,v)∈window} (IR(u + d, v) − ĪR)² ) (17)
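
These measures are straightforward to express in NumPy; a sketch for two equally sized grayscale patches (float arrays) might look like this:

```python
import numpy as np

def sad(pl, pr):
    return np.abs(pl - pr).sum()

def ssd(pl, pr):
    return ((pl - pr) ** 2).sum()

def zero_mean_sad(pl, pr):
    return np.abs((pl - pl.mean()) - (pr - pr.mean())).sum()

def ncc(pl, pr, eps=1e-8):
    zl, zr = pl - pl.mean(), pr - pr.mean()
    # eps guards against division by zero in textureless patches (an added safeguard).
    return (zl * zr).sum() / (np.sqrt((zl ** 2).sum() * (zr ** 2).sum()) + eps)
```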

3.3 Window Size


Adaptive window methods can help balance detail and noise:
• For each point, match using windows of multiple sizes and use the disparity that results in the best
similarity measure (minimize SSD per pixel).
Smaller Window:
• Advantages: More detail in the disparity map.
• Disadvantages: More noise in the disparity map.
Larger Window:
• Advantages: Smoother disparity map.
• Disadvantages: Less detail, and can fail near boundaries and discontinuities.

3.4 Block Matching


• Choose a disparity range [0, D].
• For all pixels x = (x, y) in the left image, compute the best disparity using a winner-takes-all (WTA)
approach.
• Repeat this process for the right image.
• Apply a left-right consistency check to remove outliers.
Half Occlusions:
• An area that is visible in the left image but not in the right image.
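
A minimal dense version of this procedure, using a SAD cost volume with winner-takes-all plus the left-right check (window size, disparity range, and the scipy box filter are implementation choices, not prescribed above):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def disparity_wta(left, right, D=64, w=5):
    """Winner-takes-all SAD block matching on rectified grayscale images."""
    H, W = left.shape
    cost = np.full((D, H, W), np.inf)
    for d in range(D):
        diff = np.abs(left[:, d:] - right[:, :W - d])           # shift right image by d
        cost[d, :, d:] = uniform_filter(diff, size=2 * w + 1)   # window aggregation
    return cost.argmin(axis=0)                                  # best disparity per pixel

def left_right_consistent(disp_l, disp_r, tol=1):
    """True where the left and right disparity maps agree (outlier/occlusion filter)."""
    H, W = disp_l.shape
    cols = np.tile(np.arange(W), (H, 1))
    match = np.clip(cols - disp_l, 0, W - 1)        # right-image column each pixel maps to
    return np.abs(disp_l - disp_r[np.arange(H)[:, None], match]) <= tol
```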

3.5 Disparity Space Image (DSI)
First, we introduce the concept of the Disparity Space Image (DSI). The DSI for one row represents pairwise
match scores between patches along that row in the left and right images.

• c(i, j) is the match score for the patch centered at left pixel i with the patch centered at right pixel j.
• The dissimilarity value for each pair of patches is entered as a column in the DSI.

Greedy Selection:
• Simply choose, for each column, the row with the lowest dissimilarity (best match score).

3.6 Greedy Per-Pixel Path Matching


Greedy selection often does not satisfy order constraints and produces a non-smooth disparity map.

4 Beyond Local Stereo Matching


4.1 Why is Matching Challenging?
• Uniqueness: Each point in one image should match at most one point in the other image.
• Smoothness: We expect disparity to change slowly across most of the image, resulting in smooth
disparity maps.

• Occlusion:
– What if a pixel in the left image is not seen in the right image?
– What if a pixel in the right image is not seen in the left image?
• Ordering Constraint: If pixels (a, b, c) are ordered in the left image, they should have the same
order in the right image.

– This is not always true and depends on the depth of the objects in the scene.

4.2 Non-Local Constraint: Uniqueness


Each point in one image should match at most one point in the other image. However, in real life, uniqueness
does not always hold due to:
• Repetitive textures: Similar patterns can appear in multiple locations, leading to ambiguous matches.
• Reflective surfaces: Reflections can cause mismatches as they show different scenes in the two images.

• Transparency: Overlapping objects can produce multiple potential matches.

4.3 Non-Local Constraint: Smoothness


We expect disparity values to change slowly for the most part, resulting in smooth disparity maps. However,
exceptions include:
• Depth discontinuities: Sharp changes in depth, such as object edges, result in abrupt disparity changes.
• Textureless regions: Large uniform areas might produce unreliable disparity values.

4.4 Occlusions: No Matches
Dealing with occlusions:
• Identify occluded regions: Use left-right consistency checks to detect pixels that do not have corre-
sponding matches in the other image.

• Fill occlusions: Interpolate disparity values from neighboring non-occluded regions.


• Use visibility constraints: Model occlusions explicitly in the matching process to improve accuracy.

4.5 Left-Right Consistency Test


Outliers and half occlusions can be detected via a left-right consistency test:
• Compute the disparity map for both the left and right images.
• Verify if the disparities map to each other (cycle consistency). Specifically, check if the disparity of
a pixel in the left image matches the disparity of the corresponding pixel in the right image and vice
versa.
• Pixels that fail this consistency test are likely to be outliers or occlusions.

4.6 Adding Inter-Scanline Consistency


So far, each left image patch has been matched independently along the right epipolar line. This approach
can lead to errors due to lack of consistency within the scanline.
To enforce consistency among matches in the same row (scanline):
• Consider the spatial relationship between neighboring pixels along the scanline.

• Apply constraints or smoothing techniques to ensure that the disparity values change smoothly along
the scanline, except at depth discontinuities.
• Use dynamic programming or other optimization techniques to find a consistent set of disparities that
minimize a global energy function incorporating both matching costs and smoothness constraints.

4.7 DSI and Scanline Consistency


Assigning disparities to all pixels in the left scanline now amounts to finding a connected path through the
Disparity Space Image (DSI).

4.8 Lowest Cost Path


We aim to choose the "best" path: the one with the lowest "cost", i.e., the lowest sum of dissimilarity scores along the path.

4.9 Stereo Matching with Dynamic Programming


Dynamic programming yields the optimal path through the grid. This path represents the best set of matches
that satisfy the ordering constraint. There are three cases:
• Matching patches: The cost is the dissimilarity score.

• Occluded from the right: The cost is some constant value.


• Occluded from the left: The cost is some constant value.

4.10 Real Scanline Example
Every pixel in the left column is now marked with either a disparity value or an occlusion label. This process
is repeated for every scanline in the left image.

4.11 Occlusion Filling


A simple trick for filling in gaps caused by occlusion:
• Fill in left occluded pixels with values from the nearest valid pixel preceding it in the scanline.
• For right occluded pixels, look for a valid pixel to the right.
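
A sketch of this fill (assuming occluded pixels are marked with −1; for simplicity this version propagates from the left and falls back to the first valid value on the right for gaps at the start of a row):

```python
import numpy as np

def fill_occlusions(disp):
    """Fill pixels marked -1 from the nearest valid pixel in the same scanline."""
    out = disp.copy()
    for row in out:                      # each row is a view; edits modify `out`
        last = -1
        for j in range(len(row)):
            if row[j] >= 0:
                last = row[j]            # remember the last valid disparity
            elif last >= 0:
                row[j] = last            # fill from the nearest valid pixel to the left
        if np.any(row < 0):              # leading gap: take the first valid value on the right
            valid = row[row >= 0]
            row[row < 0] = valid[0] if valid.size else -1
    return out
```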

4.12 Scanline Stereo by Dynamic Programming


This method often generates streaking artifacts due to the local nature of the matching process.

4.13 Improving Depth Estimation


• Issue: Too many discontinuities in the disparity map.
• Expectation: We expect disparity values to change slowly across most of the image.
• Assumption: Depth should change smoothly, meaning neighboring pixels should have similar dispar-
ity values.

To address this:
• Incorporate smoothness constraints into the stereo matching process.
• Penalize large changes in disparity values between neighboring pixels.
• Use techniques such as regularization or optimization algorithms to enforce smoothness in the disparity
map.

4.14 Energy Minimization


What defines a good stereo correspondence?

1. Match Quality:
• We want each pixel to find a good match in the other image.
• The disparity value should accurately represent the corresponding point in the other image.
2. Smoothness:
• If two pixels are adjacent, they should usually move about the same amount.
• Disparity values should change smoothly across the image, except at depth discontinuities.

To achieve these objectives, we can formulate the stereo matching problem as an energy minimization task: the energy function combines terms for match quality and smoothness, and the goal is to find the disparity map that minimizes this energy. Optimization algorithms like graph cuts, belief propagation, or variational methods can be used to find the optimal solution. The energy function is typically defined as:

E(d) = Ed (d) + λEs (d) (18)


where:

• E(d) is the total energy of the disparity map d, summed over all pixels.
• Ed (d) is the data term, representing the match quality. It ensures that each pixel finds a good match
in the other image, typically obtained from block matching results.
• Es (d) is the smoothness term, representing the smoothness constraint. It ensures that adjacent pixels
usually move about the same amount, promoting smooth disparity maps.
• λ is a weighting parameter that balances the importance of the data term and the smoothness term.
The task is to find the disparity map d that minimizes this energy function, typically achieved using
optimization algorithms such as graph cuts, belief propagation, or variational methods.
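
As a concrete instantiation (illustrative, not prescribed by the text), the two terms often take the form

E_d(d) = Σ_{(x,y)} C(x, y, d(x, y)),    E_s(d) = Σ_{(p,q)∈N} V(d_p, d_q)

where C(x, y, d) is the block-matching cost of assigning disparity d at pixel (x, y), N is the set of neighboring pixel pairs, and a common smoothness potential is the truncated penalty V(d_p, d_q) = min(|d_p − d_q|, τ), which stops growing beyond τ so that genuine depth discontinuities are not over-penalized.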

4.15 Dynamic Programming


Dynamic programming (DP) can be used to minimize the energy function independently per scanline. Here,
D(x, y, z) represents the minimum cost of the solution such that d(x, y) = z, where:
• (x, y) represents the pixel coordinates.
• z represents the disparity value.
DP iterates over each pixel in the scanline and computes the minimum cost of the solution for each
possible disparity value. This process is repeated for all pixels in the scanline, efficiently finding the optimal
disparity map that minimizes the energy function.
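
A per-scanline DP sketch with the three moves from Section 4.9 (match, left occlusion, right occlusion); the occlusion cost `occ` is an assumed constant, and C is one scanline's DSI:

```python
import numpy as np

def dp_scanline(C, occ=0.5):
    """Minimum-cost alignment of one left/right scanline pair.

    C[i, j] is the dissimilarity between left pixel i and right pixel j.
    Returns the cumulative cost table and a move table (0 = match,
    1 = left pixel occluded, 2 = right pixel occluded) for backtracking.
    """
    n, m = C.shape
    cost = np.zeros((n + 1, m + 1))
    move = np.zeros((n + 1, m + 1), dtype=int)
    cost[0, :] = occ * np.arange(m + 1)       # leading occlusions
    cost[:, 0] = occ * np.arange(n + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = (cost[i - 1, j - 1] + C[i - 1, j - 1],   # match i-1 <-> j-1
                       cost[i - 1, j] + occ,                   # left pixel occluded
                       cost[i, j - 1] + occ)                   # right pixel occluded
            move[i, j] = int(np.argmin(choices))
            cost[i, j] = choices[move[i, j]]
    return cost, move
```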

4.16 Energy Minimization via Graph Cut Algorithm


The energy minimization task in stereo vision can be solved using the graph cut algorithm. This algorithm
finds the optimal disparity map by partitioning a graph representing the problem into two disjoint sets
(source and sink), such that the cut minimizes the total energy of the system.

4.17 Stereo Block Matching Fail


Stereo block matching may fail in scenarios such as:
• Textureless regions: Lack of texture makes it difficult to find distinctive features for matching.
• Repeated patterns: Identical or similar patterns in both images can lead to ambiguous matches.
• Specularities: Reflections or shiny surfaces can distort image appearance, causing mismatches.

4.18 Stereo Reconstruction Pipeline


The stereo reconstruction pipeline typically involves the following steps:
1. Camera Calibration: Determine camera parameters and correct distortions.
2. Rectify Images: Transform images to align corresponding epipolar lines.
3. Compute Disparity: Estimate the pixel-wise disparity between rectified image pairs.
4. Estimate Depth: Calculate depth from the disparity map using known camera parameters.
What Will Cause Errors?
• Camera Calibration Errors: Inaccurate camera parameters can lead to incorrect disparity estima-
tion.
• Poor Image Resolution: Low-resolution images may not contain sufficient detail for accurate match-
ing.

• Occlusions: Objects blocking the view can cause missing or incorrect depth information.
• Violations of Brightness Constancy (Specular Reflections): Changes in illumination or specular
reflections can violate the assumption of brightness constancy, leading to errors in matching.
• Large Motions: Rapid movements between frames can cause motion blur and mismatches.

• Low Contrast Image Resolutions: Images with low contrast may not provide enough information
for reliable matching.

4.19 Siamese Network for Stereo Matching


The Siamese network approach for stereo matching involves the following steps:

1. Train the CNN patch-wise: Train a convolutional neural network (CNN) on patches from pairs of stereo images
along with their ground truth disparity maps. The CNN is trained to learn features that are useful for
matching corresponding pixels between the left and right images.
2. Calculate Features: Once the CNN is trained, use it to extract features for each pixel in both the
left and right images.

3. Correlate Features: Correlate the features between the left and right images, typically using the dot
product operation.
4. Disparity Estimation:
• Winner Takes All (WTA): For each pixel, find the maximum correlated value, indicating the
best match in the other image. This approach is known as winner takes all.
• Global Optimization: Alternatively, run a global optimization algorithm to refine the disparity
estimation across the entire image, considering contextual information and enforcing consistency
constraints.

By leveraging deep learning techniques and training on large datasets, Siamese networks can learn complex
features and improve stereo matching accuracy compared to traditional methods.
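
A rough PyTorch sketch of steps 1-4 (layer sizes, the feature normalization, and the disparity range are illustrative choices, not the architecture of any particular paper):

```python
import torch
import torch.nn as nn

class SiameseFeatures(nn.Module):
    """Small shared-weight CNN producing per-pixel matching features."""
    def __init__(self, c=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        f = self.net(x)
        return f / (f.norm(dim=1, keepdim=True) + 1e-8)   # unit-length features

def correlation_volume(fl, fr, D=64):
    """Dot-product correlation of left features with d-shifted right features."""
    B, C, H, W = fl.shape
    vol = fl.new_full((B, D, H, W), float('-inf'))
    for d in range(D):
        vol[:, d, :, d:] = (fl[:, :, :, d:] * fr[:, :, :, :W - d]).sum(dim=1)
    return vol

# Winner-takes-all disparity: index of the maximum correlation at each pixel.
# net = SiameseFeatures()
# disp = correlation_volume(net(left), net(right)).argmax(dim=1)
```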

4.20 Stereo Matching with Deep Networks


One notable approach in stereo matching using deep networks is DispNet. Here are its key features:

• DispNet: DispNet was one of the first end-to-end trained deep neural networks for stereo matching.
• Architecture: DispNet employs a U-Net like architecture with skip connections to retain details and
capture multi-scale features effectively.
• Correlation Layer: DispNet utilizes a correlation layer, which computes the similarity between
patches at different displacements, typically with a large displacement range compared to the input
image size.
• Multi-Scale Loss: DispNet employs a multi-scale loss function, which considers disparity errors at
multiple scales in the image pyramid. This helps in capturing disparities accurately across different
levels of details.

• Curriculum Learning: DispNet incorporates curriculum learning, where the network is trained on
a curriculum of increasingly difficult examples. It starts by learning from easy examples and gradually
progresses to harder ones, which helps in faster convergence and better generalization.

By leveraging deep neural networks like DispNet, stereo matching algorithms can achieve state-of-the-art
performance in terms of accuracy and efficiency.

4.21 Stereo Mixture Density Networks (SMD-Nets)
Stereo Mixture Density Networks (SMD-Nets) are a variant of deep neural networks specifically designed
for stereo matching tasks. One of their key features is the ability to predict sharper boundaries at higher
resolution compared to traditional methods.
SMD-Nets leverage mixture density models to capture complex relationships between stereo image pairs
and their corresponding disparities. By using mixture density models, SMD-Nets can represent multimodal
distributions, allowing them to predict not only the most likely disparity value for each pixel but also the
uncertainty associated with the prediction.
This ability to model uncertainty is particularly useful in stereo matching, where the correspondence
between pixels in stereo image pairs may be ambiguous or uncertain. By predicting sharper boundaries,
SMD-Nets can provide more accurate and reliable depth estimates, especially in challenging scenarios such
as textureless regions, occlusions, or depth discontinuities.
Overall, SMD-Nets represent a promising approach to stereo matching, offering improved accuracy and
robustness by explicitly modeling uncertainty and predicting sharper boundaries.

4.22 Stereo Datasets


Several stereo datasets are commonly used for training and evaluating stereo matching algorithms. Some of
the popular datasets include:

• Middlebury Stereo Datasets: Middlebury provides a benchmark dataset consisting of stereo im-
age pairs along with ground truth disparity maps. These datasets cover a wide range of scenes and
variations in lighting, texture, and occlusions.
• KITTI: The KITTI dataset is widely used for evaluating stereo algorithms in the context of au-
tonomous driving. It includes stereo image pairs captured from vehicles equipped with stereo cameras,
along with accurate ground truth annotations for depth and motion.
• Synthetic Data: Synthetic datasets such as FlyingThings3D and Monkaa are generated using computer
graphics rendering techniques. These datasets provide stereo image pairs with accurate ground truth
disparities, allowing researchers to train and evaluate stereo matching algorithms under controlled
conditions and diverse environments.

These datasets play a crucial role in the development and evaluation of stereo matching algorithms,
enabling researchers to compare the performance of different methods and assess their generalization capa-
bilities across various real-world scenarios.

5 Active Stereo with Structured Light


Active stereo with structured light involves projecting structured light patterns onto the object to simplify
the correspondence problem in stereo vision. This technique forms the basis for active depth sensors such as
Kinect and iPhone X (using IR).
By using controlled structured light, the correspondence problem becomes easier to solve. The disparity
between laser points on the same scanline in the images determines the 3D coordinates of the laser point on
the object.

5.1 Laser Scanning


Laser scanning utilizes optical triangulation to capture precise 3D information about the surface of an object.
In this method, a single stripe of laser light is projected onto the object, and as it scans across the surface,
the reflected light is captured by a sensor. This results in a highly accurate representation of the object’s
geometry.

5.2 Aligning Range Images
A single range scan may not be sufficient to capture the complete surface of a complex object. Therefore,
techniques are required to register multiple range images obtained from different viewpoints. This process
involves aligning the range images to create a coherent 3D representation of the object. This aligning process
is crucial for further analysis and processing of the captured data.
This alignment of range images leads us to the field of multi-view stereo, where the goal is to reconstruct
the 3D geometry of a scene by combining information from multiple viewpoints.

