Stereo Matching and Rectification
January 6, 2025
1 Camera Model
• World coordinates → (Coordinate transformation) → Camera coordinates → (Perspective Projection)
→ Image Coordinates
Calibrate the camera's intrinsic and extrinsic parameters to accurately transform and project world coordinates to image coordinates.
• Essential matrix E operates on points in the camera coordinate system (3D points projected to 2D
camera coordinates).
E = [t]× R,    F = K′⁻⊤ E K⁻¹    (2)
• Both E and F are rank 2 matrices, but they have different properties:
– E has two equal non-zero singular values.
– F does not have equal singular values.
1.3 Big Picture: 3 Key Components in 3D
1. 3D Points (Structure): The spatial arrangement and coordinates of points in the 3D scene.
2. Estimate Fundamental Matrix: Determine the relationship between two images of the same scene
to find the corresponding points.
3. Correspondences → Camera (Motion): Use the matched points to determine the relative motion
(rotation and translation) between the two cameras.
2. Construct the M × 9 matrix A: Each row of A is constructed from a pair of corresponding points (x̂, x̂′):

   A = | x̂₁x̂′₁  x̂₁ŷ′₁  x̂₁  ŷ₁x̂′₁  ŷ₁ŷ′₁  ŷ₁  x̂′₁  ŷ′₁  1 |
       |   ⋮      ⋮     ⋮     ⋮      ⋮     ⋮    ⋮     ⋮   ⋮ |
       | x̂ₙx̂′ₙ  x̂ₙŷ′ₙ  x̂ₙ  ŷₙx̂′ₙ  ŷₙŷ′ₙ  ŷₙ  x̂′ₙ  ŷ′ₙ  1 |   (5)
3. Find the SVD of A: Perform Singular Value Decomposition to find the matrix V .
A = U DV ⊤ (6)
4. Extract F : The entries of F are the elements of the column of V corresponding to the smallest
singular value.
5. Enforce rank-2 constraint on F : Modify F to ensure it has rank 2 by setting its smallest singular
value to zero.
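Steps 2–5 can be sketched with NumPy's SVD. This is an illustrative sketch (the function name is made up, and the input points are assumed to be already normalized, e.g. by Hartley's isotropic scaling, which is not shown):

```python
import numpy as np

def estimate_fundamental(pts1, pts2):
    """Eight-point estimate of F from n >= 8 correspondences (steps 2-5).

    pts1, pts2: (n, 2) arrays of matching points, assumed already
    normalized.
    """
    x, y = pts1[:, 0], pts1[:, 1]
    xp, yp = pts2[:, 0], pts2[:, 1]
    # One row per correspondence, in the column order of equation (5).
    A = np.stack([x * xp, x * yp, x,
                  y * xp, y * yp, y,
                  xp, yp, np.ones_like(x)], axis=1)
    # Steps 3-4: the entries of F come from the right-singular vector of A
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Step 5: enforce rank 2 by zeroing the smallest singular value of F.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt
```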
1.5 Triangulation
• Correspondences → 3D Points (Structure): Given a set of corresponding points in two images,
triangulate to find their 3D coordinates.
• Camera (Motion) → 3D Points (Structure): Use the known camera positions and orientations
to reconstruct the 3D scene from the image points.
• Calibrate the cameras to find the Essential matrix E or the Fundamental matrix F .
• Use F to compute the epipolar line l′ in the second image.
3. Find the matching point along the epipolar line (Stereo matching):
• Search along the epipolar line for the corresponding point using a matching criterion (e.g., normalized cross-correlation).
4. Perform triangulation:
• Use the corresponding points from both images to compute the 3D coordinates of the point via
triangulation.
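The triangulation step can be sketched with the linear (DLT) method, one common choice that the notes do not prescribe; the function name is illustrative:

```python
import numpy as np

def triangulate(P1, P2, pt1, pt2):
    """Linear (DLT) triangulation of one correspondence.

    P1, P2: 3x4 projection matrices of the two cameras.
    pt1, pt2: (x, y) image coordinates of the same point in each view.
    """
    x1, y1 = pt1
    x2, y2 = pt2
    # Each view contributes two linear constraints on the homogeneous X.
    A = np.stack([x1 * P1[2] - P1[0],
                  y1 * P1[2] - P1[1],
                  x2 * P2[2] - P2[0],
                  y2 * P2[2] - P2[1]])
    # Least-squares solution: right-singular vector of the smallest
    # singular value, then dehomogenize.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With P1 = [I | 0] and P2 a pure unit translation of the same camera, a point such as (1, 2, 5) is recovered exactly from its two projections.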
2.3 Can I Compute Depth from Any Two Images of the Same Object?
To accurately compute depth from two images of the same object, the following conditions must be met:
1. Sufficient Baseline:
• There must be a sufficient distance between the two camera positions to observe a noticeable
disparity.
• A larger baseline improves depth accuracy but can also make matching points more challenging.
2. Rectified Images:
• The images need to be rectified, which means transforming them such that the epipolar lines are
horizontal and aligned.
• Rectification simplifies the search for correspondences to a 1D problem along horizontal lines.
2.4 Effect of Baseline on Stereo Results
• Large Baseline:
– Advantages: Smaller triangulation error, leading to more accurate depth estimates.
– Disadvantages: Matching points between images becomes more difficult due to greater variation
in perspective.
• Small Baseline:
– Advantages: Easier to match points between images due to less variation in perspective.
– Disadvantages: Higher triangulation error, leading to less accurate depth estimates.
1. Estimate Ẽ, decompose into t and R, and construct Rrect as above.
2. Warp pixels in the first image as follows:
Remarks:
z = bf / d    (12)
where:
• z is the depth.
• b is the baseline (distance between the two cameras).
• f is the focal length of the camera.
• d is the disparity (difference in x-coordinates of the corresponding points in the rectified
images).
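Equation (12) translates directly into code; the helper name and the baseline/focal-length values in the comment are illustrative, not from the text:

```python
import numpy as np

def depth_from_disparity(d, b, f):
    """z = b * f / d (equation 12) for a rectified image pair.

    d: disparity in pixels, b: baseline in meters, f: focal length in
    pixels. Non-positive disparity is mapped to infinite depth.
    """
    d = np.asarray(d, dtype=float)
    return np.where(d > 0, b * f / np.maximum(d, 1e-12), np.inf)

# Illustrative numbers: a 0.54 m baseline and a 721-pixel focal length
# give a depth of about 6.08 m for a 64-pixel disparity.
```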
3.2 Similarity Measure
Commonly used similarity measures for evaluating match scores include:
• Sum of Absolute Differences (SAD):

  SAD = Σ_{(u,v)∈window} |IL(u, v) − IR(u + d, v)|    (13)

• Zero-mean SAD:

  Zero-mean SAD = Σ_{(u,v)∈window} |(IL(u, v) − ĪL) − (IR(u + d, v) − ĪR)|    (15)
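Equations (13) and (15) for a single window might be computed as follows (a sketch; the function name is made up, arrays are indexed [row, column] = [v, u], and the window is assumed to lie fully inside both images):

```python
import numpy as np

def sad_scores(left, right, u, v, d, w=2):
    """SAD (eq. 13) and zero-mean SAD (eq. 15) for one candidate
    disparity d, over a (2w+1) x (2w+1) window centered at (u, v)."""
    pl = left[v - w:v + w + 1, u - w:u + w + 1].astype(float)
    pr = right[v - w:v + w + 1, u + d - w:u + d + w + 1].astype(float)
    sad = np.abs(pl - pr).sum()
    # Subtracting each window's mean makes the score robust to a
    # constant brightness offset between the two images.
    zsad = np.abs((pl - pl.mean()) - (pr - pr.mean())).sum()
    return sad, zsad
```

A constant brightness offset between the images leaves the zero-mean score unchanged while inflating plain SAD, which is the point of equation (15).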
3.5 Disparity Space Image (DSI)
First, we introduce the concept of the Disparity Space Image (DSI). The DSI for one row represents pairwise
match scores between patches along that row in the left and right images.
• c(i, j) is the match score for the patch centered at left pixel i with the patch centered at right pixel j.
• The dissimilarity value for each pair of patches is entered as a column in the DSI.
Greedy Selection:
• For each column of the DSI, simply choose the row with the lowest dissimilarity (the best match score).
• Occlusion:
– What if a pixel in the left image is not seen in the right image?
– What if a pixel in the right image is not seen in the left image?
• Ordering Constraint: If pixels (a, b, c) are ordered in the left image, they should have the same
order in the right image.
– This is not always true and depends on the depth of the objects in the scene.
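A minimal DSI for one scanline, with greedy (lowest-dissimilarity) selection, could be sketched as follows; the 1D window and the SAD cost are simplifying assumptions, and the function names are illustrative:

```python
import numpy as np

def dsi_row(left_row, right_row, w=2, max_disp=8):
    """Dissimilarity c(i, d) between the left patch centered at pixel i
    and the right patch centered at pixel i + d, for one scanline.

    Out-of-image entries are set to +inf so they are never selected.
    """
    n = len(left_row)
    dsi = np.full((n, max_disp), np.inf)
    for i in range(w, n - w):
        pl = left_row[i - w:i + w + 1]
        for d in range(max_disp):
            j = i + d  # candidate matching pixel in the right image
            if j + w < n:
                dsi[i, d] = np.abs(pl - right_row[j - w:j + w + 1]).sum()
    return dsi

def greedy_disparity(dsi):
    # Greedy selection: per left pixel, take the disparity with the
    # lowest dissimilarity (ignoring occlusion and ordering).
    return dsi.argmin(axis=1)
```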
4.4 Occlusions: No Matches
Dealing with occlusions:
• Identify occluded regions: Use left-right consistency checks to detect pixels that do not have corresponding matches in the other image.
• Apply constraints or smoothing techniques to ensure that the disparity values change smoothly along
the scanline, except at depth discontinuities.
• Use dynamic programming or other optimization techniques to find a consistent set of disparities that
minimize a global energy function incorporating both matching costs and smoothness constraints.
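The left-right consistency check mentioned above can be sketched on one scanline; the function name and the sign convention for the right disparity map are assumptions for this sketch:

```python
import numpy as np

def lr_consistency(disp_left, disp_right, tol=1):
    """Left-right consistency check on one scanline (1D sketch).

    Convention follows equation (13): left pixel x matches right pixel
    x + disp_left[x]. disp_right stores each right pixel's displacement
    back to the left image, so a consistent match satisfies
    disp_right[x + disp_left[x]] == -disp_left[x] (within tol).
    Returns a boolean mask: False marks likely occluded pixels.
    """
    n = len(disp_left)
    xr = np.clip(np.arange(n) + disp_left, 0, n - 1)  # matched right pixel
    return np.abs(disp_right[xr] + disp_left) <= tol
```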
4.10 Real Scanline Example
Every pixel in the left scanline is now marked with either a disparity value or an occlusion label. This process is repeated for every scanline in the left image. Because each scanline is solved independently, however, the resulting disparity map can be noisy, with streaky artifacts across scanlines.
To address this:
• Incorporate smoothness constraints into the stereo matching process.
• Penalize large changes in disparity values between neighboring pixels.
• Use techniques such as regularization or optimization algorithms to enforce smoothness in the disparity
map.
1. Match Quality:
• We want each pixel to find a good match in the other image.
• The disparity value should accurately represent the corresponding point in the other image.
2. Smoothness:
• If two pixels are adjacent, they should usually move about the same amount.
• Disparity values should change smoothly across the image, except at depth discontinuities.
To achieve these objectives, we can formulate the stereo matching problem as an energy minimization
task. The energy function combines terms for match quality and smoothness, and the goal is to find the
disparity map that minimizes this energy. Optimization algorithms like graph cuts, belief propagation, or
variational methods can be used to find the optimal solution.
4.11 Stereo as Energy Minimization
In stereo vision, we can view the matching problem as an energy minimization task, where the goal is to
find the disparity map that minimizes an energy function. The energy function is typically defined as:
E(d) = Ed(d) + λ Es(d)
• E(d) is the total energy of the disparity assignment d.
• Ed (d) is the data term, representing the match quality. It ensures that each pixel finds a good match
in the other image, typically obtained from block matching results.
• Es (d) is the smoothness term, representing the smoothness constraint. It ensures that adjacent pixels
usually move about the same amount, promoting smooth disparity maps.
• λ is a weighting parameter that balances the importance of the data term and the smoothness term.
The task is to find the disparity map d that minimizes this energy function, typically achieved using
optimization algorithms such as graph cuts, belief propagation, or variational methods.
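The energy of a candidate scanline disparity assignment can be evaluated as below. This sketch only scores a solution (using a squared-difference smoothness term, one common choice); the minimization itself is left to graph cuts, belief propagation, or dynamic programming, and the function name is illustrative:

```python
import numpy as np

def scanline_energy(disp, cost, lam=0.1):
    """Evaluate E(d) = sum_i Ed(d_i) + lam * sum_i Es(d_i, d_{i+1})
    for one scanline.

    cost: (n, max_disp) table of matching costs (e.g. block matching
    results), so the data term is cost[i, disp[i]]; the smoothness term
    penalizes squared disparity jumps between neighboring pixels.
    """
    data = cost[np.arange(len(disp)), disp].sum()
    smooth = np.square(np.diff(disp)).sum()
    return data + lam * smooth
```

Lower energy means better agreement with the match costs and a smoother disparity profile; λ trades the two off, exactly as described above.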
• Occlusions: Objects blocking the view can cause missing or incorrect depth information.
• Violations of Brightness Constancy (Specular Reflections): Changes in illumination or specular
reflections can violate the assumption of brightness constancy, leading to errors in matching.
• Large Motions: Rapid movements between frames can cause motion blur and mismatches.
• Low Contrast or Low Resolution: Images with low contrast or low resolution may not provide enough information for reliable matching.
1. Train CNN Patch-wise: Train a convolutional neural network (CNN) on pairs of stereo images
along with their ground truth disparity maps. The CNN is trained to learn features that are useful for
matching corresponding pixels between the left and right images.
2. Calculate Features: Once the CNN is trained, use it to extract features for each pixel in both the
left and right images.
3. Correlate Features: Correlate the features between the left and right images, typically using the dot
product operation.
4. Disparity Estimation:
• Winner Takes All (WTA): For each pixel, find the maximum correlated value, indicating the
best match in the other image. This approach is known as winner takes all.
• Global Optimization: Alternatively, run a global optimization algorithm to refine the disparity
estimation across the entire image, considering contextual information and enforcing consistency
constraints.
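Steps 3 and 4 above can be sketched with plain NumPy; the function name is illustrative, and any per-pixel descriptor stands in here for the learned CNN features:

```python
import numpy as np

def correlate_and_wta(feat_left, feat_right, max_disp=8):
    """Correlate per-pixel features by dot product and take the winner.

    feat_left, feat_right: (H, W, C) feature maps (in the method above,
    the output of the trained CNN). Matching convention as in eq. (13):
    left pixel (v, u) against right pixel (v, u + d).
    """
    H, W, _ = feat_left.shape
    corr = np.full((H, W, max_disp), -np.inf)
    corr[:, :, 0] = (feat_left * feat_right).sum(-1)
    for d in range(1, max_disp):
        corr[:, :-d, d] = (feat_left[:, :-d] * feat_right[:, d:]).sum(-1)
    # Winner-takes-all: per pixel, the disparity with maximum correlation.
    return corr.argmax(axis=-1)
```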
By leveraging deep learning techniques and training on large datasets, Siamese networks can learn complex
features and improve stereo matching accuracy compared to traditional methods.
• DispNet: DispNet was one of the first end-to-end trained deep neural networks for stereo matching.
• Architecture: DispNet employs a U-Net like architecture with skip connections to retain details and
capture multi-scale features effectively.
• Correlation Layer: DispNet utilizes a correlation layer, which computes the similarity between
patches at different displacements, typically with a large displacement range compared to the input
image size.
• Multi-Scale Loss: DispNet employs a multi-scale loss function, which considers disparity errors at
multiple scales in the image pyramid. This helps in capturing disparities accurately across different
levels of detail.
• Curriculum Learning: DispNet incorporates curriculum learning, where the network is trained on
a curriculum of increasingly difficult examples. It starts by learning from easy examples and gradually
progresses to harder ones, which helps in faster convergence and better generalization.
By leveraging deep neural networks like DispNet, stereo matching algorithms can achieve state-of-the-art
performance in terms of accuracy and efficiency.
4.21 Stereo Mixture Density Networks (SMD-Nets)
Stereo Mixture Density Networks (SMD-Nets) are a variant of deep neural networks specifically designed
for stereo matching tasks. One of their key features is the ability to predict sharper boundaries at higher
resolution compared to traditional methods.
SMD-Nets leverage mixture density models to capture complex relationships between stereo image pairs
and their corresponding disparities. By using mixture density models, SMD-Nets can represent multimodal
distributions, allowing them to predict not only the most likely disparity value for each pixel but also the
uncertainty associated with the prediction.
This ability to model uncertainty is particularly useful in stereo matching, where the correspondence
between pixels in stereo image pairs may be ambiguous or uncertain. By predicting sharper boundaries,
SMD-Nets can provide more accurate and reliable depth estimates, especially in challenging scenarios such
as textureless regions, occlusions, or depth discontinuities.
Overall, SMD-Nets represent a promising approach to stereo matching, offering improved accuracy and
robustness by explicitly modeling uncertainty and predicting sharper boundaries.
• Middlebury Stereo Datasets: Middlebury provides a benchmark dataset consisting of stereo image pairs along with ground truth disparity maps. These datasets cover a wide range of scenes and variations in lighting, texture, and occlusions.
• KITTI: The KITTI dataset is widely used for evaluating stereo algorithms in the context of autonomous driving. It includes stereo image pairs captured from vehicles equipped with stereo cameras, along with accurate ground truth annotations for depth and motion.
• Synthetic Data: Synthetic datasets such as FlyingThings3D and Monkaa are generated using computer
graphics rendering techniques. These datasets provide stereo image pairs with accurate ground truth
disparities, allowing researchers to train and evaluate stereo matching algorithms under controlled
conditions and diverse environments.
These datasets play a crucial role in the development and evaluation of stereo matching algorithms, enabling researchers to compare the performance of different methods and assess their generalization capabilities across various real-world scenarios.
5.2 Aligning Range Images
A single range scan may not be sufficient to capture the complete surface of a complex object. Therefore,
techniques are required to register multiple range images obtained from different viewpoints. This process
involves aligning the range images to create a coherent 3D representation of the object. This aligning process
is crucial for further analysis and processing of the captured data.
This alignment of range images leads us to the field of multi-view stereo, where the goal is to reconstruct
the 3D geometry of a scene by combining information from multiple viewpoints.