Real-Time Vehicle Distance Estimation Using Single View Geometry
Ahmed Ali, Ali Hassan, Afsheen Rafaqat Ali, Hussam Ullah Khan, Wajahat Kazmi, Aamer Zaheer
KeepTruckin Inc.
Lahore, Pakistan
{Ahmed.Ali, Ali.Hassan, Afsheen, Hussam.Khan, Wajahat.Kazmi, AZ}@KeepTruckin.com
Abstract
… resolution RGB images of the scene but will demand intelligent post processing using robust algorithms and higher compute power on board.

In computer vision, the problem of distance estimation falls in the broader area of scene depth estimation. To this end, various data-driven and geometric approaches have been explored in the literature. In data-driven approaches, the depth of objects is learnt by training either a supervised depth regressor [10] or an unsupervised disparity estimator [13, 45]. Geometric approaches, on the other hand, take into account the geometry of the features in the scene, as in Multi-View Stereo (MVS) and Single View Metrology [43]. MVS has remained a primary focus for depth recovery in the past [21]. It provides depth for each pixel via dense matching of more than one image. MVS is limited by the availability of texture in the scene, the efficiency of the algorithms and the complexity of the multi-camera configuration, and the accuracy of the depth map may not be consistent throughout. Single view metrology, instead, uses a single image to compute 3D measurements using perspective geometry [7]. Being ill-posed, it has remained limited to images of structured environments where a combination of parallel and orthogonal lines and surfaces is substantially available. In this paper, we use monocular dashcam images to estimate relative distances of vehicles on the road. For evaluation, we used lidar data as ground truth and established a comparison of the data-driven approaches with the radar data. We propose an efficient and data-independent algorithm for real-time distance estimation on flat road surfaces using single view geometry. With the help of any available lane and vehicle detection modules (which are integral components of ADAS and self-driving cars), it fully exploits the structure in the scene. We will show that it competes well with deep learning methods and, under the geometric constraints, it outperforms end-to-end deep learning based methods as well as the radar data in both accuracy and efficiency, making it suitable for edge devices. Since autonomous driving is currently limited to highway driving scenarios, it is well suited for such tasks. And except for the camera initialization phase, which is a one-time computation, it comes at almost no computational expense.

The breakdown of the paper is as follows: Section 2 provides a quick review of depth estimation literature. Section 3 discusses our approach in detail, with ADAS modules for lane and vehicle detection in Section 3.1, camera initialization in Section 3.2 and distance estimation in Section 3.3. Section 4 outlines the datasets, training and testing. Results and a discussion of the results are presented in Section 4.1, and Section 5 concludes the paper.

… monocular imaging and average vehicle width in the real world to estimate the horizon. Following that, the distance of the bottom of the vehicle's detected bounding box from the horizon provided a hint of its relative distance. But they assumed an already known camera height, which can be difficult to obtain for a multitude of vehicles across a fleet. In 2017, the top performing submission to TuSimple's velocity estimation challenge employed deep learning based depth prediction techniques combined with object (vehicle) detection to find relative velocities [37]. Keeping this in view, we look into the research work done in the two domains, i.e. single view geometry for depth estimation (our proposal) and the state-of-the-art deep learning based depth prediction.

2.1. Single view geometry based depth estimation

In computer vision, single view based recovery of scene geometry has been thoroughly studied [4, 36, 30, 17]. Liebowitz et al. used geometric properties such as parallelism and orthogonality of structures to recover 3D models from one or two views without explicitly knowing the metric properties of objects [26]. Lee et al. showed that, even in the presence of clutter, a full 3D model can be recovered from a single image using geometric reasoning on detected line segments produced from planar surfaces [24]. Recently, Zaheer et al. used orthogonal lines in a multi-planar single view to recover a full 3D reconstruction of the scene [43]. These approaches are particularly interesting for indoor man-made environments or for outdoor scenes with man-made structures around, the reason being that human architecture involves a profusion of parallel and orthogonal lines emanating from planar surfaces.

Apart from 3D model recovery from a single view, substantial work has also been done in single view metrology. Liebowitz and Zisserman proposed metric rectification of planes from a single perspective view and used it to measure angles, lengths and ratios under the given constraints of one known angle, two unknown but equal angles and a known length ratio [27]. Zaheer and Khan showed that projectively distorted views of man-made 3D structures can be metric rectified using orthogonal plane-pairs [42]. Liu et al. estimated depth from a single view given semantic segmentation under geometric constraints [28]. On this topic, Criminisi et al. explain single and multi view metric reconstructions in detail, especially determining camera pose, computing the distance between planes parallel to a reference plane and finding area and length ratios on any plane parallel to the reference plane [7, 6].

2.2. Data-driven depth prediction
Sinz et al. used regression to estimate a pixel-wise depth map, avoiding complications of stereo calibration [35]. Saxena et al. used monocular images of a variety of unstructured outdoor scenes for training Markov Random Fields (MRFs) on both local and global features for the recovery of fairly accurate depth maps [34]. Ladicky et al. designed a stereo matching classifier for predicting the likelihood of a pixel in one image matching the pixel offset by the disparity in the second image [23].

Figure 2. Block diagram of our algorithm. Initialization block: lane boundary marker detection, horizon estimation, IPM computation, image rectification, parallel line fitting and camera height estimation, yielding the camera height and road plane normal. Post-initialization block: vehicle bounding boxes and the initialized camera parameters feed the distance estimation.
Owing to little success, the data-driven approaches for depth estimation did not attract much attention until the advent of deep learning. Deep learning based depth estimation made a huge impact in terms of holistic and accurate depth prediction, so much so that a significant number of contributions have surfaced over a short span of time [40, 46, 25, 41, 45]. Žbontar and LeCun trained a small convolutional neural network for matching stereo pairs instead of handcrafted feature matching and showed results better in terms of both the accuracy of the depth maps and efficiency [44]. Eigen et al. used two deep networks in cascade [8] to predict high-level global depth followed by refinement to low-level local depth in a supervised regression framework. This approach acquired a lot of traction and several improvements followed, such as adding Conditional Random Fields (CRFs) by Li et al. [39]. Cao et al. transformed monocular depth estimation from regression into a supervised classification problem [5]. Fu et al. used an ordinal regression loss in supervised training to eliminate slow convergence and sub-optimal minima [10]. Their method is the best performing one on the KITTI benchmark [2] with iRMSE 12.98 (1/km) and RMSE 2.6 (m).

Supervised learning requires ground truth depth data, which in the case of deep learning becomes challenging to acquire. Therefore, recently, some unsupervised approaches have appeared that have performed exceptionally well. Garg et al. proposed a convolutional encoder network in which they use one image of a stereo pair and predict a disparity map [11]. An inverse warp of the second image is then produced from the known inter-view displacement and the disparity map, which is then used to reconstruct the first image. The photometric error in the reconstruction defines the network loss. An improvement to this monocular depth learning from stereo pairs was introduced by Godard et al. by integrating a left-right consistency check [13]. Kundu et al. employed unsupervised adversarial learning for depth prediction [22].

3. Our approach

We use single view geometry for distance estimation. As argued earlier, single view geometry depends on the perspective of the parallel or orthogonal lines and planes in the image. As lane lines are generally parallel, we exploit this parallelism. We start with unknown camera extrinsics because this information is usually difficult to obtain and may change from time to time. Instead, we propose an automated way of finding the camera height using the horizon through cross-ratios of parallel lines [15]. Determination of the camera height completes camera initialization, which only needs to be done once for each rigid camera assembly. The road plane is then reconstructed and the distance from other vehicles on the road can be calculated through simple math. A block diagram of our full pipeline is shown in Figure 2. For camera initialization, we assume a straight and flat road surface. Lane boundaries and vehicles can be detected using any off-the-shelf lane and vehicle detection algorithms. In the following section, we briefly describe the lane and vehicle detectors that are used in the evaluation of distance estimation in this paper.

3.1. Lane and vehicle detection

To exploit road geometry, detection of lane boundaries is a crucial step, for which we used the lane detection network by Khan et al. [20], a fully convolutional deep network for detecting keypoints sampled along the lane boundaries. The network architecture diagram is shown in Figure 3(a). It consists of an encoder block (ResNet50 [16] in this case), responsible for feature extraction, and a decoder block that learns the relationship among extracted features and makes decisions on top of it. The decoder block comprises four convolutional layers and a single transpose-convolution layer (that upsamples the feature map by 4x), followed by the output convolution layer. In the output layer, just like Mask-RCNN, a one-hot mask for each keypoint is generated, where only the keypoint pixel is encoded (marked high or foreground).

For training the lane detection network, the CULane [31] dataset has been used. Every lane boundary ground truth was sampled into 40 keypoints; this number was chosen based on the best possible trade-off between efficiency and accuracy [20], with a greater density of points closer to the road plane horizon (Figure 3(b)). This is done by splitting every lane into three vertical segments and sampling equally spaced keypoints (20, 10 and 10) from each segment, starting from the one closer to the horizon. At inference time, a line/curve is fitted on the predicted keypoints using RANSAC [9]. A best-fit line must have a minimum number of inlier markers, which in the case of 40 markers per lane boundary was kept at 10.
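As an illustration of this inference-time step, the following is a minimal sketch (ours, not the authors' code) of RANSAC line fitting on the predicted keypoints of one lane boundary; the pixel threshold and iteration count are assumed values, while the minimum of 10 inliers follows the text above.

```python
import numpy as np

def ransac_line(points, n_iters=100, inlier_thresh=3.0, min_inliers=10):
    """Fit a 2D line to lane keypoints with RANSAC.

    points: (N, 2) array of (x, y) keypoints predicted for one lane boundary.
    Returns the homogeneous line (a, b, c) with a*x + b*y + c = 0, or None
    if fewer than `min_inliers` keypoints support the best hypothesis.
    """
    best_line, best_inliers = None, 0
    rng = np.random.default_rng(0)
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        line = np.cross(pts_h[i], pts_h[j])                  # line through the two samples
        norm = np.hypot(line[0], line[1])
        if norm < 1e-9:
            continue
        dist = np.abs(pts_h @ line) / norm                   # point-to-line distances
        inliers = int((dist < inlier_thresh).sum())
        if inliers > best_inliers:
            best_line, best_inliers = line / norm, inliers
    return best_line if best_inliers >= min_inliers else None
```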
Figure 3. (a) Lane detection network architecture: Encoder Network and Decoder Network (2048-channel encoder features, 4x upsampling in the decoder). (b) Keypoint sampling along the lane boundaries.
… ratios. Since we already have detected lane lines in the image which lie on the road plane, we can find three collinear points by drawing any line l' which intersects all three lane lines at points a', b' and c' respectively.

The points a', b' and c' are in fact images of their real-world counterparts a, b and c, which lie on a line l as shown in Figure 5(b). The line l' is the image of line l. In order to find the lateral vanishing point of the road plane, according to the description above, we need to know the length ratios between these real-world points. Since these three points lie on three (parallel) lane lines (l1, l2 and l3), it can be established through similar triangles that the ratio of the line segments $\frac{ab}{bc}$ is equal to $\frac{d_1}{d_2}$, where d1 and d2 are the distances between lines l1, l2 and l2, l3 respectively, as shown in Figure 5(b). Here d1 and d2 are the lane widths.

Figure 5. (a) Perspective view of the road plane. Lane lines (l1', l2' and l3') are shown in red and an arbitrary line l' (blue) cuts all lane lines at points a', b' and c'. The forward vanishing point (V) is obtained by the cross product of any two lane lines. (b) Real-world top-down view of the road plane. d1 and d2 are the distances between lane lines l1, l2 and l2, l3 respectively. Since all lane lines are parallel, the triangles $\triangle_1$ and $\triangle_2$ are similar, which implies $\frac{ab}{bc} = \frac{d_1}{d_2}$.

Although the algorithm will still work if neighbouring lanes do not have the same width (it only requires the ratio of the widths, i.e. $\frac{d_1}{d_2}$, to be known), we assume the ratio to be 1 (d1 = d2 = lane width), since lane lines are generally equidistant. Since the initialization is required only once for a camera, it can be done on a highway or a section of the highway where lanes are consistent/equidistant (using GPS).
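As a concrete illustration of the cross-ratio construction from [15], here is a minimal sketch (ours, not the authors' code) that recovers the lateral vanishing point on the line l' from the image points a', b' and c'; the width ratio d1/d2 defaults to the equal-width assumption used above.

```python
import numpy as np

def lateral_vanishing_point(a, b, c, width_ratio=1.0):
    """Lateral VP on the image line through a, b, c (each a 2D image point).

    a, b, c are images of collinear road points cut out by the lane lines,
    with real-world spacing ratio width_ratio = d1/d2. Uses the invariance of
    the cross ratio: {a, b; c, V_inf} in the world equals {a', b'; c', v'}
    in the image.
    """
    a, b, c = map(np.asarray, (a, b, c))
    direction = (c - a) / np.linalg.norm(c - a)
    tb = float(np.dot(b - a, direction))   # 1D coordinate of b along the line
    tc = float(np.dot(c - a, direction))   # 1D coordinate of c along the line
    k = width_ratio + 1.0                  # world cross ratio (d1 + d2) / d2
    tv = tb * tc / (k * tb + (1.0 - k) * tc)
    return a + tv * direction              # lateral vanishing point in image coords
```

For equal lane widths (k = 2) this reduces to t_v = t_b t_c / (2 t_b - t_c); when the denominator approaches zero the vanishing point moves to infinity, i.e. the image line is already affinely rectified.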
After getting two vanishing points (VPs) per frame, we collected all the sets of vanishing points across all the frames of a video clip. The horizon was found as the best fitted line to the VPs, as described in Algorithm 1.

Algorithm 1: Best fitting horizon (l_h) estimation
Input: Detected markers by lane detector
Output: Estimated horizon line
Find forward and lateral VPs for each frame in a video:
  initialize;
  for each frame i in the video do
    Fit lines to keypoints (B = number of lines);
    if B >= 2 then
      for each of the $\binom{B}{2}$ line pairs do
        forward_vp = crossprod(line 1, line 2)
      end
    end
    if B >= 3 then
      for each of the $\binom{B}{3}$ line triplets do
        Get lateral_vp using cross ratios as described in Section 3.2.2
      end
    end
  end
RANSAC for best fitting horizon across the video:
  initialize;
  while iterations <= T or num_inliers <= max_inliers do
    s = random integer between 1 and num_forward_vps;
    t = random integer between 1 and num_lateral_vps;
    Fit line l_h to forward_vp(s) and lateral_vp(t);
    if dist(l_h, forward_vp(s)) <= threshold_s and dist(l_h, lateral_vp(t)) <= threshold_t then
      record and count inlier points
    end
  end

3.2.3 Image rectification

Now that the horizon has been determined, the IPM can be found. We computed the road plane normal $\hat{n}$ using the camera intrinsic matrix $K$ and the horizon $l_h$ [14] as $\hat{n} = K^T l_h$. The IPM is then computed as $H_{IPM} = K R K^{-1}$, where $R = [\, l_h \times \hat{n};\ (l_h \times \hat{n}) \times -\hat{n};\ -\hat{n} \,]$ is the rotation matrix and $H_{IPM}$ is the rectification homography which rotates the camera view to align its Z-axis with the road's normal, as shown in Figure 4 (Right). Frames can now simply be rectified into a bird's eye view by applying this IPM (Figure 4 (Right)).
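To make the rectification step concrete, here is a minimal numpy sketch (ours, not the authors' code) of $\hat{n} = K^T l_h$ and $H_{IPM} = K R K^{-1}$; the intrinsic matrix and horizon values are assumed placeholders, and the second rotation row is chosen to complete a right-handed orthonormal basis (the construction above, up to sign).

```python
import numpy as np

def ipm_homography(K, horizon):
    """Rectification homography H_IPM = K R K^{-1} from the horizon line.

    K: 3x3 camera intrinsic matrix; horizon: homogeneous image line (3,).
    Returns (H_IPM, n_hat) where n_hat is the unit road plane normal.
    """
    l_h = np.asarray(horizon, dtype=float)
    n_hat = K.T @ l_h
    n_hat /= np.linalg.norm(n_hat)          # unit road normal, n = K^T l_h
    r1 = np.cross(l_h, n_hat)
    r1 /= np.linalg.norm(r1)                # first axis, orthogonal to the normal
    r2 = np.cross(-n_hat, r1)               # completes a right-handed orthonormal basis
    R = np.vstack([r1, r2, -n_hat])         # rows of the rotation matrix
    H_ipm = K @ R @ np.linalg.inv(K)
    return H_ipm, n_hat

# Usage with an assumed 1280x720 camera and a roughly level horizon line y = 350:
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
H_ipm, n_hat = ipm_homography(K, horizon=[0.0, 1.0, -350.0])
```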
3.2.4 Parallel line fitting

Parallel lines can be represented as
$$a_1 x + b_1 y + c_1 = 0, \qquad a_2 x + b_2 y + c_2 = 0$$
where the coefficients $a_1 = a_2$ and $b_1 = b_2$, and only the intercepts $c_1$ and $c_2$ differ. After the minimum-number-of-inliers constraint is satisfied, parallel lines are fitted to the inlier markers using least squares by solving $Ax = 0$. Here $A$ contains the inlier points (subscript) for each parallel line (superscript) in the rectified view. This system is solved using singular value decomposition for the least singular value.
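A minimal sketch (ours) of this homogeneous least-squares fit: every line shares the direction coefficients (a, b), each keeps its own intercept, and the solution is the right singular vector associated with the smallest singular value.

```python
import numpy as np

def fit_parallel_lines(point_sets):
    """Fit K lines a*x + b*y + c_k = 0 sharing (a, b) to K sets of 2D points.

    Builds the homogeneous system A x = 0 with x = [a, b, c_1, ..., c_K] and
    takes the right singular vector of the smallest singular value.
    """
    K = len(point_sets)
    rows = []
    for k, pts in enumerate(point_sets):
        for (x, y) in pts:
            row = np.zeros(2 + K)
            row[0], row[1] = x, y      # shared direction coefficients a, b
            row[2 + k] = 1.0           # per-line intercept c_k
            rows.append(row)
    A = np.vstack(rows)
    _, _, Vt = np.linalg.svd(A)
    sol = Vt[-1]                       # null vector in the least-squares sense
    a, b, cs = sol[0], sol[1], sol[2:]
    norm = np.hypot(a, b)              # normalise so (a, b) is a unit normal
    return a / norm, b / norm, cs / norm
```

With the coefficients normalised this way, the gap |c2 - c1| / sqrt(a^2 + b^2) between consecutive fitted lines is the per-frame lane width used in Section 3.2.5.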
Figure 6. Sample results on KITTI [12] (top) and nuScenes [3] (bottom). Displayed numbers are the distances in meters (ground truth, ours, MonoDepth, DORN and, where available, radar). Vehicles are numbered for clarity.
3.2.5 Lane Width Initialization

In the rectified view of a given frame, a pair of consecutive parallel lines was selected and the distance between the lines was calculated per frame through:
$$LW_p = \frac{\sum_{f} \frac{|c_2 - c_1|}{\sqrt{a^2 + b^2}}}{f} \qquad (1)$$
where $f$ is the number of inlier frames and $LW_p$ is the average initialized lane width. A frame is considered an inlier frame if it has at least one forward and one lateral vanishing point.

3.2.6 Camera height estimation

Now that we have the lane width in pixels $LW_p$, we can use the camera focal length ($f_c$) to compute the viewing angle $\alpha$ using trigonometry:
$$\alpha = 2 \tan^{-1}\!\left(\frac{LW_p / 2}{f_c}\right) \qquad (2)$$
Once the camera viewing angle $\alpha$ is known, the camera height $h_c$ is given by
$$h_c = \frac{LW_r / 2}{\tan(\alpha / 2)} \qquad (3)$$
Here $LW_r$ is the lane width in the real world; considering the fact that lane widths are usually standard across highways, $LW_r$ is generally known. Determining the camera height completes camera initialization. This is depicted in Figure 4 (Right).
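Since Eqs. (1)-(3) are simple arithmetic, they can be shown end to end; this is a sketch (ours) with assumed values for the focal length and the real-world lane width (3.7 m, a common highway lane width), and toy fitted-line coefficients standing in for real inlier frames.

```python
import numpy as np

def camera_height(lane_width_px, focal_px, lane_width_m=3.7):
    """Camera height from the average rectified lane width in pixels (Eqs. 2-3)."""
    alpha = 2.0 * np.arctan((lane_width_px / 2.0) / focal_px)   # viewing angle, Eq. (2)
    return (lane_width_m / 2.0) / np.tan(alpha / 2.0)           # h_c, Eq. (3)

# Eq. (1): average the per-frame lane width over inlier frames, each frame giving
# a fitted parallel-line pair (a, b, c1, c2) in the rectified view (toy values here).
inlier_frames = [(1.0, 0.0, 100.0, 1300.0), (1.0, 0.0, 98.0, 1302.0)]
lw_p = sum(abs(c2 - c1) / np.hypot(a, b)
           for a, b, c1, c2 in inlier_frames) / len(inlier_frames)
h_c = camera_height(lw_p, focal_px=1000.0)   # assumed focal length in pixels
print(round(lw_p, 1), round(h_c, 2))
```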
3.3. Distance estimation

Once the camera is initialized, we know the camera height and the road plane normal. We can then reconstruct the road plane:
$$\pi_r = \begin{bmatrix} \hat{n} \\ -h_c \end{bmatrix} \qquad (4)$$
where $\hat{n}$ is the road plane normal and $\pi_r$ is the $4 \times 1$ reconstructed road plane vector. A ray from the point $p_i$ in the image plane of the camera is given by $\gamma_r = K^{-1} p_i$. So the distance from the image plane to the point where this ray intersects the reconstructed road plane in the real world can be found by solving:
$$\pi_r^T \begin{bmatrix} d_r \gamma_r \\ 1 \end{bmatrix} = 0 \qquad (5)$$
where $d_r$ is the distance at which $\gamma_r$ intersects the plane:
$$d_r = \frac{h_c}{\hat{n} \cdot \gamma_r} \qquad (6)$$
This is depicted in Figure 4 (Left). It is this simple arithmetic calculation that estimates the distance of the detected objects on the reconstructed road plane, which makes it computationally inexpensive. A shift in the horizon, which can be caused by a changing slope of the road, can impart a difference to the observed distances. The best way to investigate the extent of this error is to evaluate our algorithm on publicly collected datasets and compare it against the computationally expensive but state-of-the-art deep learning approaches.
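Per detection, Eqs. (4)-(6) amount to a back-projected ray, one dot product and one division. A sketch (ours, with an assumed intrinsic matrix, and taking the bottom-center of the detected bounding box as the ground contact point p_i):

```python
import numpy as np

def vehicle_distance(box, K, n_hat, h_c):
    """Distance along the back-projected ray to the road plane (Eqs. 4-6).

    box: (x1, y1, x2, y2) detected vehicle bounding box in pixels.
    n_hat: unit road plane normal from initialization; h_c: camera height (m).
    """
    x1, y1, x2, y2 = box
    p_i = np.array([(x1 + x2) / 2.0, y2, 1.0])   # bottom-center as the ground contact point
    gamma_r = np.linalg.inv(K) @ p_i             # back-projected ray, gamma_r = K^-1 p_i
    return h_c / float(n_hat @ gamma_r)          # d_r = h_c / (n_hat . gamma_r), Eq. (6)

# Assumed camera intrinsics and initialization outputs:
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
n_hat = np.array([0.0, 0.99995, 0.01])           # road normal from Section 3.2.3
d = vehicle_distance((600, 300, 700, 380), K, n_hat, h_c=1.5)
print(round(d, 1))                                # distance in meters
```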
4. Experimental setup

To evaluate our distance estimation algorithm, we used the well-known datasets KITTI [12], nuScenes [3] and the recently released Lyft Level 5 dataset [18]. All the datasets have color images along with lidar data, which is used as distance ground truth. nuScenes also provides front radar data. In order to localize vehicles in the scenes, we used our vehicle detector as described in Section 3.1. During the assessment, among the overlapping vehicle boxes only …
Algorithm | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Abs. Rel | RMSE | RMSE log | Sq. Rel
(a) KITTI dataset
DORN [10] | 0.92 | 0.96 | 0.98 | 0.11 | 2.44 | 0.18 | 0.44
MonoDepth [13] | 0.82 | 0.94 | 0.98 | 0.13 | 6.34 | 0.21 | 1.36
Ours | 0.53 | 0.89 | 0.98 | 0.29 | 6.24 | 0.30 | 2.02
(b) nuScenes dataset
DORN [10] | 0.55 | 0.85 | 0.96 | 0.25 | 7.43 | 0.33 | 1.83
MonoDepth [13] | 0.56 | 0.81 | 0.93 | 0.23 | 8.38 | 0.36 | 2.16
Radar | 0.73 | 0.80 | 0.85 | 0.48 | 11.66 | 0.71 | 13.25
Ours | 0.78 | 0.94 | 0.97 | 0.15 | 6.10 | 0.24 | 1.29
(c) Lyft Level 5 dataset
DORN [10] | 0.11 | 0.30 | 0.67 | 0.73 | 11.43 | 0.57 | 7.91
MonoDepth [13] | 0.03 | 0.12 | 0.45 | 0.50 | 12.23 | 0.77 | 5.50
Ours | 0.69 | 0.87 | 0.93 | 0.17 | 7.31 | 0.32 | 1.53
Table 1. Distance estimation results on (a) KITTI, (b) nuScenes and (c) Lyft Level 5 datasets. Metrics in red (the δ accuracies): higher is better. Metrics in blue (the error metrics): lower is better. Metrics are explained in [8, 13]. Best scores are in bold; the scores of our algorithm are underlined.
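For reference, the table's columns are the standard monocular depth metrics of [8, 13]; a short sketch (our own summary, not the authors' evaluation code) of how they are computed from matched predicted and ground-truth distances:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics over matched predicted/ground-truth distances."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta<1.25":   float((ratio < 1.25).mean()),
        "delta<1.25^2": float((ratio < 1.25 ** 2).mean()),
        "delta<1.25^3": float((ratio < 1.25 ** 3).mean()),
        "abs_rel":      float((np.abs(pred - gt) / gt).mean()),
        "sq_rel":       float(((pred - gt) ** 2 / gt).mean()),
        "rmse":         float(np.sqrt(((pred - gt) ** 2).mean())),
        "rmse_log":     float(np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean())),
    }

print(depth_metrics([24, 37, 15], [24, 36, 13]))  # toy example
```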
… supervised deep learning model (MonoDepth). Therefore, even after aiming for a highway driving scenario, it is still competing with the state-of-the-art data-driven approaches on a variety of road types. For a fair assessment of the robustness and domain invariance of the algorithms in consideration, we tested these models along with our algorithm on some other public datasets.

4.1.2 Tests on nuScenes

The nuScenes dataset contains recorded videos of mainly city scenes. It has 1000 video clips, out of which 150 comprise the validation set that we used in this assessment. The dataset contains clips from multiple cameras installed on the vehicle. We used the front mounted camera CAM_FRONT for evaluation. Table 1 (b) shows the results.

The results show that, unlike on KITTI, the deep learning networks, both supervised and unsupervised, do not perform as well. Our algorithm, however, performs better than all, including radar, in both δ (0.78) and RMSE (6.10). The bottom row of Figure 6 shows some nuScenes results where the horizon was estimated correctly and our algorithm performed well. In contrast, Figure 7 displays an example with sub-optimal horizon estimation and hence an incorrect distance estimate, which is a limitation of our system.

4.1.3 Tests on Lyft Level 5 Dataset

The Lyft dataset is the newest and the largest of all the datasets. The setup is similar to nuScenes. We used the data from the forward-looking Cam0, which includes 22680 sample images collected on relatively planar roads as opposed to nuScenes, making it the largest dataset in our evaluation setup. Table 1 (c) shows the results. Since MonoDepth and DORN are trained on KITTI, their performance is compromised on a new camera and its settings, with RMSEs of 12.23 and 11.43, respectively. Our algorithm, however, outperforms them with an RMSE of 7.31.

4.2. Processing efficiency

In the USA, traffic rules require a 3 to 5 second headway from the vehicle in front. Therefore, for an ADAS system, real-time operation is imperative as the response time is very short and the compute power onboard is very limited. As argued earlier, this is one of the reasons for using lidars and radars. In this regard, we compared our algorithm with the others for processing time and the results are shown in Table 2. Deep networks require high compute power. As far as our method is concerned, there are no specific processing requirements. It is just a single division and a dot product, which takes 2.31e−05 secs on a Core i7 CPU. The initialization part can use any available lane detection framework, which will be in operation anyway within an ADAS or self-driving car. Our lane detection network discussed in Section 3.1 runs at 0.125 secs/frame. But it is a one-off calculation for a limited number of frames, and re-initialization will not be required until the camera configuration is disturbed. This brings our algorithm on par with radar and lidar data processing without demanding any excessive compute power on onboard edge devices.

Algorithm | DORN | MonoDepth | Ours
GPU | 2.10 | 0.124 | —
CPU | 130 | 0.795 | 23e−06
Table 2. Processing time per frame in secs averaged over 100 frames of nuScenes. GPU: Nvidia GTX-1080 with 8 GB on-chip RAM, CPU: Intel Core i7 7820HK 2.9 GHz with 32 GB system RAM. Our algorithm's time is the post-initialization run-time for distance calculation. During initialization, our lane detector was also used at 0.124 secs/frame (one-time execution per camera).

5. Conclusion

In this paper we have presented a single view geometry based relative distance estimation algorithm for road vehicles. We have compared our algorithm with data-driven deep learning (DL) based methods. Supervised deep learning performed exceptionally well on KITTI data. The compromised performance of the DL based algorithms on the nuScenes and Lyft datasets was perhaps due to domain variance, since the models were trained on KITTI data only. Therefore, although data-driven approaches are the future, they do have their shortcomings. We instead proposed a geometry based alternative that is simple, intuitive, independent of data and performs almost equally well on all the tested datasets, keeping RMSE within a very narrow range, i.e. 6.10 to 7.31. Our algorithm uses the ADAS module of lane detection only for the initialization part, which is computationally expensive since it is usually a DL based method. But initialization is a one-off calculation and after that, our distance estimation works in real time.

In this work, we also conclude that both geometric as well as deep learning frameworks are strong contestants to replace active sensors for ADAS systems as well as self-driving cars, which will be a great help in reducing the cost of the intelligent transport systems of the future.

Acknowledgements

This work was funded by KeepTruckin Inc. The authors greatly acknowledge the assistance of Mr. Numan Sheikh and Mr. Abdullah Hasan in the data collection and review process.

References

[1] https://www.businessinsider.com/googles-waymo-reduces-lidar-cost-90-in-effort-to-scale-self-driving-cars-2017-1. Business Insider, Google Waymo reduces cost of Lidars by 90%, 2017.
[2] http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction. KITTI Depth prediction benchmark.
[3] https://www.nuscenes.org/. nuScenes, 2018.
[4] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin. Fast Automatic Single-View 3-d Reconstruction of Urban Scenes. In European Conference on Computer Vision, 2008.
[5] Y. Cao, Z. Wu, and C. Shen. Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(11):3174–3182, Nov 2018.
[6] A. Criminisi. Accurate Visual Metrology from Single and Multiple Uncalibrated Images. Springer-Verlag, Berlin, Heidelberg, 2001.
[7] A. Criminisi, I. Reid, and A. Zisserman. Single View Metrology. International Journal of Computer Vision, 40(2):123–148, Nov 2000.
[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2014.
[9] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 24(6):381–395, Jun 1981.
[10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep Ordinal Regression Network for Monocular Depth Estimation. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
[11] R. Garg, G. VijayKumarB., and I. D. Reid. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In European Conference on Computer Vision, 2016.
[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32:1231–1237, 2013.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] R. Hartley and A. Zisserman. More Single View Geometry. In Multiple View Geometry in Computer Vision, chapter 8, pages 205, 216–219. Cambridge Univ. Press, 2nd edition, 2003.
[15] R. Hartley and A. Zisserman. Projective Geometry and Transformations of 2D. In Multiple View Geometry in Computer Vision, chapter 2, pages 51–52. Cambridge University Press, 2nd edition, 2003.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[17] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In 10th IEEE International Conference on Computer Vision, volume 1, pages 654–661, Oct 2005.
[18] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet. Lyft Level 5 Dataset. https://level5.lyft.com/dataset/, 2019.
[19] M. Khader and S. Cherian. An Introduction to Automotive LIDAR (Texas Instruments). Technical report, 2018.
[20] H. Khan, A. Rafaqat, A. Hassan, A. Ali, W. Kazmi, and A. Zaheer. Lane detection using lane boundary marker network under road geometry constraints. In IEEE Winter Conference on Applications of Computer Vision (Snowmass, CO). IEEE, 2020.
[21] V. Kolmogorov and R. Zabih. Multi-camera Scene Reconstruction via Graph Cuts. In Proceedings of the 7th European Conference on Computer Vision-Part III, ECCV '02, pages 82–96, London, UK, 2002. Springer-Verlag.
[22] J. N. Kundu, P. K. Uppala, A. Pahuja, and R. V. Babu. AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
[23] L. Ladicky, C. Häne, and M. Pollefeys. Learning the Matching Function. CoRR, abs/1502.0, 2015.
[24] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. IEEE Conference on Computer Vision and Pattern Recognition, pages 2136–2143, 2009.
[25] Z. Li and N. Snavely. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In IEEE Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[26] D. Liebowitz, A. Criminisi, and A. Zisserman. Creating Architectural Models from Images. Computer Graphics Forum, 18(3):39–50, 1999.
[27] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes. In International Conference on Computer Vision and Pattern Recognition. IEEE Comput. Soc, 1998.
[28] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1253–1260, Jun 2010.
[29] J. Oberhammer, N. Somjit, U. Shah, and Z. Baghchehsaraei. RF MEMS for automotive radar. In D. Uttamchandani, editor, Handbook of MEMS for Wireless and Mobile Applications, Woodhead Publishing Series in Electronic and Optical Materials, chapter 16, pages 518–549. Woodhead Publishing, 2013.
[30] J. Pan, M. Hebert, and T. Kanade. Inferring 3D layout of building facades from a single image. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2918–2926, 2015.
[31] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang. Spatial As Deep: Spatial CNN for Traffic Scene Understanding. In AAAI Conference on Artificial Intelligence, pages 7276–7283, 2018.
[32] K.-Y. Park and S.-Y. Hwang. Robust Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System. The Scientific World Journal, (923632):9, 2014.
[33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
[34] A. Saxena, S. H. Chung, and A. Y. Ng. Learning Depth from
Single Monocular Images. In Proceedings of the 18th Inter-
national Conference on Neural Information Processing Sys-
tems, NIPS’05, pages 1161–1168, Cambridge, MA, USA,
2005. MIT Press.
[35] F. H. Sinz, J. Q. Candela, G. H. Bakir, C. E. Rasmussen,
and M. O. Franz. Learning Depth from Stereo. In C. E.
Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese,
editors, Pattern Recognition, pages 245–252, Berlin, Heidel-
berg, 2004. Springer Berlin Heidelberg.
[36] P. Sturm and S. Maybank. A Method for Interactive 3D
Reconstruction of Piecewise Planar Objects from Single Im-
ages. In British Machine Vision Conference, pages 265–274,
1999.
[37] TuSimple. TuSimple Velocity Estimation Challenge. https://github.com/TuSimple/tusimple-benchmark/tre…, 2017.
[38] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox,
and A. Geiger. Sparsity Invariant CNNs. In International
Conference on 3D Vision (3DV), 2017.
[39] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, Jun 2015.
[40] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learn-
ing Depth from Monocular Videos using Direct Methods.
In IEEE Computer Vision and Pattern Recognition (CVPR),
2018.
[41] Z. Yin and J. Shi. GeoNet: Unsupervised Learning of Dense
Depth, Optical Flow and Camera Pose. In IEEE Conference
on Computer Vision and Pattern Recognition, 2018.
[42] A. Zaheer and S. Khan. 3D Metric Rectification using An-
gle Regularity. IEEE Winter Conference on Applications of
Computer Vision, pages 31–36, 2014.
[43] A. Zaheer, M. Rashid, M. A. Riaz, and S. Khan. Single-View
Reconstruction using orthogonal line-pairs. Computer Vision
and Image Understanding, 172:107–123, 2018.
[44] J. Žbontar and Y. LeCun. Stereo Matching by Training a
Convolutional Neural Network to Compare Image Patches.
Journal of Machine Learning Research, 17(1):2287–2318,
jan 2016.
[45] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and
I. Reid. Unsupervised Learning of Monocular Depth Esti-
mation and Visual Odometry with Deep Feature Reconstruc-
tion. In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2018.
[46] Y. Zhang and T. Funkhouser. Deep Depth Completion of a
Single RGB-D Image. In IEEE Computer Vision and Pattern
Recognition (CVPR), 2018.