
Real-time vehicle distance estimation using single view geometry

Ahmed Ali, Ali Hassan, Afsheen Rafaqat Ali, Hussam Ullah Khan, Wajahat Kazmi, Aamer Zaheer
KeepTruckin Inc.
Lahore, Pakistan
{Ahmed.Ali, Ali.Hassan, Afsheen, Hussam.Khan, Wajahat.Kazmi, AZ}@KeepTruckin.com

Abstract

Distance estimation is required for advanced driver assistance systems (ADAS) as well as for self-driving cars. It is crucial for obstacle avoidance, tailgating detection and accident prevention. Currently, radars and lidars are primarily used for this purpose, but they are either expensive or offer poor resolution. Deep learning based depth or distance estimation techniques require huge amounts of data, and ensuring domain invariance is a challenge. Therefore, in this paper, we propose a single view geometric approach which is lightweight, uses geometric features of the road lane markings for distance estimation, and integrates well with the lane and vehicle detection modules of an existing ADAS. Our system introduces novelty on two fronts: (1) it uses cross-ratios of lane boundaries to estimate the horizon; (2) it determines an Inverse Perspective Mapping (IPM) and the camera height from a known lane width and the detected horizon. Distances of the vehicles on the road are then calculated by back-projecting an image point to a ray intersecting the reconstructed road plane. For evaluation, we used lidar data as ground truth and compare the performance of our algorithm with radar as well as with state-of-the-art deep learning based monocular depth prediction algorithms. The results on three public datasets (KITTI, nuScenes and Lyft Level 5) show that the proposed system maintains a consistent RMSE between 6.10 and 7.31 m. It outperforms the other algorithms on two of the datasets, while on KITTI it falls behind one (supervised) deep learning method. Furthermore, it is computationally inexpensive and data-domain invariant.

Figure 1. Sample image from nuScenes [3]. The estimated horizon (green line) and the distances (in meters) computed by all methods for each vehicle (1, 2 and 3) from the ego vehicle are displayed. Our method is compared against deep learning based depth estimation as well as radar. GT is the ground truth, which is provided by the high resolution lidar.

1. Introduction

Human error is one of the major causes of driving accidents. At the heart of safe driving practices lies the driver's judgment to estimate their vehicle's braking distance w.r.t. other road users. For this reason, Advanced Driver Assistance Systems (ADAS) have evolved substantially over the last few years, and features such as pedestrian and tailgating detection have been introduced to alert drivers in the run-up to perilous maneuvers. Real-time logging of this information is crucial in this regard and can help fleet managers in scoring a driver's behavior. It is also an integral part of autonomous driving in path planning, obstacle and collision avoidance, and hence in the overall safety assessment of self-driving cars.

Active sensing in the form of radar (Radio Detection and Ranging) and lidar (Light Detection and Ranging) is a common choice for measuring distances to surrounding obstacles. Radars have ranges of up to 150 m [29] but generally have low resolution. Lidars, on the other hand, offer higher spatial resolution [19] but are too expensive (the cheapest costs US$ 7,500 [1]). The biggest advantage of active sensing is its real-time operation, as minimal post-processing is required: scene depth is estimated in a quick scan, and simple thresholding may do the job. While improvements in sensor technology and mass production are expected to reduce the cost of lidars by 2022 [19], they still suffer from compromised performance in adverse weather conditions such as rain, snow and fog. In contrast, standard dashcams provide low cost and high resolution RGB images of the scene, but demand intelligent post-processing using robust algorithms and higher compute power on board.

In computer vision, the problem of distance estimation falls in the broader area of scene depth estimation. To this end, various data-driven and geometric approaches have been explored in the literature. In data-driven approaches, the depth of objects is learnt through either training a supervised depth regressor [10] or an unsupervised disparity estimator [13, 45]. Geometric approaches, on the other hand, take into account the geometry of the features in the scene, as in Multi-View Stereo (MVS) and Single View Metrology [43]. MVS has remained a primary focus for depth recovery in the past [21]. It provides depth for each pixel via dense matching of more than one image. MVS is limited by the availability of texture in the scene, the efficiency of the algorithms and the complexity of multi-camera configurations, and the accuracy of the depth map may not be consistent throughout. Single view metrology, instead, uses a single image to compute 3D measurements using perspective geometry [7]. Being ill-posed, it has remained limited to images of structured environments where a combination of parallel and orthogonal lines and surfaces is substantially available. In this paper, we use monocular dashcam images to estimate relative distances of vehicles on the road. For evaluation, we used lidar data as ground truth and established a comparison of the data-driven approaches with the radar data. We propose an efficient and data-independent algorithm for real-time distance estimation on flat road surfaces using single view geometry. With the help of any available lane and vehicle detection modules (which are integral components of ADAS and self-driving cars), it fully exploits the structure in the scene. We will show that it competes well with deep learning methods and, under its geometric constraints, outperforms end-to-end deep learning based methods as well as the radar data both in accuracy and efficiency, making it suitable for edge devices. Since autonomous driving is currently limited to highway driving scenarios, it is well suited for such tasks. And except for the camera initialization phase, which is a one-time computation, it comes at almost no computational expense.

The breakdown of the paper is as follows: Section 2 provides a quick review of the depth estimation literature. Section 3 discusses our approach in detail, with the ADAS modules for lane and vehicle detection in Section 3.1, camera initialization in Section 3.2 and distance estimation in Section 3.3. Section 4 outlines the datasets, training and testing. Results and a discussion of the results are presented in Section 4.1, and Section 5 concludes the paper.

2. Related work

In the literature, only a few works have employed approaches similar to ours. Park and Hwang [32] used monocular imaging and the average real-world vehicle width to estimate the horizon. Following that, the distance of the bottom of the vehicle's detected bounding box from the horizon provided a hint of its relative distance. However, they assumed an already known camera height, which can be difficult to obtain for a multitude of vehicles across a fleet. In 2017, the top performing submission to TuSimple's velocity estimation challenge employed deep learning based depth prediction techniques combined with object (vehicle) detection to find relative velocities [37]. Keeping this in view, we review the research done in the two relevant domains, i.e., single view geometry for depth estimation (our proposal) and state-of-the-art deep learning based depth prediction.

2.1. Single view geometry based depth estimation

In computer vision, single view based recovery of scene geometry has been thoroughly studied [4, 36, 30, 17]. Liebowitz et al. used geometric properties such as parallelism and orthogonality of structures to recover 3D models from one or two views without explicitly knowing the metric properties of objects [26]. Lee et al. showed that, even in the presence of clutter, a full 3D model can be recovered from a single image using geometric reasoning on detected line segments produced by planar surfaces [24]. Recently, Zaheer et al. used orthogonal lines in a multi-planar single view to recover a full 3D reconstruction of the scene [43]. These approaches are particularly interesting for indoor man-made environments or for outdoor scenes with man-made structures around, because human architecture involves a profusion of parallel and orthogonal lines emanating from planar surfaces.

Apart from 3D model recovery from a single view, substantial work has also been done in single view metrology. Liebowitz and Zisserman proposed metric rectification of planes from a single perspective view and used it to measure angles, lengths and ratios under the constraints of one known angle, two unknown but equal angles and a known length ratio [27]. Zaheer and Khan showed that projectively distorted views of man-made 3D structures can be metric rectified using orthogonal plane-pairs [42]. Liu et al. estimated depth from a single view given semantic segmentation under geometric constraints [28]. On this topic, Criminisi et al. explain single and multi-view metric reconstructions in detail, especially determining camera pose, computing the distance between planes parallel to a reference plane, and finding area and length ratios on any plane parallel to the reference plane [7, 6].

2.2. Data-driven depth prediction

Digressing from the geometric approach, data-driven depth prediction techniques have used both supervised and unsupervised learning for stereo as well as monocular images.
Sinz et al. used stereo images in Gaussian process regression to estimate a pixel-wise depth map, avoiding the complications of stereo calibration [35]. Saxena et al. used monocular images of a variety of unstructured outdoor scenes to train Markov Random Fields (MRFs) on both local and global features for the recovery of fairly accurate depth maps [34]. Ladicky et al. designed a stereo matching classifier for predicting the likelihood that a pixel in one image matches the pixel offset by the disparity in the second image [23].

Owing to their limited success, data-driven approaches to depth estimation did not attract much attention until the advent of deep learning. Deep learning based depth estimation made a huge impact in terms of holistic and accurate depth prediction, so much so that a significant number of contributions have surfaced over a short span of time [40, 46, 25, 41, 45]. Žbontar and LeCun trained a small convolutional neural network for matching stereo pairs instead of handcrafted feature matching and showed results that are better in terms of both the accuracy of the depth maps and efficiency [44]. Eigen et al. used two deep networks in cascade [8] to predict high-level global depth followed by refinement to low-level local depth in a supervised regression framework. This approach acquired a lot of traction and several improvements followed, such as the addition of Conditional Random Fields (CRFs) by Li et al. [39]. Cao et al. transformed monocular depth estimation from regression into a supervised classification problem [5]. Fu et al. used an ordinal regression loss in supervised training to eliminate slow convergence and sub-optimal minima [10]. Their method is the best performing one on the KITTI benchmark [2], with iRMSE 12.98 (1/km) and RMSE 2.6 (m).

Supervised learning requires ground truth depth data, which in the case of deep learning becomes challenging to acquire. Therefore, some unsupervised approaches have recently appeared that perform exceptionally well. Garg et al. proposed a convolutional encoder network in which they use one image of a stereo pair and predict a disparity map [11]. An inverse warp of the second image is then produced from the known inter-view displacement and the disparity map, which is then used to reconstruct the first image. The photometric error in the reconstruction defines the network loss. An improvement to this monocular depth learning from stereo pairs was introduced by Godard et al. by integrating a left-right consistency check [13]. Kundu et al. employed unsupervised adversarial learning for depth prediction [22].

3. Our approach

We use single view geometry for distance estimation. As argued earlier, single view geometry depends on the perspective of the parallel or orthogonal lines and planes in the image. As lane lines are generally parallel, we exploit this parallelism. We start with unknown camera extrinsics, because this information is usually difficult to obtain and may change from time to time. Instead, we propose an automated way of finding the camera height using the horizon obtained through cross-ratios of parallel lines [15]. Determination of the camera height completes the camera initialization, which only needs to be done once for each rigid camera assembly. The road plane is then reconstructed, and the distance to other vehicles on the road can be calculated through simple math. A block diagram of our full pipeline is shown in Figure 2. For camera initialization, we assume a straight and flat road surface. Lane boundaries and vehicles can be detected using any off-the-shelf lane and vehicle detection algorithms. In the following section, we briefly describe the lane and vehicle detectors used for the distance estimation evaluation in this paper.

Figure 2. Block diagram of our algorithm. Video frames enter an initialization block (lane boundary marker detection, horizon estimation, IPM computation, image rectification, parallel line fitting and camera height estimation), which yields the camera height and road plane normal; after initialization, frames go to the post-initialization block (vehicle bounding box detection and distance estimation).

3.1. Lane and vehicle detection

To exploit the road geometry, detection of lane boundaries is a crucial step, for which we used the lane detection network by Khan et al. [20]: a fully convolutional deep network that detects keypoints sampled along the lane boundaries. The network architecture is shown in Figure 3(a). It consists of an encoder block (ResNet50 [16] in this case), responsible for feature extraction, and a decoder block that learns the relationships among the extracted features and makes decisions on top of them. The decoder block comprises four convolutional layers and a single transpose-convolution layer (which upsamples the feature map by 4x), followed by the output convolution layer. In the output layer, just like in Mask-RCNN, a one-hot mask is generated for each keypoint, where only the keypoint pixel is encoded (marked high, or foreground).

For training the lane detection network, the CULane [31] dataset was used. Every ground truth lane boundary was sampled into 40 keypoints, a number chosen as the best possible trade-off between efficiency and accuracy [20], with a greater density of points closer to the road plane horizon (Figure 3(b)). This is done by splitting every lane into three vertical segments and sampling equally spaced keypoints (20, 10 and 10) from each segment, starting from the one closest to the horizon. At inference time, a line/curve is fitted to the predicted keypoints using RANSAC [9], as sketched below. A best-fit line must have a minimum number of inlier markers, which in the case of 40 markers per lane boundary was set to 10.
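As an illustration of this inference-time fitting step, the sketch below fits a single lane-boundary line to predicted keypoints with a basic RANSAC loop and the 10-of-40 inlier threshold mentioned above. It is a minimal NumPy sketch under our own naming conventions (the system described in the paper fits lines/curves with RANSAC [9], but its code is not part of the paper), so treat the function names and parameter values as illustrative assumptions.

```python
import numpy as np

def fit_lane_line_ransac(keypoints, n_iters=200, inlier_tol=3.0, min_inliers=10, seed=0):
    """Fit a line a*x + b*y + c = 0 to lane keypoints (Nx2, pixels) with RANSAC.

    Returns (a, b, c) normalized so that a^2 + b^2 = 1, or None if fewer than
    `min_inliers` keypoints support the best hypothesis (10 of 40 in the paper).
    """
    rng = np.random.default_rng(seed)
    pts = np.asarray(keypoints, dtype=float)
    best_line, best_inliers = None, 0
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        p, q = pts[i], pts[j]
        # Line through two points in homogeneous coordinates: l = p x q.
        l = np.cross(np.append(p, 1.0), np.append(q, 1.0))
        norm = np.hypot(l[0], l[1])
        if norm < 1e-9:            # degenerate sample (coincident points)
            continue
        l = l / norm
        # Point-to-line distances for all keypoints.
        d = np.abs(pts @ l[:2] + l[2])
        inliers = int(np.sum(d < inlier_tol))
        if inliers > best_inliers:
            best_inliers, best_line = inliers, l
    if best_inliers < min_inliers:
        return None                # reject: not enough supporting markers
    return best_line
```

A least-squares refit on the surviving inliers would normally follow; the parallel-line variant used after rectification is covered in Section 3.2.4.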

Figure 3. (a) The architecture of our lane detection network. The encoder is ResNet50 and the decoder consists of 4 convolution layers and 1 transpose-convolution layer, followed by a fully convolutional output layer with one channel for each keypoint. (b) The input image with the keypoints sampled from the ground truth lane line overlaid; the number of keypoints increases towards the vanishing point.

For vehicle localization, any detection framework can be used. We fine-tuned a COCO pre-trained Faster-RCNN model [33] with a ResNet-50 backbone on the KITTI vehicle dataset [12].

3.2. Camera initialization

Camera initialization involves finding the camera height and the road plane normal. This is our first contribution in this paper: along with the camera intrinsics, we use the road geometry and lane width, which have not been used before in this regard. Integrating these known priors makes the initialization self-sufficient and fully automatic. It is a one-off calculation for every camera and stays valid until the camera is displaced or replaced altogether. It involves the following sub-modules.

3.2.1 Inverse Perspective Mapping

Inverse Perspective Mapping (IPM) is used to remove the perspective distortion in an image. Applying an IPM synthetically rotates the camera into a top (bird's eye) view, thus rendering the intersecting lines parallel, the way they actually are in the real world. This is needed in order to find the camera height, which in turn is required for distance estimation. We therefore introduce a novel, accurate and robust way of estimating an IPM by exploiting road geometry with no prior knowledge of the camera pose.

Figure 4. (Left) A point pi at the bottom of the detected vehicle is back-projected to a ray γr intersecting the road plane (normal vector n̂) at a point, which gives the distance dr of the point from the camera on the road plane. (Right) Synthetically rotated camera view used to estimate the camera height hc given the focal length fc, the lane width LWp in the rectified image and the lane width LWr in the real world.

Our approach is aimed at a forward looking dashcam installed inside the driver's cabin. The configuration uses a right-handed frame of reference: the road plane normal n̂ is along the Z-axis (upwards), the Y-axis is in the driving direction and the X-axis points towards the right side of the driver, as shown in Figure 4 (Left).

The estimation of the IPM uses a planar road scene with no elevation and more than one lane (3 or more lane boundaries), such as a typical highway scenario (Figure 5). The critical part of initialization is the estimation of the vanishing line, or horizon, which requires at least three parallel lines in the rectified image.

3.2.2 Horizon estimation

The horizon must be known to estimate the camera rotation required for an IPM. A horizon is the vanishing line of a plane which, in our case, is the road. In order to find the vanishing line, we need two vanishing points. The first one can be calculated simply through the cross product of any two lane lines expressed in homogeneous coordinates (see Figure 5 (a)). We call it the forward vanishing point (V). Finding a second vanishing point, however, is a bit more involved.

According to Hartley and Zisserman, Chapter 2 [14], if there are three co-linear points a′, b′ and c′ in an image (as shown in Figure 5 (a)) with a known real-world length ratio (ab/bc) of their corresponding real-world points a, b and c (as shown in Figure 5 (b)), we can find a point on the vanishing line of the plane on which these points lie. Let us call this point the lateral vanishing point, V′.

We take this concept and map it to our problem, i.e., the road plane. For this, we need three co-linear points lying on the road plane with known real-world length ratios.
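To make the two vanishing-point computations concrete, the sketch below shows (i) the forward vanishing point V as the cross product of two lane lines in homogeneous coordinates and (ii) the known-length-ratio construction from [14] for the lateral vanishing point V′, realized here as a 1D projective fit along the cutting line, which is equivalent to the cross-ratio construction. This is an illustrative NumPy sketch with our own function names, not the authors' code; how the length ratio maps to the lane widths d1/d2 is detailed in the continuation below.

```python
import numpy as np

def forward_vanishing_point(line1, line2):
    """Forward VP as the cross product of two lane lines (homogeneous 3-vectors)."""
    v = np.cross(line1, line2)
    return v / v[2]                                  # dehomogenize (assumes a finite VP)

def lateral_vanishing_point(a_img, b_img, c_img, ratio_ab_bc=1.0):
    """Point on the road's vanishing line from three collinear image points
    a', b', c' whose real-world counterparts have a known length ratio ab:bc
    (Hartley & Zisserman [14], Ch. 2).  ratio_ab_bc = d1/d2 = 1 for equal lanes."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a_img, b_img, c_img))
    u = (c - a) / np.linalg.norm(c - a)              # unit direction of the image line
    t = np.array([0.0, (b - a) @ u, (c - a) @ u])    # 1D image coordinates of a', b', c'
    x = np.array([0.0, ratio_ab_bc, ratio_ab_bc + 1.0])  # 1D world coordinates (up to scale)
    # 1D homography t = (alpha*x + beta) / (gamma*x + delta): one linear equation
    # per correspondence, solved as the null space of M via SVD.
    M = np.stack([x, np.ones(3), -t * x, -t], axis=1)
    _, _, vt = np.linalg.svd(M)
    alpha, beta, gamma, delta = vt[-1]
    t_inf = alpha / gamma                            # image of the world point at infinity
    vp = a + t_inf * u
    return np.array([vp[0], vp[1], 1.0])             # homogeneous image point V'
```

The horizon lh is then the line joining the two points, e.g. np.cross(V, V_prime), with the per-frame vanishing points combined across a video clip by RANSAC as in Algorithm 1.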

Since we already have detected lane lines in the image, which lie on the road plane, we can find three such co-linear points by drawing any line l′ which intersects all three lane lines at points a′, b′ and c′, respectively.

The points a′, b′ and c′ are in fact the images of their real-world counterparts a, b and c, which lie on a line l as shown in Figure 5 (b); the line l′ is the image of the line l. In order to find the lateral vanishing point of the road plane, according to the description above, we need to know the length ratios between these real-world points. Because the three points lie on the three (parallel) lane lines l1, l2 and l3, it can be established through similar triangles that the ratio of the line segments ab/bc is equal to d1/d2, where d1 and d2 are the distances between lines l1, l2 and l2, l3, respectively, as shown in Figure 5 (b). Here d1 and d2 are the lane widths.

Although the algorithm still works if neighbouring lanes do not have the same width (only the ratio of the widths, d1/d2, needs to be known), we assume the ratio to be 1 (d1 = d2 = lane width), since lane lines are generally equidistant. Since the initialization is required only once per camera, it can be done on a highway, or on a section of the highway where the lanes are consistent/equidistant (selected using GPS).

After getting the two vanishing points (VPs) per frame, we collected all the sets of vanishing points across all the frames of a video clip. The horizon was then found as the best fitting line to the VPs, as described in Algorithm 1.

Figure 5. (a) Perspective view of the road plane. Lane lines (l1′, l2′ and l3′) are shown in red and an arbitrary line l′ (blue) cuts all lane lines at points a′, b′ and c′. The forward vanishing point (V) is obtained by the cross product of any two lane lines. (b) Real-world top-down view of the road plane. d1 and d2 are the distances between lane lines l1, l2 and l2, l3 respectively. Since all lane lines are parallel, △1 and △2 are similar triangles, which implies ab/bc = d1/d2.

Algorithm 1: Best fitting horizon (lh) estimation
  Input: Markers detected by the lane detector
  Output: Estimated horizon line
  // Find the forward and lateral VPs for each frame in a video:
  initialize;
  for each frame i in the video do
      fit lines to the keypoints (B = number of lines);
      if B >= 2 then
          for each of the C(B,2) line pairs j do
              forward_vp(i,j) = crossprod(line 1, line 2)
          end
      end
      if B >= 3 then
          for each of the C(B,3) line triplets k do
              get lateral_vp(i,k) using cross-ratios as described in Section 3.2.2
          end
      end
  end
  // RANSAC for the best fitting horizon across the video:
  initialize;
  while iterations <= max_iterations and num_inliers <= max_inliers do
      s = random int between 1 and num_forward_vps;
      t = random int between 1 and num_lateral_vps;
      fit a line lh to forward_vp(s) and lateral_vp(t);
      if dist(lh, forward_vps) <= threshold_s and dist(lh, lateral_vps) <= threshold_t then
          record and count the inlier points
      end
  end

3.2.3 Image rectification

Now that the horizon has been determined, the IPM can be found. We compute the road plane normal n̂ from the camera intrinsic matrix K and the horizon lh [14] as n̂ = K^T lh. The IPM is then computed as H_IPM = K R K^{-1}, where R = [lh × n̂; (lh × n̂) × (−n̂); −n̂] is the rotation matrix and H_IPM is the rectification homography, which rotates the camera view to align its Z-axis with the road's normal as shown in Figure 4 (Right). Frames can now be rectified into a bird's eye view simply by applying this IPM (Figure 4 (Right)).

3.2.4 Parallel line fitting

Parallel lines can be represented as

    a1 x + b1 y + c1 = 0,    a2 x + b2 y + c2 = 0,

where the coefficients satisfy a1 = a2 and b1 = b2, and only the intercepts c1 and c2 differ. Once the minimum-number-of-inliers constraint is satisfied, parallel lines are fitted to the inlier markers by least squares, solving Ax = 0. Here A stacks the inlier points of each parallel line in the rectified view, and the system is solved using singular value decomposition, taking the singular vector of the least singular value.
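A compact sketch of Sections 3.2.3 and 3.2.4, assuming a 3×3 intrinsic matrix K and the estimated horizon lh as a homogeneous 3-vector: the rectification homography is assembled following the row construction given above, lane keypoints are warped into the bird's-eye view, and two lane boundaries are then fitted jointly with a shared direction (a1 = a2, b1 = b2) by taking the smallest-singular-value solution of Ax = 0. Variable names and the normalization convention are our own assumptions.

```python
import numpy as np

def rectification_homography(K, horizon):
    """IPM homography H = K R K^-1 that aligns the camera Z-axis with the
    road normal, with n_hat proportional to K^T lh (Section 3.2.3)."""
    n = K.T @ horizon
    n_hat = n / np.linalg.norm(n)
    x_axis = np.cross(horizon, n_hat)           # lh x n_hat
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(x_axis, -n_hat)           # (lh x n_hat) x (-n_hat)
    y_axis /= np.linalg.norm(y_axis)
    R = np.stack([x_axis, y_axis, -n_hat])      # rows of the rotation matrix
    return K @ R @ np.linalg.inv(K)

def warp_points(H, pts):
    """Apply a homography to Nx2 pixel coordinates."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

def fit_two_parallel_lines(pts1, pts2):
    """Least-squares fit of a*x+b*y+c1=0 and a*x+b*y+c2=0 to two sets of
    rectified inlier keypoints.  Unknowns h = (a, b, c1, c2); each point
    contributes one row of A, and Ah = 0 is solved by SVD."""
    rows1 = np.hstack([pts1, np.ones((len(pts1), 1)), np.zeros((len(pts1), 1))])
    rows2 = np.hstack([pts2, np.zeros((len(pts2), 1)), np.ones((len(pts2), 1))])
    A = np.vstack([rows1, rows2])
    _, _, vt = np.linalg.svd(A)
    a, b, c1, c2 = vt[-1]
    scale = np.hypot(a, b)                      # normalize so a^2 + b^2 = 1
    return np.array([a, b, c1]) / scale, np.array([a, b, c2]) / scale
```

With the returned lines normalized so that a² + b² = 1, the per-frame lane width in pixels is simply |c2 − c1|, which is the quantity that Eq. (1) in Section 3.2.5 averages over the inlier frames.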

Figure 6. Sample results on KITTI [12] (top) and nuScenes [3] (bottom). Displayed numbers are the distances in meters from the ground truth (gt, lidar), our method, MonoDepth, DORN and, for nuScenes, radar. Vehicles are numbered for clarity.

 
3.2.5 Lane Width Initialization

In the rectified view of a given frame, a pair of consecutive parallel lines is selected and the distance between the lines is computed; averaging over frames gives

    LWp = ( Σ_f |c2 − c1| / sqrt(a^2 + b^2) ) / f ,        (1)

where f is the number of inlier frames and LWp is the average initialized lane width in pixels. A frame is considered an inlier frame if it has at least one forward and one lateral vanishing point.

3.2.6 Camera height estimation

Now that we have the lane width LWp in pixels, we can use the camera focal length fc to compute the viewing angle α subtended by the lane using trigonometry:

    α = 2 · arctan( (LWp / 2) / fc ).        (2)

Once the camera viewing angle α is known, the camera height hc is given by

    hc = (LWr / 2) / tan(α / 2).        (3)

Here LWr is the lane width in the real world; since lane widths are usually standard across highways, LWr is generally known. Determining the camera height completes the camera initialization. This is depicted in Figure 4 (Right).

3.3. Distance estimation

Once the camera is initialized, we know the camera height and the road plane normal. We can then reconstruct the road plane as

    πr = [ n̂ ; −hc ],        (4)

where n̂ is the road plane normal and πr is the 4×1 reconstructed road plane vector. A ray through a point pi in the image plane of the camera is given by γr = K^{-1} pi. The distance from the image plane to the point where this ray intersects the reconstructed road plane in the real world can then be found by solving

    πr^T [ dr·γr ; 1 ] = 0,        (5)

where dr is the distance at which γr intersects the plane:

    dr = hc / (n̂ · γr).        (6)

This is depicted in Figure 4 (Left). It is this simple arithmetic calculation that estimates the distance of the detected objects on the reconstructed road plane, which makes the method computationally inexpensive. A shift in the horizon, which can be caused by a changing slope of the road, can introduce differences in the observed distances. The best way to investigate the extent of this error is to evaluate our algorithm on publicly collected datasets and compare it against the computationally expensive but state-of-the-art deep learning approaches.

Algorithm        δ<1.25   δ<1.25²   δ<1.25³   Abs.Rel   RMSE    RMSE log   Sq.Rel
(a) KITTI dataset
DORN [10]        0.92     0.96      0.98      0.11      2.44    0.18       0.44
MonoDepth [13]   0.82     0.94      0.98      0.13      6.34    0.21       1.36
Ours             0.53     0.89      0.98      0.29      6.24    0.30       2.02
(b) nuScenes dataset
DORN [10]        0.55     0.85      0.96      0.25      7.43    0.33       1.83
MonoDepth [13]   0.56     0.81      0.93      0.23      8.38    0.36       2.16
Radar            0.73     0.80      0.85      0.48      11.66   0.71       13.25
Ours             0.78     0.94      0.97      0.15      6.10    0.24       1.29
(c) Lyft Level 5 dataset
DORN [10]        0.11     0.30      0.67      0.73      11.43   0.57       7.91
MonoDepth [13]   0.03     0.12      0.45      0.50      12.23   0.77       5.50
Ours             0.69     0.87      0.93      0.17      7.31    0.32       1.53

Table 1. Distance estimation results on (a) KITTI, (b) nuScenes and (c) Lyft Level 5. The metrics are explained in [8, 13]; for the δ thresholds higher is better, for the remaining metrics lower is better. The best scores per dataset and our algorithm's scores are highlighted in the original typeset table.
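For reference, the metrics reported in Table 1 follow the standard monocular depth evaluation protocol of [8, 13]. A minimal sketch (our own implementation, not the authors') over matched per-vehicle predicted and ground-truth distances:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth/distance metrics from [8, 13] over matched vehicles.
    pred, gt: 1-D arrays of distances in meters (gt from lidar)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta<1.25":   float(np.mean(ratio < 1.25)),
        "delta<1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "delta<1.25^3": float(np.mean(ratio < 1.25 ** 3)),
        "AbsRel":       float(np.mean(np.abs(pred - gt) / gt)),
        "SqRel":        float(np.mean((pred - gt) ** 2 / gt)),
        "RMSE":         float(np.sqrt(np.mean((pred - gt) ** 2))),
        "RMSElog":      float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
    }
```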

4. Experimental setup

To evaluate our distance estimation algorithm, we used the well known KITTI [12] and nuScenes [3] datasets and the recently released Lyft Level 5 dataset [18]. All the datasets provide color images along with lidar data, which is used as the distance ground truth; nuScenes also provides front radar data. In order to localize vehicles in the scenes, we used our vehicle detector as described in Section 3.1. During the assessment, among overlapping vehicle boxes only the foremost box was selected, hence only un-occluded vehicles were used. A maximum distance cap of 70 m was set, since some lidars do not perform well at larger distances; besides, 70 m of braking distance provides enough room for a fair assessment.

Distances are in meters and we used the metrics defined in [8, 13], namely the ratio threshold δ, Absolute Relative difference (Abs Rel), Squared Relative difference (Sq Rel), RMSE (linear) and RMSE log. Apart from our algorithm, which is based on single view geometry, we also compared the distance estimates of a supervised deep learning based method, DORN by Fu et al. [10], which has shown the best performance on the KITTI depth estimation benchmark [2]; we used their model trained on the KITTI color images with lidar as ground truth. From the unsupervised deep learning methods, we used MonoDepth by Godard et al. [13], with the model shared by the authors, which is trained on KITTI's stereo data. MonoDepth predicts disparity, which is converted into depth using the stereo configuration used in the data acquisition (depth in meters = focal length × baseline in meters / disparity). Lidar and radar data were projected onto the image plane using the camera matrix to produce depth maps. The vehicle depth was computed by averaging all the non-zero depth values inside the vehicle box for lidar, MonoDepth and DORN; the detected box was shrunk by 50% in height and width about the same center in order to reduce the influence of the background. For radar, however, we took the non-zero minimum value inside the full-sized box due to the sparsity of the data (this works best for radar). For our algorithm, we took the middle point of the bottom edge of the full-sized box for distance estimation, because our algorithm requires the point to lie on the road plane.

Figure 7. Sample images from nuScenes [3]. Displayed numbers are the distances in meters. The horizon is sub-optimal compared to Figure 1 due to (a) a downward slope and (b) an upward slope; recalculation is required in these cases.

4.1. Results and Discussion

4.1.1 Tests on KITTI

The KITTI dataset includes images from two color 1.4 MP PointGrey Flea2 cameras (the left one is used in these experiments) and a Velodyne HDL-64E lidar; the lidar data was mapped to the cameras. It does not have radar data. We used its six video clips in the KITTI validation split that have both the lidar and the corresponding video data.

Table 1 (a) shows the results and the top row of Figure 6 shows some sample KITTI results. Based on the RMSE, we can see that the supervised depth estimation by DORN performed much better than either MonoDepth or our algorithm. Both data driven approaches, DORN and MonoDepth, were trained on the training and quite possibly the validation sets of KITTI before the authors submitted their results to the KITTI test set benchmark [2]. Please note that the KITTI data was recorded by driving around the city of Karlsruhe, Germany, in rural areas as well as on highways [38]. Although our algorithm fell behind DORN in performance, it still had a lower RMSE than the unsupervised deep learning model (MonoDepth).
Therefore, even though it is aimed at highway driving scenarios, it still competes with the state-of-the-art data driven approaches on a variety of road types. For a fair assessment of the robustness and domain invariance of the algorithms under consideration, we also tested these models, along with our algorithm, on other public datasets.

4.1.2 Tests on nuScenes

The nuScenes dataset contains recorded videos of mainly city scenes. It has 1000 video clips, of which 150 comprise the validation set that we used in this assessment. The dataset contains clips from multiple cameras installed on the vehicle; we used the front mounted camera (CAM FRONT) for evaluation. Table 1 (b) shows the results.

The results show that, unlike on KITTI, the deep learning networks, both supervised and unsupervised, do not perform as well. Our algorithm, however, performs better than all of them, including radar, in both δ (0.78) and RMSE (6.10). The bottom row of Figure 6 shows some nuScenes results where the horizon was estimated correctly and our algorithm performed well. In contrast, Figure 7 displays examples with sub-optimal horizon estimation and hence incorrect distance estimates, which is a limitation of our system.

4.1.3 Tests on the Lyft Level 5 dataset

The Lyft dataset is the newest and the largest of all the datasets. The setup is similar to nuScenes. We used the data from the forward looking Cam0, which includes 22680 sample images collected on relatively planar roads (as opposed to nuScenes), making it the largest dataset in our evaluation setup. Table 1 (c) shows the results. Since MonoDepth and DORN are trained on KITTI, their performance is compromised on a new camera and its settings, with RMSEs of 12.23 and 11.43, respectively. Our algorithm, however, outperforms them with an RMSE of 7.31.

4.2. Processing efficiency

In the USA, traffic rules require 3 to 5 seconds of headway from the vehicle in front. Therefore, for an ADAS system, real-time operation is imperative, as the response time is very short and the compute power onboard is very limited. As argued earlier, this is one of the reasons for using lidars and radars. In this regard, we compared our algorithm with the others in terms of processing time; the results are shown in Table 2. Deep networks require high compute power. As far as our method is concerned, there are no specific processing requirements: it is just a single division and a dot product, which takes 2.31e−05 secs on a Core i7 CPU. The initialization part can use any available lane detection framework, which will be in operation anyway within an ADAS or self-driving car. Our lane detection network discussed in Section 3.1 runs at 0.125 secs/frame. But it is a one-off calculation over a limited number of frames, and re-initialization will not be required until the camera configuration is disturbed. This brings our algorithm on par with radar and lidar data processing without demanding any excessive compute power on onboard edge devices.

Algorithm    DORN    MonoDepth    Ours
GPU          2.10    0.124        —
CPU          130     0.795        23e−06

Table 2. Processing time per frame in seconds, averaged over 100 frames of nuScenes. GPU: Nvidia GTX-1080 with 8 GB on-chip RAM; CPU: Intel Core i7 7820HK 2.9 GHz with 32 GB system RAM. Our algorithm's time is the post-initialization run-time for distance calculation. During initialization, our lane detector was also used, at 0.124 secs/frame (one-time execution per camera).

5. Conclusion

In this paper we have presented a single view geometry based relative distance estimation algorithm for road vehicles. We have compared our algorithm with data driven deep learning (DL) based methods. Supervised deep learning performed exceptionally well on the KITTI data. The compromised performance of the DL based algorithms on the nuScenes and Lyft datasets was perhaps due to domain variance, since the models were trained on KITTI data only. Therefore, although data driven approaches are the future, they do have their shortcomings. We instead proposed a geometry based alternative that is simple, intuitive, independent of data and performs almost equally well on all the tested datasets, keeping the RMSE within a very narrow range, i.e., 6.10 to 7.31. Our algorithm uses the ADAS lane detection module only for the initialization part, which is computationally expensive since it is usually a DL based method; but initialization is a one-off calculation and, after that, our distance estimation works in real time.

From this work, we also conclude that both geometric and deep learning frameworks are strong contenders to replace active sensors for ADAS systems as well as self-driving cars, which will greatly help in reducing the cost of the intelligent transport systems of the future.

Acknowledgements

This work was funded by KeepTruckin Inc. The authors greatly acknowledge the assistance of Mr. Numan Sheikh and Mr. Abdullah Hasan in the data collection and review process.

References

[1] Business Insider. Google Waymo reduces cost of lidars by 90%. https://www.businessinsider.com/googles-waymo-reduces-lidar-cost-90-in-effort-to-scale-self-driving-cars-2017-1, 2017.
[2] KITTI depth prediction benchmark. http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction.

[3] https://www.nuscenes.org/. nuScenes, 2018.
[4] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin. Fast Automatic Single-View 3-d Reconstruction of Urban Scenes. In European Conference on Computer Vision, 2008.
[5] Y. Cao, Z. Wu, and C. Shen. Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(11):3174–3182, nov 2018.
[6] A. Criminisi. Accurate Visual Metrology from Single and Multiple Uncalibrated Images. Springer-Verlag, Berlin, Heidelberg, 2001.
[7] A. Criminisi, I. Reid, and A. Zisserman. Single View Metrology. International Journal of Computer Vision, 40(2):123–148, nov 2000.
[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In IEEE Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2014.
[9] M. A. Fischler and R. C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 24(6):381–395, jun 1981.
[10] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep Ordinal Regression Network for Monocular Depth Estimation. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
[11] R. Garg, G. VijayKumar B., and I. D. Reid. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In European Conference on Computer Vision, 2016.
[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32:1231–1237, 2013.
[13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[14] R. Hartley and A. Zisserman. More Single View Geometry. In Multiple View Geometry in Computer Vision, chapter 8, pages 205, 216–219. Cambridge University Press, 2nd edition, 2003.
[15] R. Hartley and A. Zisserman. Projective Geometry and Transformations of 2D. In Multiple View Geometry in Computer Vision, chapter 2, pages 51–52. Cambridge University Press, 2nd edition, 2003.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[17] D. Hoiem, A. A. Efros, and M. Hebert. Geometric context from a single image. In 10th IEEE International Conference on Computer Vision, volume 1, pages 654–661, oct 2005.
[18] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, et al. Lyft Level 5 Dataset. https://level5.lyft.com/dataset/, 2019.
[19] M. Khader and S. Cherian. An Introduction to Automotive LIDAR (Texas Instruments). Technical report, 2018.
[20] H. Khan, A. Rafaqat, A. Hassan, A. Ali, W. Kazmi, and A. Zaheer. Lane detection using lane boundary marker network under road geometry constraints. In IEEE Winter Conference on Applications of Computer Vision, Snowmass, CO. IEEE, 2020.
[21] V. Kolmogorov and R. Zabih. Multi-camera Scene Reconstruction via Graph Cuts. In Proceedings of the 7th European Conference on Computer Vision-Part III, ECCV '02, pages 82–96, London, UK, 2002. Springer-Verlag.
[22] J. N. Kundu, P. K. Uppala, A. Pahuja, and R. V. Babu. AdaDepth: Unsupervised Content Congruent Adaptation for Depth Estimation. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
[23] L. Ladicky, C. Häne, and M. Pollefeys. Learning the Matching Function. CoRR, abs/1502.0, 2015.
[24] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. IEEE Conference on Computer Vision and Pattern Recognition, pages 2136–2143, 2009.
[25] Z. Li and N. Snavely. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In IEEE Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
[26] D. Liebowitz, A. Criminisi, and A. Zisserman. Creating Architectural Models from Images. Computer Graphics Forum, 18(3):39–50, 1999.
[27] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes. In International Conference on Computer Vision and Pattern Recognition. IEEE Comput. Soc, 1998.
[28] B. Liu, S. Gould, and D. Koller. Single image depth estimation from predicted semantic labels. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1253–1260, jun 2010.
[29] J. Oberhammer, N. Somjit, U. Shah, and Z. Baghchehsaraei. RF MEMS for automotive radar. In D. Uttamchandani, editor, Handbook of MEMS for Wireless and Mobile Applications, Woodhead Publishing Series in Electronic and Optical Materials, chapter 16, pages 518–549. Woodhead Publishing, 2013.
[30] J. Pan, M. Hebert, and T. Kanade. Inferring 3D layout of building facades from a single image. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2918–2926, 2015.
[31] X. Pan, J. Shi, P. Luo, X. Wang, and X. Tang. Spatial As Deep: Spatial CNN for Traffic Scene Understanding. In AAAI Conference on Artificial Intelligence, pages 7276–7283, 2018.
[32] K.-Y. Park and S.-Y. Hwang. Robust Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System. The Scientific World Journal, (923632):9, 2014.
[33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.

[34] A. Saxena, S. H. Chung, and A. Y. Ng. Learning Depth from Single Monocular Images. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS'05, pages 1161–1168, Cambridge, MA, USA, 2005. MIT Press.
[35] F. H. Sinz, J. Q. Candela, G. H. Bakir, C. E. Rasmussen, and M. O. Franz. Learning Depth from Stereo. In C. E. Rasmussen, H. H. Bülthoff, B. Schölkopf, and M. A. Giese, editors, Pattern Recognition, pages 245–252, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
[36] P. Sturm and S. Maybank. A Method for Interactive 3D Reconstruction of Piecewise Planar Objects from Single Images. In British Machine Vision Conference, pages 265–274, 1999.
[37] TuSimple. TuSimple Velocity Estimation Challenge. https://github.com/TuSimple/tusimple-benchmark/tre, 2017.
[38] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger. Sparsity Invariant CNNs. In International Conference on 3D Vision (3DV), 2017.
[39] B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1119–1127, jun 2015.
[40] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning Depth from Monocular Videos using Direct Methods. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
[41] Z. Yin and J. Shi. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[42] A. Zaheer and S. Khan. 3D Metric Rectification using Angle Regularity. IEEE Winter Conference on Applications of Computer Vision, pages 31–36, 2014.
[43] A. Zaheer, M. Rashid, M. A. Riaz, and S. Khan. Single-View Reconstruction using orthogonal line-pairs. Computer Vision and Image Understanding, 172:107–123, 2018.
[44] J. Žbontar and Y. LeCun. Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches. Journal of Machine Learning Research, 17(1):2287–2318, jan 2016.
[45] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[46] Y. Zhang and T. Funkhouser. Deep Depth Completion of a Single RGB-D Image. In IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
