arXiv
All content following this page was uploaded by Xinshuo Weng on 15 December 2020.
Xinshuo Weng, Yunze Man, Dazhi Cheng, Jinhyung Park, Matthew O’Toole, Kris Kitani
Robotics Institute, Carnegie Mellon University
{xinshuow, yman, dazhic, jinhyun1, motoole2, kkitani}@cs.cmu.edu
Abstract
Developing datasets that cover comprehensive sensors, annotations and full data distribution is important for innovating robust multi-sensor multi-task perception systems. Though many datasets have been released, they target different use cases such as 3D segmentation (SemanticKITTI), radar sensing (nuScenes), large-scale training (Waymo). As a result, we are still in need of a dataset that forms a union of various strengths of existing datasets. To address this challenge, we present the AIODrive dataset,
[Figure labels: AIODrive; High-density long-range point clouds; Point cloud segmentation; SemanticKITTI; Violation of traffic rules; Image segmentation; High-speed driving; Cityscape; A*3D; Accidents; SPAD-LiDAR; Rich maps; Large-scale 3D detection and tracking data; Argoverse; Adverse weather; Waymo]
dataset. Synthetic data generation is able to meet the challenges of creating a comprehensive perception dataset because: (1) a large amount of diverse data can be generated in simulation as the Carla simulator can change the density of traffic, velocity of agents, generate violations of traffic rules and change weather and lighting; (2) large amounts of annotation for a multitude of tasks can be generated by combining and post-processing Carla outputs. For example, we can project 2D semantic annotation to 3D given the depth image, resulting in 3D semantic annotation for point clouds. Then, combining with 3D bounding box annotation, 3D semantic annotation can be converted to 3D instance and panoptic segmentation; (3) a ‘physical’ yet affordable sensing platform can be constructed in simulation to change sensor configuration and even create sensors that are not yet available in public datasets (early prototypes are available in industry), e.g., long-range high-density LiDAR and SPAD-LiDAR as shown in Figure 2. These powerful sensors can help advance research in long-range perception.
[Figure 2 panels: Velodyne-64 Point Cloud (Bird’s Eye View); Our High-Density Long-Range Point Cloud]
Figure 2: (Top) Velodyne-64 [1] point cloud with 100k points and a range of 120m. (Bottom) Point cloud from our LiDAR sensor has 1M points and a range of 1km, which can be used to innovate long-range perception systems.
To summarize, our AIODrive dataset provides:
(1) Eight sensor modalities: 5 high resolution RGB cameras (1 stereo pair); 5 depth cameras, 1000 meter range LiDAR at multiple levels of density (up to 1M points), 1000 meter range SPAD-LiDAR, Radar, IMU, and GPS. Four of the sensors have 360° horizontal coverage (camera, LiDAR, SPAD-LiDAR, Radar);
(2) Annotations for mainstream perception tasks: 2D/3D semantic, instance and panoptic segmentation, fine-grained object categories, 2D/3D bounding boxes, object trajectories, velocity and acceleration;
(3) Diverse environmental variations: adverse weather and lighting, crowded scenes, people running, high-speed driving, violations of traffic rules, and car accidents.
Though synthetic data generation can be used to create a comprehensive dataset, one might argue that the domain gap between synthetic and real data is a weakness. In our defense, we believe that the usefulness of synthetic datasets is firmly predicated on a body of prior work [47, 38, 45, 22] that has shown that, when synthetic data is used correctly, it can enhance perception performance on real data. For example, [38] showed that using synthetic data for augmentation can improve performance for depth prediction on real NYU [50] and SUN RGB-D [51] datasets. [45] showed that using synthetic data created from Unity with free annotation of semantic segmentation can improve segmentation performance on real-world datasets such as KITTI [17], CamVid [10], LabelMe [46], CBCL [7]. Also, [47] showed that augmenting with LiDAR point clouds generated from the Carla simulator can improve bird’s eye view 2D detection performance on the real-world KITTI dataset. [22] showed that using GTA-V [44] to synthesize LiDAR point clouds for pre-training 3D object detectors can improve average precision by 5% on the KITTI dataset. Similar to the success of prior synthetic datasets, we believe that the usefulness of our dataset is also undoubted, as validated by our experiments on real datasets. Again, we emphasize that the role of our dataset is not to replace real datasets. Instead, it can be used in concert with real data, such as using our data to pre-train detectors to improve performance on real data or using our rare driving data as out-of-distribution test data.
The broader impact of our AIODrive dataset is its comprehensive nature, allowing for development and evaluation of multi-sensor multi-task perception systems that are sometimes not possible with existing datasets. The AIODrive dataset includes a super-set of sensors, annotations and environmental variations needed to develop novel perception systems. Also, our AIODrive dataset is the largest driving dataset with over 250k frames with 26M 2D/3D labeled object instances, roughly double the size of the Waymo dataset in terms of labeled instances. We will release our dataset to the community for free use and open a series of challenges, such as long-range 3D object detection based on our long-range high-density point cloud data.
2. Related Work
Perception Dataset. Sensors, environmental variations and annotations are key aspects of perception datasets for autonomous driving [57]. In terms of the annotation, KITTI [17] provides 2D/3D box trajectory labels on images and LiDAR data, enabling object detection, tracking and forecasting. To enable image segmentation research, Cityscape [14], Mapillary [39], Apolloscape [56], SYNTHIA [45] datasets are proposed, each having an increased number of annotated frames. For 3D segmentation, SemanticKITTI [6] released point-wise semantic labels on point clouds. As map information such as drivable area is useful to percep-
tion, Argoverse [13] provides rich map annotations to support perception algorithms that require map data.
In addition to annotations, perception datasets also need diverse environmental variations to capture rare driving situations. As prior datasets usually have a small number (<10) of agents per frame without complex interactions between agents, H3D [40] was released, with an average of 37 agents per frame to include data in highly-crowded scenarios with complex agent-agent interactions. To deal with adverse weather and lighting conditions, recent datasets such as CADC [42], nuScenes [12], A*3D [41], Waymo [53] collected data under rainy, snowy, foggy, dusky and night conditions. As prior datasets usually acquired data at a low driving speed (e.g., average 16 km/h in nuScenes), the A*3D dataset [41] was proposed to collect data at a much higher speed (e.g., 40-70 km/h), in order to include more high-speed driving data that is common in the real world.
Regarding the sensing modalities, nuScenes [12] collected the first dataset with Radar data, in addition to standard RGB camera, LiDAR, IMU, and GPS sensors. As earlier datasets collected data in the frontal direction only, ignoring objects to the sides or rear that are also important to decision-making in driving, Argoverse [13], Audi [18], and nuScenes [12] equip their vehicles with multiple LiDAR and camera sensors for 360° data capturing.
In comparison to existing datasets with a subset of sensors, annotations and environmental variations, we provide a super-set of sensors, annotations and environmental variations. Also, beyond standard LiDAR such as Velodyne-64 [1] used in prior datasets for data collection, we provide LiDAR sensors with 10× larger sensing range and 4 levels of point densities, with the highest level having 10× higher resolution (point density) than Velodyne-64. Importantly, the design of our long-range LiDAR sensors is not imaginary but based on active developments in new LiDAR sensors such as AlphaPrime [4], Ouster [5] and Panasonic [3], which are developed with higher-resolution and longer-range (e.g., 300m) depth sensing. In addition to providing LiDAR sensors, also referred to as APD-LiDAR (avalanche photodiodes), our dataset also provides a SPAD-LiDAR (single photon avalanche diode) sensor which records photon counts over space and time. This type of SPAD-LiDAR sensor, although available in industry [2, 11], is not found in public perception datasets for research purposes.
Synthetic Data Generation. Though many existing simulators (e.g., Sim4CV [37], Nvidia Drive [8]) can be used for synthetic data generation, most of these simulators are not open-source (not easy to make modifications) and a free-to-use license is not available (i.e., derivative products are not allowed). Among the open-source simulators, AirSim [48] and Carla [16] are popular due to detailed documentation and diverse sensors. However, AirSim does not allow low-level control over every agent in the way that Carla allows, though AirSim has advantages in aerial data capture. In addition to simulators, commercial video games such as GTA-V [44] can also be used for synthetic data generation but they do not allow low-level control of scene elements. Accordingly, we have selected Carla for data generation as it affords the most flexibility and customization.
Long-Range Perception. Increasing the maximum sensing range of perception systems is important for safety in high-speed driving scenarios. However, LiDAR used in existing datasets has limited range, e.g., 120m in KITTI [17], 70m in nuScenes [12], 75m in Waymo [53]. Assuming perfect detection accuracy and zero algorithmic latency, a car moving at a speed of 120km/h will have at most 3.6 seconds to respond to a detected obstacle with a 120m-range LiDAR. Naturally, enabling perception at a longer range is preferred for increased safety. To the best of our knowledge, [73] is the only work exploring a scenario with up to 300m of depth sensing using three high-resolution RGB cameras. In contrast, our work uses a simulator to collect long-range high-density point clouds. We believe that our data can aid the development of long-range perception algorithms before data from real-world long-range sensors become widely available to the research community.
3. The AIODrive Dataset
For each scene in our dataset, we choose one of eight cities from Carla assets and sample locations covering the entire city to generate agents. For each agent (vehicles, people), we set a random and faraway target destination to generate diverse trajectories. We randomly customize the behavior (e.g., maximum speed, how often to ignore red lights, how often to cross the road) for each agent to increase the diversity of the data. Once the environment is set up, we randomly select a vehicle as our ego-vehicle and equip our full sensor suite to this vehicle for data recording. For agents who have approached their destinations, we provide them another faraway destination so that there is no dummy agent in our environment. We collected 250 such scenes in our dataset, each containing 1000 frames with a full set of annotations. As shown in Table 1, our dataset has the largest number of annotated frames compared to other datasets.
3.1. Comprehensive Sensor Suite
To increase robustness to sensor failure, multi-sensor perception approaches [29, 43, 65, 59, 69, 30, 25] are often more favorable than single-sensor approaches [49, 36, 60, 71, 62]. To innovate multi-sensor approaches, it is crucial that datasets provide comprehensive sensing modalities. To that end, we provide common sensors such as RGB, Depth, Stereo camera, LiDAR, IMU and GPS, as well as the Radar and SPAD-LiDAR sensors, which are often not available in prior work as shown in Table 1 (except for nuScenes providing the Radar data). To the best of our knowledge, we are
Table 1: Comparison of size and sensor modalities. Our dataset has the most comprehensive sensors while being the largest.
Dataset # of cities # of hours # of sequences # of annotated images Stereo Depth LiDAR Radar SPAD-LiDAR IMU/GPS All 360°
KITTI [17] 1 1.5 22 15k ✓ ✓ ✓ ✓
Cityscape [14] 27 2.5 0 5k ✓ ✓
Mapillary Vistas [39] 30 - - 25k
ApolloScape [21, 56] 4 - - 140k ✓ ✓ ✓
SYNTHIA [45] 1 2.2 4 200k ✓ ✓
H3D [40] 4 0.8 160 27k ✓ ✓
SemanticKITTI [6] 1 1.2 22 43k ✓
DrivingStereo [52] - 5 42 180k ✓ ✓ ✓ ✓
Argoverse [13] 2 0.6 113 22k ✓ ✓ ✓ ✓
EuroCity [9] 31 0.4 - 47k
CADC [42] 1 0.6 75 7k ✓ ✓
Audi [18] 3 0.3 3 12k ✓ ✓ ✓ ✓ ✓
nuScenes [12] 2 5.5 1k 40k ✓ ✓ ✓
A*3D [41] 1 55 - 39k ✓ ✓
Waymo Open [53] 3 6.4 1150 230k ✓
Ours (AIODrive) 8 6.9 250 250k ✓ ✓ ✓ ✓ ✓ ✓ ✓
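The size figures in the AIODrive row of Table 1 can be cross-checked from the collection setup stated in the text (250 scenes, 1000 frames per scene, 10Hz capture). The following is a minimal sketch of that arithmetic; the function and constant names are illustrative, and it assumes every scene has exactly 1000 frames recorded at a constant rate.

```python
# Collection setup as stated in the text; names are illustrative only.
NUM_SCENES = 250          # scenes collected
FRAMES_PER_SCENE = 1000   # frames per scene
CAPTURE_HZ = 10           # capture frequency shared by all sensors


def dataset_size(num_scenes, frames_per_scene, capture_hz):
    """Return (total_frames, hours_of_driving) for a fixed-rate recording."""
    total_frames = num_scenes * frames_per_scene
    hours = total_frames / capture_hz / 3600.0
    return total_frames, hours


total_frames, hours = dataset_size(NUM_SCENES, FRAMES_PER_SCENE, CAPTURE_HZ)
print(total_frames, round(hours, 1))
```

Under these assumptions the result is 250,000 frames and about 6.9 hours, in line with the "250k" and "6.9" entries of the table.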
the first to provide the SPAD-LiDAR data in public perception datasets. Also, our camera, LiDAR, Radar and SPAD sensors all have 360° horizontal field of view (FoV).
Sensor Specifications. We show sensor descriptions in Table 2. Our sensor suite contains five (four for 360° sensing and one for stereo) RGB and five depth cameras, as well as three LiDAR, one Radar, one SPAD-LiDAR and IMU/GPS sensors. All sensors are synchronized and we use the same capturing frequency of 10Hz for all sensors.
Sensor Layout and Coordinate System. We follow KITTI and use the right-hand rule for coordinate systems. Specifically, for the camera coordinate system, we use the x axis for the right, the y axis pointing downward and the z axis for the front direction. For the LiDAR/Radar and IMU/GPS coordinate system, we use the x axis for the front, the y axis for the left and the z axis pointing upward. We summarize sensor layout and coordinate systems in Figure 3. To avoid transforming coordinates between LiDAR, Radar, IMU and GPS sensors, we place these sensors at the same location (on top of the ego-vehicle) for convenience, which is possible in the simulator.
High-Density Long-Range Point Cloud. To ensure safety in high-speed driving scenarios, long-range perception [73] is critical. To innovate long-range perception systems, we as a community need public datasets that collect data using longer-range LiDAR sensors than the standard 120m-range Velodyne-64 [1]. In anticipation of new high-density long-range LiDAR sensors such as AlphaPrime [4], OS2 [5] and Panasonic [3], we simulate LiDAR sensors with similar specifications to aid the development of long-range perception systems. Specifically, we provide three LiDAR sensors, with resolutions (densities) of 100k, 600k and 1M points per frame, respectively. Each point in the cloud is a tuple of (x, y, z, r), where (x, y, z) is the 3D location. Also, r is the simulated reflectance (also called intensity) value, which depends on many factors such as the sensor’s attenuation factor, distance of the point, and color of the reflection surface. The first LiDAR with 100k points and a range of 120m is to mimic the Velodyne-64, and the other two high-density long-range LiDARs are provided to innovate long-range perception systems. All LiDARs are spinning and collecting point clouds via ray-casting. To increase the realism of the LiDAR point clouds, two augmentation mechanisms are used: (1) we randomly drop a small portion of points based on their intensity values, i.e., the lower the intensity is, the higher the probability of being dropped; (2) we randomly perturb a small portion of points along the direction of the laser ray, creating noisy distance measurements.
In addition to high-density point clouds from LiDAR, we also provide point clouds obtained from depth images. Specifically, we project depth images from five cameras to 3D space to obtain five point clouds, and then fuse them to obtain a full-surround point cloud with 4M points and 1km sensing range (see supplementary for details). We refer to this point cloud obtained from depth images as the depth point cloud. We show a comparison of the Velodyne-64 and depth point cloud in Figure 4. For a car at 130 meters, depth point cloud can capture a decent number of points
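The two LiDAR augmentation mechanisms described in this section (intensity-dependent dropout and range noise along the laser ray) can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset's actual implementation: the linear drop rule, the Gaussian noise model, and the parameters `drop_scale`, `perturb_prob` and `noise_std` are all hypothetical, and the sensor is assumed to sit at the origin with intensity r in [0, 1].

```python
import math
import random


def augment_lidar(points, drop_scale=0.1, perturb_prob=0.05,
                  noise_std=0.02, seed=0):
    """Apply dropout and range noise to a list of (x, y, z, r) points.

    Assumptions (hypothetical, not taken from the paper): intensity r lies
    in [0, 1], the sensor is at the origin, the drop probability is
    drop_scale * (1 - r), and range noise is Gaussian with std noise_std.
    """
    rng = random.Random(seed)
    out = []
    for x, y, z, r in points:
        # (1) Intensity-dependent dropout: lower intensity, higher drop chance.
        if rng.random() < drop_scale * (1.0 - r):
            continue
        # (2) Range noise: move the point along its ray (origin -> point).
        dist = math.sqrt(x * x + y * y + z * z)
        if dist > 0.0 and rng.random() < perturb_prob:
            scale = (dist + rng.gauss(0.0, noise_std)) / dist
            x, y, z = x * scale, y * scale, z * scale
        out.append((x, y, z, r))
    return out
```

Because the perturbation only rescales the vector from the sensor to the point, a perturbed point stays on its original ray, which is what yields a noisy distance measurement rather than a noisy direction.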
Table 3: Comparison of annotation availability. We provide the most diverse annotations for all mainstream perception tasks.
Dataset # of 2D bounding boxes # of 3D bounding boxes Trajectory Image seg. Point cloud seg. Motion dynamics Fine-grained object class Map
KITTI [17] 80k 80k ✓
Cityscape [14] 65k - ✓
Mapillary Vistas [39] 200k - ✓
ApolloScape [21, 56] 2.5M 70k ✓ ✓
SYNTHIA [45] - - ✓
H3D [40] - 1M ✓
SemanticKITTI [6] - - ✓
DrivingStereo [52] - -
Argoverse [13] - 993k ✓ ✓
EuroCity [9] 238k -
CADC [42] - 344k
Audi [18] - 42k ✓ ✓
nuScenes [12] - 1.4M ✓ ✓
A*3D [41] - 230k
Waymo Open [53] 9.9M 12M ✓
Ours (AIODrive) 26M 26M ✓ ✓ ✓ ✓ ✓ ✓
Figure 5: 2D-3D Bounding Box Trajectory Annotation. For each agent, we provide both the 2D (left) and 3D (right) tight
box annotation, along with a unique identity across videos. We denote different object identities with a different color.
Figure 6: 2D-3D Segmentation Annotation. We provide both the 2D image (top) and 3D point cloud (bottom) segmentation
annotation. From left to right, we show semantic, instance and panoptic segmentation respectively.
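The introduction notes that 2D semantic annotation can be projected to 3D given the depth image, which is how per-pixel labels like those in the top row of Figure 6 can become per-point labels. Below is a minimal sketch of that step for a single camera. It assumes a pinhole model with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, depth stored as distance along the camera z axis, and the camera frame described in the paper (x right, y down, z forward); none of these names come from the dataset's released code.

```python
def unproject_labels(depth, labels, fx, fy, cx, cy, max_depth=1000.0):
    """Lift per-pixel semantic labels to labeled 3D points (camera frame).

    depth[v][u] is the metric depth along the camera z axis and
    labels[v][u] is the semantic class id of the same pixel.
    Returns a list of (x, y, z, label) with x right, y down, z forward.
    """
    points = []
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            # Skip invalid pixels and pixels beyond the sensing range.
            if z <= 0.0 or z > max_depth:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, label))
    return points


# Tiny 2x2 example with made-up intrinsics.
depth = [[2.0, 2.0], [4.0, 0.0]]
labels = [[1, 1], [2, 0]]
cloud = unproject_labels(depth, labels, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Repeating this for each camera and transforming the results into a common frame would give a labeled full-surround point cloud; combining the labels with 3D bounding boxes to obtain instance and panoptic labels is a further step not shown here.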
mentation, video object segmentation, point cloud segmentation, multi-object tracking and segmentation (MOTS) [55] and multi-object panoptic tracking (MOPT) [23].
Other labels. In addition to the above mainstream annotations, we also provide other annotations: (1) motion data for all agents including linear velocity, acceleration, and angular velocity. Our motion data can be useful for ego-motion estimation, velocity estimation and tracking; (2) fine-grained object class labels such as the vehicle model classes Audi A2, Toyota Prius and Tesla Model 3; (3) vehicle control signals such as throttle, steer, brake, and reverse; (4) city map and road structure, which is useful for localization, odometry and trajectory forecasting. Also, our large-scale dataset with point clouds and depth images can be used for point cloud forecasting [63] and depth estimation [35]. See supplementary for details of other annotations.
3.3. High Environmental Variations
To train perception systems robust to rare driving scenarios such as adverse weather, violations of traffic rules and car accidents, it is important to first include a large number of these rare driving data in the dataset. However, collecting a large number of such data is difficult in the real world because these events rarely happen and can be dangerous or costly to capture, especially car accidents. To collect such rare driving data without causing any danger, we leverage the simulator to intentionally generate rare data and increase our environmental variations. We compare the environmental variations between datasets in Table 4. Though recent datasets mostly cover adverse weather/lighting conditions, some are limited by having too few agents. Also, existing datasets often have cars driving at a low speed and barely have data on violations of traffic rules, let alone car accidents. Instead, our dataset contains all these rare driving scenarios and has the highest environmental variations.
Table 4: Comparison of environmental variations. We provide the most variations with many rare driving scenarios.
Dataset Adv. wea./light. Crowded High-speed Vio. of rule Acci.
KITTI [17]
Cityscape [14]
Mapillary Vistas [39] ✓
ApolloScape [21, 56] ✓
SYNTHIA [45] ✓
H3D [40] ✓
SemanticKITTI [6]
DrivingStereo [52] ✓
Argoverse [13] ✓
EuroCity [9] ✓
CADC [42] ✓ ✓
Audi [18] ✓
nuScenes [12] ✓ ✓
A*3D [41] ✓ ✓
Waymo Open [53] ✓ ✓
Ours (AIODrive) ✓ ✓ ✓ ✓ ✓
Crowded Scenes. Driving in crowded scenes is challenging as interactions between agents are complex and collisions might happen. To address the challenge, datasets with highly crowded scenes are needed in order to train perception systems robust to these scenarios. To that end, we collect many scenes with a high agent density. On average, we have 104 agents per frame within the sensing range of our sensors. We show the comparison of agents per frame and total labeled instances between datasets in Figure 7 (a). Note that some datasets such as KITTI and Cityscape have a significantly lower number of labeled instances because they only label objects in front. Though existing datasets such as
Table 5: Quantitative results of 2D and 3D object detection baselines on the test split of our AIODrive dataset.
Method | Input Data | Output Modalities | Car (Easy / Moderate / Hard) | Pedestrian (Easy / Moderate / Hard) | Cyclist (Easy / Moderate / Hard)
FPN [31] | RGB from 5 cameras | 2D | 89.45 / 78.66 / 69.51 | 92.88 / 87.28 / 75.50 | 94.15 / 90.80 / 72.10
PointRCNN [49] | Depth point cloud | 3D | 78.13 / 77.99 / 73.63 | 58.73 / 53.71 / 44.74 | 59.03 / 53.85 / 49.36
PointPillars [26] | Depth point cloud | 3D | 80.86 / 77.39 / 69.77 | 55.37 / 47.79 / 40.94 | 60.72 / 50.20 / 46.35
SECOND [70] | Depth point cloud | 3D | 81.35 / 79.38 / 70.57 | 62.32 / 59.23 / 54.34 | 61.45 / 58.49 / 52.86
[Figure 7 panels: (a) High Crowdedness, comparing Waymo, Audi, Cityscape, KITTI, A*3D, EuroCity, CADC, Argoverse, H3D, nuScenes and Ours; (b) Driving Speed Distribution, percentage vs. driving speed (km/h), 0 to 120; (c) People Speed Distribution, percentage vs. pedestrian speed (km/h), 0 to 14]
Figure 7: Data Statistics: (a) We compare agent density in terms of agents per frame and total labeled agents, which shows
that our dataset has more labeled instances; (b)(c) We compare the speed of the ego-vehicle and pedestrians, showing that
our data is collected at the speed closer to our normal daily driving, and we have more jogging and running pedestrians.
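The agent densities compared in Figure 7 (a) can be roughly approximated from the tables in this paper by dividing the number of 3D bounding boxes (Table 3) by the number of annotated images (Table 1). The sketch below does this for the datasets named in the crowded-scenes discussion. It is only an approximation: it treats "annotated images" as frames, which may not hold for every dataset, and the dictionary and function names are illustrative.

```python
# (annotated frames from Table 1, 3D bounding boxes from Table 3)
STATS = {
    "H3D": (27_000, 1_000_000),
    "nuScenes": (40_000, 1_400_000),
    "Argoverse": (22_000, 993_000),
    "Waymo Open": (230_000, 12_000_000),
    "AIODrive": (250_000, 26_000_000),
}


def agents_per_frame(frames, boxes):
    """Average number of labeled 3D boxes per annotated frame."""
    return boxes / frames


for name, (frames, boxes) in STATS.items():
    print(f"{name}: {agents_per_frame(frames, boxes):.0f} agents per frame")
```

Under this approximation the AIODrive row gives 104 agents per frame, matching the average quoted in the text, and H3D gives about 37, matching the figure cited for it in the related work.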
H3D, nuScenes, Waymo and Argoverse also have crowded scenes, about 30 to 50 agents per frame, our dataset is twice as crowded and has a much higher number of labeled objects.
High-Speed Driving. Existing datasets often collect data at a low driving speed (e.g., nuScenes at 16km/h on average), which is significantly different from our normal daily driving speed, i.e., 30 to 60km/h on local roads and 80 to 120km/h on highways. To bridge the gap and mimic our daily driving, we collect data by driving our ego-vehicle at a much higher speed as shown in Figure 7 (b). Specifically, our driving speed ranges from 0 to 130 km/h.
Other Variations. Besides the above variations, we also provide many other rare driving data such as adverse weather and lighting (e.g., rainy, foggy and night), car accidents, vehicles that run red lights, exceed the speed limit and change lanes aggressively, and children and adults jogging and running. Though these cases happen in the real world, they barely exist in existing datasets. To build robust perception systems, it is important to include these rare scenarios in the dataset. As an example, we show the pedestrian speed in Figure 7 (c), which contains jogging and running people. See supplementary for details of other variations.
4. Experiments
To enable comparison with future work, we benchmark several baselines for 2D-3D object detection on our dataset. For evaluation protocol and data split, please refer to supplementary. Our code and training data will be released so that AIODrive can be used to benchmark future methods, while a test set will remain private for fair comparisons.
4.1. 2D Object Detection Evaluation
We use FPN [31] with a ResNet50 [19] backbone as the baseline, where the backbone is pre-trained on ImageNet [15] and COCO [32]. We then fine-tune the baseline on AIODrive. The results are shown in the 1st row of Table 5. We can see that FPN’s performance is reasonable but lower than its performance on KITTI, e.g., 93.53/89.35/79.35 for car in the easy/moderate/hard level. We believe this is because: (1) our evaluation requires detection at a larger range (more difficult) than KITTI, e.g., our ‘hard’ level requires detection of objects up to 120 meters while the KITTI ‘hard’ level requires detection up to 70 meters; (2) AIODrive has a much higher object density than KITTI. As a result, there will be more objects occluded in the images which are hard to detect. With the challenges of long-range detection and detection in crowded scenes, we hope that future work can be encouraged to push performance higher on our dataset.
4.2. 3D Object Detection Evaluation
Baselines. We use LiDAR-based 3D object detection methods such as PointRCNN [49], PointPillars [26], SECOND [70] as baselines. See supp. for implementation details.
Results on AIODrive with Depth Point Clouds. To reach the best performance we can, we first use our densest point cloud (i.e., depth point cloud) as input to baselines. As our point clouds have a longer range than prior datasets such as KITTI, we change the input point cloud range of detectors from 0-70m in the frontal direction used in KITTI to 120m for all directions, to enable perception at a larger range.
We summarize the results in Table 5. We can see that all 3D detection baselines achieve reasonable performance on our AIODrive dataset. Also, performance tends to decrease significantly from the ‘easy’ to the ‘moderate’ to the ‘hard’ level where the required detection range is increasing (see supp. for detailed evaluation protocol). Again, this shows that detection at a longer range is harder than detection of nearby objects. We hope that our high-density long-range
Table 6: Quantitative results of 3D detection using point cloud with different densities in our AIODrive dataset.
Method | Point Density (# of points) | Car (Easy / Moderate / Hard) | Pedestrian (Easy / Moderate / Hard) | Cyclist (Easy / Moderate / Hard)
PointRCNN [49] | 100,000 (Velodyne-64 LiDAR p.c.) | 74.98 / 72.73 / 53.85 | 45.31 / 37.37 / 34.66 | 56.95 / 50.70 / 42.96
PointRCNN [49] | 600,000 (Long-range LiDAR p.c.) | 76.74 / 75.17 / 69.76 | 56.39 / 50.14 / 40.38 | 58.71 / 52.37 / 46.83
PointRCNN [49] | 1,000,000 (Long-range LiDAR p.c.) | 77.71 / 77.26 / 71.17 | 58.16 / 51.92 / 43.81 | 59.64 / 52.61 / 47.73
PointRCNN [49] | 4,000,000 (Depth p.c.) | 78.13 / 77.99 / 73.63 | 58.73 / 53.71 / 44.74 | 59.03 / 53.85 / 49.36
PointRCNN [49] | 1,000,000 (SPAD-LiDAR p.c.) | 77.83 / 71.41 / 63.30 | 59.88 / 53.43 / 44.79 | 61.10 / 55.69 / 48.80
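One way to read the density trend in Table 6 is to difference each row against the 100k-point (Velodyne-64-like) baseline. The sketch below does this for the car class only, using the numbers copied from the table; the SPAD-LiDAR row is left out because it comes from a different sensor type. The dictionary and function names are illustrative.

```python
# Car AP (easy, moderate, hard) for PointRCNN, copied from Table 6.
CAR_AP = {
    100_000: (74.98, 72.73, 53.85),
    600_000: (76.74, 75.17, 69.76),
    1_000_000: (77.71, 77.26, 71.17),
    4_000_000: (78.13, 77.99, 73.63),
}


def gain_over_baseline(table, baseline_key):
    """Per-level AP difference of every row relative to the baseline row."""
    base = table[baseline_key]
    return {
        key: tuple(round(v - b, 2) for v, b in zip(values, base))
        for key, values in table.items()
    }


gains = gain_over_baseline(CAR_AP, 100_000)
for density, (easy, moderate, hard) in gains.items():
    print(f"{density:>9,d} points: easy +{easy}, moderate +{moderate}, hard +{hard}")
```

For the 4M-point depth point cloud this gives gains of 3.15, 5.26 and 19.78 AP at the easy, moderate and hard levels, so most of the benefit of higher density appears at the hard (long-range) level.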
Table 7: 3D detection results on the real world KITTI dataset when training is augmented with our AIODrive dataset.
Method | Training Data | Car (Easy / Moderate / Hard) | Pedestrian (Easy / Moderate / Hard) | Cyclist (Easy / Moderate / Hard)
PointRCNN [49] | 250k frames AIODrive | 65.32 / 46.21 / 39.38 | 24.57 / 19.04 / 18.32 | 40.93 / 30.41 / 26.68
PointRCNN [49] | KITTI | 85.02 / 75.16 / 68.14 | 46.53 / 38.76 / 33.96 | 73.40 / 56.73 / 51.87
PointRCNN [49] | KITTI + 10k frames AIODrive | 87.24 / 76.83 / 70.53 | 46.97 / 40.78 / 36.03 | 74.19 / 59.31 / 52.93
PointRCNN [49] | KITTI + 250k frames AIODrive | 88.10 / 77.03 / 72.41 | 51.03 / 42.18 / 37.26 | 78.01 / 60.14 / 52.89
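The "KITTI + AIODrive" rows of Table 7 come from training batches that contain the same number of frames from each dataset, with AIODrive frames randomly sampled when it has more frames than KITTI. A minimal sketch of such a batch sampler follows. The function name, the choice to define one pass by the real (KITTI) data, dropping an incomplete final batch, and re-sampling synthetic frames independently for every batch are all assumptions made for illustration; the paper only states the equal split and the random sampling.

```python
import random


def balanced_batches(real_frames, synth_frames, batch_size, seed=0):
    """Yield batches holding batch_size // 2 frames from each dataset."""
    assert batch_size % 2 == 0, "batch size must be even to split equally"
    half = batch_size // 2
    rng = random.Random(seed)
    real = list(real_frames)
    rng.shuffle(real)
    # One pass over the real data; an incomplete final batch is dropped.
    for start in range(0, len(real) - half + 1, half):
        real_part = real[start:start + half]
        # Randomly draw the same number of frames from the larger synthetic pool.
        synth_part = rng.sample(synth_frames, half)
        yield real_part + synth_part


# Usage with placeholder frame ids.
kitti = [f"k{i}" for i in range(6)]
aiodrive = [f"a{i}" for i in range(100)]
batches = list(balanced_batches(kitti, aiodrive, batch_size=4))
```

Keeping the split equal in every batch means the synthetic data cannot dominate the gradient signal even when it outnumbers the real data by a large factor.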
point clouds can be used to encourage future research towards improving long-range 3D object detection.
Effect of Point Cloud Density. In the above experiment, we have benchmarked 3D detection performance of several baselines using our depth point clouds. To show the effect of point cloud density, we now evaluate the same detector using point clouds with different density levels. We emphasize that this experiment is unique to our dataset as only we provide (LiDAR and depth) point clouds with different density levels, e.g., 100k, 600k, 1M, 4M points per frame. Also, we adapt PointRCNN and show the first 3D detection baseline that works with SPAD-LiDAR point cloud inputs.
We summarize the results in Table 6. We can see that using (LiDAR and depth) point clouds with a higher density as input generally achieves higher performance, especially in the ‘hard’ level which includes faraway objects up to 120m. This suggests that high-density long-range point clouds could be helpful for improving 3D detection at a longer range. Also, for LiDAR and depth point clouds with different densities, we found that the differences of performance in the ‘easy’ level are not significant (except for pedestrians). This shows that, for cars and cyclists, the main performance bottleneck of 3D detection at nearby range (up to 40 meters in the ‘easy’ level) may not be point cloud density but other factors such as model capacity. In contrast, detection for nearby pedestrians can be significantly improved using point clouds with a higher density.
We also note that we observed a different performance pattern when using SPAD-LiDAR (the last row in Table 6), which tends to achieve higher performance for pedestrians and cyclists (small objects) and lower performance for cars (large objects). We hypothesize that the higher performance for small objects may be due to the larger fill factor of the SPAD-LiDAR compared to APD-LiDAR (see supp. for details). However, it is not fully clear why performance drops for cars. We hypothesize that it is because our method of using SPAD-LiDAR by merging multiple point cloud returns (see supp. for details) does not fully exploit multi-echo information in the raw 3D tensor data. Future work is needed to fully leverage the SPAD-LiDAR data for 3D detection.
Results on Real-World KITTI Data. Last but not least, we investigate whether using our dataset can improve performance on real data. To that end, we augment the KITTI training data with the data from our dataset to train PointRCNN [49]. This data augmentation is achieved by equally (same number of frames) combining data from the two datasets in every batch of training. In the case where we have more frames in total from AIODrive than KITTI, we randomly sample frames from AIODrive and still maintain an equal number of frames from the two datasets in every batch.
We follow the KITTI evaluation on the test set and summarize the results in Table 7. We can see that PointRCNN trained with only KITTI data (the 2nd row) achieves similar performance for car as reported in [49]. Also, PointRCNN trained with only synthetic AIODrive data (the 1st row) achieves lower performance on KITTI compared to training with the KITTI data. This suggests that a domain gap exists between the two datasets. Importantly, when we augment training data by combining data from the two datasets (the 3rd and 4th rows), we observed clear performance improvements. This indicates that our synthetic data can be used in concert with real data to improve performance on the real data. Moreover, higher performance is generally achieved if more augmented frames (e.g., 250k vs. 10k frames) are used. The best performance in nearly all settings is achieved when both KITTI and all data from AIODrive are used for training (the exception in Table 7 is the hard cyclist level, 52.89 vs. 52.93).
5. Conclusion
We proposed a dataset with the most diverse annotations, environmental variations and sensors. Our dataset can support all mainstream perception tasks and innovate multi-task multi-sensor perception systems. Also, we confirmed that our high-density long-range point clouds can be used to improve long-range perception. To enable public comparison and encourage future research in long-range perception, our full dataset and accompanying code will be released.
References
[1] High Definition Real-Time 3D Lidar. https://velodynelidar.com/products/hdl-64e/. 2, 3, 4
[2] ON Semiconductor to Demonstrate Long-range and In-Vehicle Automotive Imaging and Detection Technology. https://www.onsemi.com/PowerSolutions/newsItem.do?article=4444. 3, 5
[3] Panasonic Develops Long-Range TOF Image Sensor. https://news.panasonic.com/global/press/data/2018/06/en180619-3/en180619-3.html. 3, 4
[4] The Alpha Prime Delivers Unrivaled Combination of Field-of-View, Range, and Image Clarity. https://velodynelidar.com/products/alpha-prime/. 3, 4
[5] The OS2 Delivers Long-Range, High-Resolution 3D Sensing. https://ouster.com/products/os2-lidar-sensor/. 3, 4
[6] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Juergen Gall. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. ICCV, 2019. 1, 2, 4, 5, 6
[7] S. Bileschi. CBCL Streetscenes Challenge Framework, 2007. 2
[8] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. arXiv:1604.07316, 2016. 3
[9] Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M. Gavrila. The EuroCity Persons Dataset: A Novel Benchmark for Object Detection. TPAMI, 2019. 4, 5, 6
[10] Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic Object Classes in Video: A High-Definition Ground Truth Database. Pattern Recognition Letters, 2009. 2
[11] Rebecca Brown, Preston Hartzell, and Craig Glennie. Evaluation of SPL100 Single Photon Lidar Data. Remote Sensing, 2020. 3, 5
[12] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, and Qiang Xu. nuScenes: A Multimodal Dataset for Autonomous Driving. CVPR, 2020. 1, 3, 4, 5, 6
[13] Ming-fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, B Sławomir, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3D Tracking and Forecasting with Rich Maps. CVPR, 2019. 1, 3, 4, 5, 6
[14] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. CVPR, 2016. 2, 4, 5, 6
[15] Jia Deng, Wei Dong, Richard Socher, Li-jia Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. CVPR, 2009. 7
[16] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An Open Urban Driving Simulator. CoRL, 2017. 1, 3
[17] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012. 2, 3, 4, 5, 6
[18] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, Tiffany Fernandez, Martin Jänicke, Sudesh Mirashi, Chiragkumar Savani, Martin Sturm, Oleksandr Vorobiov, Martin Oelker, Sebastian Garreis, and Peter Schuberth. A2D2: Audi Autonomous Driving Dataset. arXiv:2004.06320, 2020. 3, 4, 5, 6
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. 7
[20] Felix Heide, Matthew O’Toole, Kai Zang, David B Lindell, Steven Diamond, and Gordon Wetzstein. Non-Line-of-Sight Imaging with Partial Occluders and Surface Normals. ACM Transactions on Graphics, 2019. 5
[21] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The ApolloScape Dataset for Autonomous Driving. CVPRW, 2018. 1, 4, 5, 6
[22] Braden Hurl, Krzysztof Czarnecki, and Steven Waslander. Precise Synthetic Image and LiDAR (PreSIL) Dataset for Autonomous Vehicle Perception. IV, 2019. 2
[23] Juana Valeria Hurtado, Rohit Mohan, and Abhinav Valada. MOPT: Multi-Object Panoptic Tracking. arXiv:2004.08189, 2020. 6
[24] Hiroaki Ishioka, Xinshuo Weng, Yunze Man, and Kris Kitani. Single Camera Worker Detection, Tracking and Action Recognition in Construction Site. ISARC, 2020. 5
[25] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Waslander. Joint 3D Proposal Generation and Object Detection from View Aggregation. IROS, 2018. 3
[26] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast Encoders for Object Detection from Point Clouds. CVPR, 2019. 7
[27] Namhoon Lee, Xinshuo Weng, Vishnu Naresh Boddeti, Yu Zhang, Fares Beainy, Kris Kitani, and Takeo Kanade. Visual Compiler: Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator. arXiv:1612.05234, 2016. 5
[28] Yu-Jhe Li, Xinshuo Weng, and Kris Kitani. Learning Shape Representations for Person Re-Identification under Clothing Change. WACV, 2021. 5
[29] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urtasun. Multi-Task Multi-Sensor Fusion for 3D Object Detection. CVPR, 2019. 3
[30] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. ECCV, 2018. 3
[31] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR, 2017. 7
[32] Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. ECCV, 2014. 7
[33] David B Lindell, Matthew O’Toole, and Gordon Wetzstein. Single-Photon 3D Imaging with Deep Sensor Fusion. ACM Transactions on Graphics, 2018. 5
[34] David B Lindell, Gordon Wetzstein, and Matthew O’Toole. Wave-Based Non-Line-of-Sight Imaging Using Fast FK Migration. ACM Transactions on Graphics, 2019. 5
[35] Reza Mahjourian, Martin Wicke, and Anelia Angelova. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. CVPR, 2018. 6
[36] Aashi Manglik, Xinshuo Weng, Eshed Ohn-bar, and Kris M Kitani. Forecasting Time-to-Collision from Monocular Video: Feasibility, Dataset, and Challenges. IROS, 2019. 3
[37] M Matthias, Casser Jean, Lahoud Neil, and C V Mar. Sim4CV: A Photo-Realistic Simulator for Computer Vision Applications. IJCV, 2018. 3
[38] Maxim Maximov, Kevin Galim, and Laura Leal-Taixé. Focus on defocus: bridging the synthetic to real domain gap for depth estimation. CVPR, 2020. 2
[39] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. ICCV, 2017. 2, 4, 5, 6
[40] Abhishek Patil, Srikanth Malla, Haiming Gang, and Yi-Ting Chen. The H3D Dataset for Full-Surround 3D Multi-Object Detection and Tracking in Crowded Urban Scenes. ICRA, 2019. 1, 3, 4, 5, 6
[41] Quang-hieu Pham, Pierre Sevestre, Ramanpreet Singh Pahwa, Huijing Zhan, Chun Ho Pang, Yuda Chen, Armin Mustafa, Vijay Chandrasekhar, and Jie Lin. A*3D Dataset: Towards Autonomous Driving in Challenging Environments. ICRA, 2020. 3, 4, 5, 6
[42] Matthew Pitropov, Danson Garcia, Jason Rebello, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and Steven Waslander. Canadian Adverse Driving Conditions Dataset. arXiv:2001.10117, 2020. 1, 3, 4, 5, 6
[43] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum PointNets for 3D Object Detection from RGB-D Data. CVPR, 2018. 3
[44] Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun. Playing for Benchmarks. ICCV, 2017. 2, 3
[45] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. CVPR, 2016. 2, 4, 5, 6
[46] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A Database and Web-Based Tool for Image Annotation. IJCV, 2008. 2
[47] Ahmad El Sallab, Ibrahim Sobh, Mohamed Zahran, and Mohamed Shawky. Unsupervised Neural Sensor Models for Synthetic LiDAR Data Augmentation. NeurIPSW, 2019. 2
[48] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. Field and Service Robotics, 2017. 3
[49] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. CVPR, 2019. 3, 7, 8
[50] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor Segmentation and Support Inference from RGBD Images. ECCV, 2012. 2
[51] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. CVPR, 2015. 2
[52] Xiao Song, Chaoqin Huang, Zhidong Deng, Jianping Shi, and Bolei Zhou. DrivingStereo: A Large-Scale Dataset for Stereo Matching in Autonomous Driving Scenarios. CVPR, 2019. 4, 5, 6
[53] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. CVPR, 2020. 1, 3, 4, 5, 6
[54] Xi Sun, Xinshuo Weng, and Kris Kitani. When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous. IROS, 2020. 5
[55] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-Object Tracking and Segmentation. CVPR, 2019. 6
[56] Peng Wang, Xinyu Huang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The ApolloScape Open Dataset for Autonomous Driving and its Application. TPAMI, 2019. 2, 4, 5, 6
[57] Sen Wang, Daoyuan Jia, and Xinshuo Weng. Deep Reinforcement Learning for Autonomous Driving. arXiv:1811.11329, 2018. 2
[58] Yongxin Wang, Kris Kitani, and Xinshuo Weng. Joint Object Detection and Multi-Object Tracking with Graph Neural Networks. arXiv:2006.13164, 2020. 5
[59] Zhixin Wang and Kui Jia. Frustum ConvNet: Sliding Frustums to Aggregate Local Point-Wise Features for Amodal 3D Object Detection. IROS, 2019. 3
[60] Xinshuo Weng and Kris Kitani. Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud. ICCVW, 2019. 3
[61] Xinshuo Weng and Kris Kitani. AutoSelect: Automatic and Dynamic Detection Selection for 3D Multi-Object Tracking. arXiv:2012.05894, 2020. 5
[62] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani. 3D Multi-Object Tracking: A Baseline and New Evaluation Metrics. IROS, 2020. 3, 5
[63] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nick Rhinehart. 4D Forecasting: Sequential Forecasting of 100,000 Points. ECCVW, 2020. 6
[64] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nick Rhinehart. Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting. CoRL, 2020. 5
[65] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris Kitani. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with 2D-3D Multi-Feature Learning. CVPR, 2020. 3
[66] Xinshuo Weng, Shangxuan Wu, Fares Beainy, and Kris Kitani. Rotational Rectification Network: Enabling Pedestrian Detection for Mobile Vision. WACV, 2018. 5
[67] Xinshuo Weng, Ye Yuan, and Kris Kitani. Joint 3D Tracking and Forecasting with Graph Neural Network and Diversity Sampling. arXiv:2003.07847, 2020. 5
[68] Shangxuan Wu and Xinshuo Weng. Image Labeling with Markov Random Fields and Conditional Random Fields. arXiv:1811.11323, 2018. 5
[69] Chen Xiaozhi, Ma Huimin, Wan Ji, Li Bo, and Xia Tian. Multi-View 3D Object Detection Network for Autonomous Driving. CVPR, 2017. 3
[70] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely Embedded Convolutional Detection. Sensors, 2018. 7
[71] Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. ICCV, 2019. 3
[72] Senthil Yogamani, Ciaran Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O’Dea, Michal Uricar, Stefan Milz, Martin Simon, Karl Amende, Christian Witt, Hazem Rashed, Sumanth Chennupati, Sanjaya Nayak, Saquib Mansoor, Xavier Perroton, and Patrick Perez. WoodScape: A Multi-Task, Multi-Camera Fisheye Dataset for Autonomous Driving. ICCV, 2019. 1
[73] Kai Zhang, Jiaxin Xie, Noah Snavely, and Qifeng Chen. Depth Sensing Beyond LiDAR Range. CVPR, 2020. 1, 3, 4