3D Object Reconstruction: A Comprehensive View-Dependent Dataset
Data in Brief
Data Article
Article history:
Received 13 March 2024
Revised 15 May 2024
Accepted 24 May 2024
Available online 2 June 2024

Dataset links:
A Comprehensive View-Dependent Dataset for Objects Reconstruction - Kinect Azure Set (Original data)
A Comprehensive View-Dependent Dataset for Objects Reconstruction - Synthetic Set Part A (Original data)
A Comprehensive View-Dependent Dataset for Objects Reconstruction - Synthetic Set Part B (Original data)

Keywords:
Robotics
RGB-D camera
Depth images
Single-view scene reconstruction
Scene segmentation
Grasping objects

Abstract
The dataset contains RGB, depth, and segmentation images of the scenes and information about the camera poses that can be used to create a full 3D model of the scene and develop methods that reconstruct objects from a single RGB-D camera view. Data were collected in a custom simulator that loads random graspable objects and random tables from the ShapeNet dataset. The graspable object is placed above the table in a random position. Then, the scene is simulated using the PhysX engine to make sure that the scene is physically plausible. The simulator captures images of the scene from a random pose and then takes a second image from the camera pose that is on the opposite side of the scene. The second subset was created using Kinect Azure and a set of real objects located on the ArUco board that was used to estimate the camera pose.

© 2024 The Authors. Published by Elsevier Inc.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)

∗ Corresponding author.
E-mail addresses: [email protected] (R. Staszak), [email protected] (D. Belter).
https://doi.org/10.1016/j.dib.2024.110569
2352-3409/© 2024 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
1. Value of the Data
• This dataset was collected in a controlled environment and provides ground truth RGB and
depth images of the scenes.
• The dataset also contains real RGB-D images from the Kinect Azure, together with the corresponding
camera poses, captured while the sensor was moving around the scene.
• The dataset also contains ground truth synthetic data related to object segmentation and
object position.
• The dataset can be used to reconstruct occluded parts of the objects and the scene.
• The dataset can also be used for scene segmentation, object detection, and pose estimation
using RGB-D images.
• The dataset might be used in practical tasks combining different aspects of object reconstruc-
tion and detection, domain adaptation, robotic manipulation, and synthetic-to-real transfer
learning.
2. Background
The motivation behind creating the dataset stems from the challenges that robots face in
perceiving and reconstructing objects from a single viewpoint. This limitation often leads to in-
complete shape information about the objects, negatively impacting the effectiveness of grasping
methods. The incomplete or partial object models limit the robotʼs ability to interact with ob-
jects in any given environment. To address this issue, we developed a dataset that contains RGB-
D images of the scene from the input pose and the pose on the opposite side of the scene. Both
RGB-D pairs of images and camera poses can be used to create a 3D model of the scene (point
cloud). The motivation for creating the dataset lies in improving the perceptual capabilities of
robots, specifically in the context of single-view object reconstruction, with the aim of enhanc-
ing their ability to interact with and manipulate objects in various environments. By creating
a dataset specifically tailored to the challenges posed by single-view object reconstruction, the
goal is to enable robots equipped with RGB-D cameras to overcome the limitations associated
with incomplete shape information [1] (Fig. 1).

Fig. 1. Example images from the dataset for a random scene. The resolution of the images is 640 × 480 (a–k) and 128 × 96 (l–u). See the text for a detailed explanation.
3. Data Description
The first subset of the dataset represents computer-generated scenes, where a single object,
which stands on a table in a random position, is captured from various viewpoints. The dataset
contains synthetic RGB-D images and corresponding 3D camera poses [2,3]. The pairs of RGB-D
images are generated for the random initial pose of the camera and the pose on the other side
of the scene. The images stored in the dataset are presented in Fig. 1, where we show the RGB
image for the random pose of the camera (a), the RGB image aligned with the depth camera
pose (b), the RGB image from the pose of the camera located on the opposite side of the scene
(c), the RGB image obtained by projecting the point cloud from the random camera pose onto the
pose of the camera located on the other side of the scene (d), the corresponding depth images
(e–h), and the corresponding segmentation images (i–k). We also store 128 × 96 px patches
cropped from the original scene that contain single objects (Fig. 1, l–u).

Fig. 2. Example RGB (a), object mask (b), and depth images (c) from the dataset of real objects.

Fig. 3. Illustration of the virtual camera pose (right) that is on the opposite side of the scene to the input view (left). The translation Tz is constant and the virtual surface of the mirror does not have to be in the center or aligned to the surface of the object.
The images stored in the dataset are grouped according to the category name. Inside each
folder, the images are identified according to their content (RGB, depth, segment, RGBprojected,
depthProjected). The number in the image name is the identifier of the camera pose. The
camPoses.csv file contains information about the camera poses used to collect the dataset. Each
image name contains the identifier of the camera pose, which corresponds to a single row in
the CSV file. In each row we store the identifier of the camera pose, the identifier of the initial
random camera pose, and the row-wise elements of the homogeneous transformation matrix related
to the camera pose. Moreover, for each scene, the dataset provides an objects.dat file that contains
the names of the objects in the images, their instance identifiers from the ShapeNet dataset, a
region identifier in the segmentation images, image coordinates, and the row-wise elements of the
homogeneous transformation matrix related to the object's pose.
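As a minimal loading sketch, the camera poses can be read as follows, assuming comma-separated rows that contain the two identifiers followed by the 16 row-wise elements of a 4 × 4 homogeneous matrix; the exact column layout should be verified against camPoses.csv.

```python
import csv
import numpy as np

def load_camera_poses(path="camPoses.csv"):
    """Parse camPoses.csv: each row is assumed to store a pose identifier, the
    identifier of the initial random camera pose, and 16 row-wise elements of a
    4x4 homogeneous transformation matrix. Delimiter and column order are
    assumptions; verify against the actual files."""
    poses = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 18:            # skip malformed or header lines
                continue
            pose_id, init_id = int(row[0]), int(row[1])
            T = np.array([float(v) for v in row[2:18]]).reshape(4, 4)
            poses[pose_id] = {"initial_pose_id": init_id, "T": T}
    return poses
```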
The second subset [4] utilizes a set of eight YCB objects [5] augmented with two additional
objects: a bottle and a wooden box. Single objects are placed on a 60 × 40 cm board with ArUco
markers, and multiple shots from different viewpoints are taken around them. To capture data,
the Kinect Azure DK sensor was employed due to its superior depth data quality and density
compared to other cameras [6]. Example images are presented in Fig. 2. The dataset contains the
following files:
- RGB images ("rgb_" prefix) – RGB images of the scene, resolution: 1920 × 1080,
- camera poses in the board frame ("board_" prefix) – homogeneous matrix in row-major order (see the loading sketch below),
- correction of the camera poses ("correction_" prefix) – homogeneous matrix in row-major order.
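A minimal sketch of how these pose files might be read and combined follows, assuming each file stores the 16 matrix elements as plain text in row-major order; the file names used here and the order in which the correction is composed with the board pose are assumptions to verify against the data.

```python
import numpy as np

def load_pose(path):
    """Read a homogeneous transformation stored as 16 numbers in row-major order.
    Whitespace- or comma-separated values are both handled (format assumption)."""
    with open(path) as f:
        values = [float(v) for v in f.read().replace(",", " ").split()]
    return np.array(values[:16]).reshape(4, 4)

# Hypothetical file names; combine the ArUco-based board pose with its
# depth-based correction. Whether the correction should pre- or post-multiply
# the board pose is an assumption; verify against the reconstructed point clouds.
T_board = load_pose("board_0001.txt")
T_corr = load_pose("correction_0001.txt")
T_refined = T_corr @ T_board
```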
4. Experimental Design, Materials and Methods
In Fig. 3 we illustrate the data acquisition process. The initial camera pose is located randomly
around the table. The inclination of the camera is randomly drawn from the range of 0.7–0.9 rad,
and the yaw angle is drawn from the range of −π to π. The position of the camera is computed
from these spherical coordinates assuming that the radius is equal to 2.5 m. The "mirror" pose of
the camera, shown on the right in Fig. 3, is computed relative to the plane associated with the
front surface of the object, represented by the blue plane in Fig. 3. We keep an approximately
constant distance between the virtual camera pose and the front surface by employing a constant
translation equal to 1.3 m, denoted as T_z. The mirror camera pose is derived by applying a
sequence of homogeneous transformations to the input camera pose:

T = T_f T_θ^{-1} T_{yaw=π} T_θ T_z^{-1},

where T_f is the transformation along the z-axis of the camera that depends on the distance
between the current camera pose and the front surface of the object, T_θ is the transformation
that compensates the current inclination (pitch angle) of the camera, and T_{yaw=π} is the
transformation that moves the camera to the other side of the scene, 1.3 m away from the front
surface of the object (the T_z transformation). Finally, the virtual camera pose depends only on the
tilt of the camera and the distance d_z between the camera and the front surface of the object.
The other parameters are fixed. Even though we assume that the virtual camera pose is 1.3 m
away from the front surface of the object, the distance may vary due to imprecise front surface
estimation.
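For illustration, a minimal numpy sketch of the above composition, assuming 4 × 4 camera-to-world pose matrices, pitch about the camera x-axis, and yaw about the camera y-axis; this is only one possible reading of the formula, not the simulator's code.

```python
import numpy as np

def translation_z(d):
    """Homogeneous translation of d along the local z-axis."""
    T = np.eye(4)
    T[2, 3] = d
    return T

def rotation_x(angle):
    """Homogeneous rotation by `angle` about the local x-axis (pitch/tilt)."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[1, 1], T[1, 2], T[2, 1], T[2, 2] = c, -s, s, c
    return T

def rotation_y(angle):
    """Homogeneous rotation by `angle` about the local y-axis (yaw)."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[0, 0], T[0, 2], T[2, 0], T[2, 2] = c, s, -s, c
    return T

def mirror_camera_pose(T_cam, d_z, tilt, t_z=1.3):
    """Compose T = T_f T_tilt^-1 T_yaw_pi T_tilt T_z^-1 and apply it in the local
    frame of the input camera pose. The axis conventions and the camera-to-world
    interpretation of T_cam are assumptions made for this illustration."""
    T_f = translation_z(d_z)      # move to the front surface of the object
    T_tilt = rotation_x(tilt)     # current camera inclination (pitch)
    T_yaw = rotation_y(np.pi)     # rotate to the opposite side of the scene
    T_z = translation_z(t_z)      # constant 1.3 m offset from the front surface
    T_rel = T_f @ np.linalg.inv(T_tilt) @ T_yaw @ T_tilt @ np.linalg.inv(T_z)
    return T_cam @ T_rel
```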
On the table, we place one random instance of an object selected from a set of categories
(bottle, camera, can, jar, laptop, mug, bowl, box) from the ShapeNet dataset [7]. The horizontal
position of the object is randomly selected from a uniform distribution in the range of −0.25
to 0.25 m. Upon selecting a table randomly for a given object, the data generation procedure
starts for the particular scene. Moreover, the scenes are constrained by a surface that represents
a floor. The geometrical size of the objects is scaled down by a factor of 0.25, while the tables
are scaled up by a factor of 1.5 to mimic natural proportions. The scene dimensions lie
between 3.0 m and 4.0 m, depending on the assumed viewpoint. Depth images are generated
using an OpenGL graphics engine and the pinhole camera model. The parameters of the camera
are based on the perfect camera model (fx = fy = 525, Cx = 320, Cy = 240). Moreover, we generate
a reference depth image and a depth image created under the assumption that the RGB and depth
cameras are not aligned, as in real RGB-D cameras. The translation between these cameras is equal
to tx = −0.051, ty = 0.001, and tz = −0.001, and these values were obtained by calibrating our
Kinect Azure camera.
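As a usage example, the synthetic depth images can be back-projected to a point cloud with the pinhole intrinsics given above; whether depth is stored in meters or millimeters, and the image size, are assumptions to check against the files.

```python
import numpy as np

# Synthetic-set intrinsics given in the text (perfect pinhole model, 640 x 480 assumed).
FX = FY = 525.0
CX, CY = 320.0, 240.0

def depth_to_point_cloud(depth):
    """Back-project a depth image (H x W array of metric depth values) into an
    organized point cloud in the camera frame. Whether the dataset stores depth
    in meters or millimeters is an assumption; rescale if needed."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    return np.stack([x, y, z], axis=-1)   # shape (H, W, 3)
```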
The subset containing images from the Kinect Azure (fx=912.37, fy=912.25, Cx=961.44,
Cy=548.29) was collected by moving the camera above a set of real objects located on the
60 × 40 cm ArUco board that was used to estimate the camera pose. To estimate the camera poses for
the objects from the subset of real objects, we used the ArUco board, which consists of a grid of
7 × 5 unique ArUco markers. We placed the object on the ArUco board and utilized the OpenCV
library methods [8] to estimate the poses of the camera located around the object and the markers.
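A minimal sketch of such a board-based pose estimate is given below. The ArUco interface differs across OpenCV versions (it was refactored in 4.7); this sketch uses the classic cv2.aruco functions from opencv-contrib-python together with cv2.solvePnP. The dictionary choice and the marker_object_points helper describing the board layout are assumptions, since the authors' marker size, spacing, and dictionary are not given in the text.

```python
import cv2
import numpy as np

# Kinect Azure intrinsics reported in the text.
K = np.array([[912.37, 0.0, 961.44],
              [0.0, 912.25, 548.29],
              [0.0, 0.0, 1.0]])
DIST = np.zeros(5)  # assumption: distortion negligible or already corrected

# Assumption: a dictionary large enough for the 7 x 5 board; the actual one may differ.
ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_100)

def estimate_board_pose(image, marker_object_points):
    """Estimate the board-to-camera pose from one RGB image.
    `marker_object_points(marker_id)` is a hypothetical helper returning the (4, 3)
    corner coordinates of that marker in the board frame (depends on the layout)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, ARUCO_DICT)
    if ids is None:
        return None
    obj_pts, img_pts = [], []
    for marker_corners, marker_id in zip(corners, ids.flatten()):
        obj_pts.append(marker_object_points(int(marker_id)))   # board-frame corners
        img_pts.append(marker_corners.reshape(4, 2))           # pixel corners
    obj_pts = np.concatenate(obj_pts).astype(np.float64)
    img_pts = np.concatenate(img_pts).astype(np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, DIST)
    if not ok:
        return None
    T = np.eye(4)                        # board-to-camera homogeneous transformation
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.flatten()
    return T
```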
The obtained views from multiple poses can be used to construct a 3D point cloud model
of the observed object, subsequently employing it to generate RGB and depth images based on
the provided camera pose. This process involves merging point clouds generated from various
viewpoints surrounding the observed objects. The precision of the acquired 3D model of the
object is heavily dependent on the accuracy of the camera pose estimation. Even a slight de-
viation can lead to sets of point clouds from different viewpoints that do not perfectly align,
especially when relying solely on RGB data. To address this, we utilize the Broyden–Fletcher–
Goldfarb–Shanno (BFGS) algorithm [9] to refine the board poses based on depth measurements.
The corners of detected ArUco markers are assigned to corresponding points in space based
on the depth data. Hence, it is possible to refine the initial board pose by minimizing the distance
between the known board layout and the corresponding points in space. Despite the initial
ArUco-based localization and the subsequent BFGS-based correction, the resulting point clouds may
not align flawlessly. Consequently, a manual alignment of the point clouds in 3D space is
performed to obtain the ground-truth models.
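A minimal sketch of such a depth-based refinement is shown below, assuming the correction is parameterized as a rotation vector plus translation applied on the left of the initial board pose; the cost follows the description above (distance between the known corner layout and the corners back-projected from depth), not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def refine_board_pose(T_init, board_corners, measured_points):
    """Refine an initial board-to-camera pose with BFGS.
    board_corners   : (N, 3) marker-corner coordinates in the board frame (known layout).
    measured_points : (N, 3) the same corners back-projected from the depth image,
                      expressed in the camera frame.
    The 6-DoF left-multiplied correction is an assumption of this sketch."""

    def cost(x):
        T_corr = np.eye(4)
        T_corr[:3, :3] = Rotation.from_rotvec(x[:3]).as_matrix()
        T_corr[:3, 3] = x[3:]
        T = T_corr @ T_init
        predicted = (T[:3, :3] @ board_corners.T).T + T[:3, 3]
        return np.sum((predicted - measured_points) ** 2)

    res = minimize(cost, np.zeros(6), method="BFGS")
    T_corr = np.eye(4)
    T_corr[:3, :3] = Rotation.from_rotvec(res.x[:3]).as_matrix()
    T_corr[:3, 3] = res.x[3:]
    return T_corr @ T_init
```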
The generation of views from arbitrary viewpoints is possible by merging selected samples that
are assigned to a particular object. The selection reduces the number of dataset views used for
the synthesis of a partial point cloud by comparing the cosine distance between the z-vector of
the newly defined camera viewpoint and the z-vectors of the dataset camera viewpoints. Then, the
data samples are sorted in descending order of the obtained values, and the first few occurrences
are used to obtain RGB and depth images for the given camera pose.
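A small sketch of this view selection follows, assuming that "descending order" means the most aligned z-axes come first; the function name and the number of selected views are illustrative.

```python
import numpy as np

def select_closest_views(T_query, dataset_poses, k=4):
    """Rank dataset views by the alignment of their camera z-axes with the z-axis
    of the requested viewpoint and return the indices of the k best matches.
    The cosine between z-vectors is used as the similarity (larger = more similar);
    tie handling and k are assumptions of this sketch."""
    z_query = T_query[:3, 2] / np.linalg.norm(T_query[:3, 2])
    scores = []
    for T in dataset_poses:
        z = T[:3, 2] / np.linalg.norm(T[:3, 2])
        scores.append(float(z_query @ z))        # cosine of the angle between z-axes
    order = np.argsort(scores)[::-1]             # descending: most similar first
    return order[:k]
```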
Limitations
The dataset focuses on depth data. The RGB images in the synthetic dataset are generated
using very simple renderers; they are intended for visualization purposes and should not be used
for training neural networks.
Ethics Statement
We confirm that the current work does not involve human subjects, animal experiments, or
any data collected from social media platforms.
CRediT Author Statement

Rafał Staszak: Software, Validation, Real-world Experiments, Reviewing and Editing. Dominik
Belter: Simulation software, Original draft preparation, Supervision.
Data Availability

A Comprehensive View-Dependent Dataset for Objects Reconstruction - Synthetic Set Part A (Original data) (Mendeley Data).
A Comprehensive View-Dependent Dataset for Objects Reconstruction - Synthetic Set Part B (Original data) (Mendeley Data).
A Comprehensive View-Dependent Dataset for Objects Reconstruction - Kinect Azure Set (Original data) (Mendeley Data).
Acknowledgments
The work was supported by the National Science Centre, Poland, under research project no
UMO-2019/35/D/ST6/03959.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.
References
[1] R. Staszak, B. Kulecki, W. Sempruch, D. Belter, What's on the other side? A single-view 3D scene reconstruction, in: 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2022, pp. 173–180.
[2] R. Staszak, D. Belter, A comprehensive view-dependent dataset for objects reconstruction - synthetic set part A,
Mendeley Data V2 (2024), doi:10.17632/z88tpm3926.2.
[3] R. Staszak, D. Belter, A comprehensive view-dependent dataset for objects reconstruction - synthetic set part B,
Mendeley Data V3 (2024), doi:10.17632/hy9wnbhr9w.3.
[4] R. Staszak, D. Belter, A comprehensive view-dependent dataset for objects reconstruction - kinect azure set, Mende-
ley Data V2 (2024), doi:10.17632/jd8w5r3ncw.2.
[5] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, A.M. Dollar, Yale-CMU-Berkeley dataset
for robotic manipulation research, Int. J. Rob. Res. 36 (3) (2017) 261–268.
[6] M. Tölgyessy, M. Dekan, L. Chovanec, P. Hubinský, Evaluation of the Azure Kinect and its comparison to Kinect v1 and Kinect v2, Sensors 21 (2) (2021) 413.
[7] A.X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, F. Yu, ShapeNet: an information-rich 3D model repository, arXiv preprint arXiv:1512.03012, 2015.
[8] G. Bradski, The OpenCV library, Dr. Dobb's J.: Softw. Tools Prof. Program. 25 (11) (2000) 120–123.
[9] J. Nocedal, S.J. Wright, Numerical Optimization, Springer, 2006.