Autonomous Robot Operation
and Technology
Mechanical Engineering
Special Project Report
01/10/2023
1. Introduction
Figure 1 illustrates the proposed algorithm for the task. The architecture is composed of two main phases. In the first stage, the fixed-camera image is analyzed in search of the object of interest in the workspace using a pre-trained CNN. The neural network returns a set of possible targets with their confidence scores and bounding-box corner coordinates. The center coordinates of the highest-confidence target are converted into real-world robot Cartesian coordinates, which are subsequently sent to the manipulator through Ethernet communication. The second phase starts as the end effector is brought close to the object, with the in-hand camera detecting the orientation of the ChAruco board on which the object stands. This orientation is converted to roll-pitch-yaw angles and is used to position the end effector normal to the ChAruco board surface. Finally, the ChAruco board itself is used to acquire the depth of the target, and the object position is detected by a CNN model, allowing the robot's gripper to grasp the target.
For the first stage of the algorithm, the approach to the target, object detection techniques are employed. Object detection is the task of detecting instances of objects of a specific class within an image or video [1]. It locates the objects present in an image and encloses them inside bounding boxes with their corresponding types or class labels attached.
Object detection algorithms combine two tasks: image classification and object localization.
Image classification algorithms predict the class or type of an object in the image based on a predefined set of classes on which the algorithm was previously trained. For example, given an image with a single object as input, as seen in Figure 2, the output generated will be the class or label of the corresponding object and the probability of the prediction.
Object localization algorithms enclose an object in the image within a bounding box. Again, we have an image with one or more objects as input. However, this time the output will be the locations of the bounding boxes, given by their position, height, and width. The differences between these tasks can be appreciated in the figure below.
Figure 2. Differences between image classification, object localization, and object detection,
respectively.
The problem of detecting and localizing the object can be solved using object detection algorithms such as R-CNN [2], Fast R-CNN [3], or YOLO [4]. In the present work, a variation of the YOLO network is employed to perform the aforementioned task. YOLO stands for You Only Look Once and is one of the most popular models used in object detection and computer vision. This algorithm uses a neural network-based approach to make predictions on the input images, achieving high accuracy at higher speeds than other approaches.
YOLO combines the separate components of object detection into a single neural network. The network predicts each bounding box using features from the entire image. Additionally, it simultaneously predicts all bounding boxes for an image across all classes. This implies that the network considers the entire image and all its objects when making decisions. The YOLO design maintains high average precision while enabling end-to-end training and real-time speeds. The system divides the input image into an S x S grid. A grid cell is responsible for detecting an object if the object's center falls within that cell. Each grid cell predicts B bounding boxes and their corresponding confidence scores. These confidence scores reflect how confident the model is that the box contains an object and how accurate it believes the predicted box to be. The confidence score should be zero if there is no object present in that cell; otherwise, the desired confidence score is given by the intersection over union (IOU) between the predicted box and the ground truth (Figure 3). A simplified diagram of the overall process can be appreciated in Figure 4 [4].
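To make the IOU term concrete, the sketch below computes it for two boxes given in corner format (x1, y1, x2, y2); the corner parametrization and the function name are assumptions for illustration, since YOLO internally predicts boxes as center, width, and height.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```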
After labeling, the dataset was divided into 324 images for training and 40 images for validation. The training was carried out using the YOLOv5 custom training notebook available on Google Colab [5]. The performance of the trained model is measured by mAP, or mean Average Precision. mAP is the average of the Average Precision metric across all classes in a model; it can be used to compare both different models on the same task and different versions of the same model, and it is measured between 0 and 1 [8]. The following chart summarizes the results of our model.
The model was loaded using the method in [9], which retrieves the detected classes together with their confidence scores and bounding-box coordinates.
With the coordinates obtained from the model, we were able to draw the bounding boxes of the objects and roughly find the position of the centroid in pixels. Then, the position of the centroid is transformed into real-world coordinates with a technique described later in this paper. An overall visual representation of the data obtained can be seen in the figure below; a minimal code sketch of this detection step follows.
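The sketch below loads custom YOLOv5 weights through PyTorch Hub, keeps the highest-confidence detection, and marks its bounding box and pixel centroid. The file names ('best.pt', 'workspace.jpg') are assumptions for illustration, not the exact code used in this work.

```python
import cv2
import torch

# Load the custom-trained YOLOv5 weights through PyTorch Hub (assumed file name).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

image = cv2.imread("workspace.jpg")                   # frame from the fixed camera
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)          # YOLOv5 expects RGB input
detections = model(rgb).xyxy[0].cpu().numpy()         # rows: x1, y1, x2, y2, conf, class

if len(detections) > 0:
    # Keep the highest-confidence detection and compute its centroid in pixels.
    x1, y1, x2, y2, conf, cls = detections[detections[:, 4].argmax()]
    u, v = int((x1 + x2) / 2), int((y1 + y2) / 2)
    cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
    cv2.circle(image, (u, v), 4, (0, 0, 255), -1)
```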
Once the center of the object of interest is detected in the first stage, the pixel coordinates given by the fixed camera need to be converted into real-world measurements in robot coordinates. We may represent the projection from 3D points in the world to 2D points in the image plane of a camera as:
$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & p_x & 0 \\ 0 & f_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R^{C}_{W} & t^{C}_{W} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}
\qquad (1)
$$
where u and v are the pixel coordinates given by the camera, and $f_x$, $f_y$, $p_x$, and $p_y$ are the focal lengths of the camera along the x- and y-axes and the principal point coordinates along the x- and y-axes, respectively. All the parameters inside this matrix are called the intrinsic camera parameters and are known from a previous camera calibration using the method explained in [10]. $\begin{bmatrix} R^{C}_{W} & t^{C}_{W} \\ 0 & 1 \end{bmatrix}$ represents the extrinsic camera parameters, a linear transformation from world coordinates to camera coordinates; a variation of the method presented in [11] is used in this work to acquire this transformation. Finally, $X_W$, $Y_W$, and $Z_W$ represent world coordinates. Since the camera is calibrated and its pose and height above the table are fixed, the equation has only two degrees of freedom: only $X_W$ and $Y_W$ can vary when the object moves around the table. Thus, mapping a (u, v) pixel coordinate to a real-world $(X_W, Y_W, Z_W)$ coordinate on the table is straightforward; the result of the transformation can be seen in Figure 7.
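A minimal sketch of this planar mapping is shown below, assuming the intrinsics and extrinsics of the fixed camera are available as NumPy arrays and that the table defines the plane $Z_W = z_{table}$; the function and variable names are illustrative, not the exact implementation used in this work.

```python
import numpy as np

def pixel_to_table(u, v, K, R, t, z_table=0.0):
    """Map a pixel (u, v) to world coordinates (X_W, Y_W, z_table) on the table plane.

    K       : 3x3 intrinsic matrix [[fx, 0, px], [0, fy, py], [0, 0, 1]]
    R, t    : extrinsic rotation (3x3) and translation (3,), world -> camera
    z_table : known height of the table plane in world coordinates
    """
    P = K @ np.hstack([R, t.reshape(3, 1)])           # 3x4 projection matrix, as in Eq. (1)
    # On the plane Z_W = z_table the projection collapses to a 3x3 homography.
    H = np.column_stack([P[:, 0], P[:, 1], P[:, 2] * z_table + P[:, 3]])
    xy1 = np.linalg.solve(H, np.array([u, v, 1.0]))   # homogeneous solution, up to scale
    xy1 /= xy1[2]                                      # normalize the homogeneous coordinate
    return np.array([xy1[0], xy1[1], z_table])
```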
The purpose of the hand-eye calibration problem is to find the transformation between the coordinate system of the in-hand camera and the robot coordinate system. The essence is to solve the problem AX = XB (see Figure 8 for reference), where X is the transformation from the camera coordinate system to the robot coordinate system. As shown in the formula below, this transformation can be solved using the robot pose transformations and the camera pose transformations with respect to the target.
$$
\left(T^{W}_{g(2)}\right)^{-1} T^{W}_{g(1)}\, T^{g}_{c} = T^{g}_{c}\, T^{c(2)}_{t} \left(T^{c(1)}_{t}\right)^{-1}
\qquad (2)
$$

where $T^{W}_{g(i)}$ is the pose of the gripper in the robot (world) frame at configuration $i$ and $T^{c(i)}_{t}$ is the pose of the target in the camera frame at configuration $i$. Then let $A = \left(T^{W}_{g(2)}\right)^{-1} T^{W}_{g(1)}$, $B = T^{c(2)}_{t} \left(T^{c(1)}_{t}\right)^{-1}$, and $X = T^{g}_{c}$; finally:

$$AX = XB$$
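In practice, this AX = XB problem can be solved with OpenCV's hand-eye calibration routine. The sketch below is one possible implementation under that assumption and is not necessarily the exact procedure used in this work; the input pose lists and variable names are illustrative.

```python
import cv2
import numpy as np

def solve_hand_eye(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Solve AX = XB for the camera-to-gripper transformation X.

    R_gripper2base, t_gripper2base : gripper poses in the robot base frame (from the controller)
    R_target2cam, t_target2cam     : ChAruco board poses in the camera frame (from PnP)
    collected over several robot configurations.
    """
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI,
    )
    # Assemble the homogeneous transformation X = T_c^g (camera -> gripper).
    X = np.eye(4)
    X[:3, :3] = R_cam2gripper
    X[:3, 3] = t_cam2gripper.ravel()
    return X
```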
Once the manipulator brings the hand-eye camera close to the target, the ChAruco board below the target is detected and its pose is estimated using Perspective-n-Point (PnP). Since the object of interest can be at any point on the ChAruco surface, and its frame orientation is at fixed angles to the ChAruco, we are only interested in the surface orientation of the ChAruco; the object position is acquired with our trained CNN model. Once the ChAruco orientation is acquired by solving the PnP problem, we obtain the rotation between the hand-eye camera frame and the target frame, $R^{T}_{C}$, and we transform this rotation to relate the gripper (end-effector) frame with the target frame through the expression $R^{T}_{G} = R^{T}_{C}\, R^{C}_{G}$. Since we cannot move the robot in end-effector frames, we have to apply the so-called similarity transform, which converts a given linear transformation expressed in the camera frame into the same linear transformation expressed in the robot world frame; the similarity transform is expressed in (3). Once the rotation matrix that positions the gripper normal to the surface of the ChAruco is obtained, we parametrize it with the roll-pitch-yaw representation.
$$
R_{w} = \left(R^{G}_{W}\right)^{-1} R^{G}_{C}\, R_{W}
\qquad (3)
$$
Figure 9. Linear transformations between frames.
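As an illustration of the pose-estimation step, the sketch below uses OpenCV's aruco module to recover the board-to-camera rotation. The board geometry, marker dictionary, and function names follow the pre-4.7 OpenCV ChAruco API and are assumptions, not the exact configuration used in this work.

```python
import cv2
import numpy as np

# Assumed board definition: 5x7 squares, 40 mm squares, 30 mm markers.
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_100)
board = cv2.aruco.CharucoBoard_create(5, 7, 0.04, 0.03, dictionary)

def charuco_rotation(image, K, dist):
    """Return the board-to-camera rotation matrix, or None if the board is not seen."""
    corners, ids, _ = cv2.aruco.detectMarkers(image, dictionary)
    if ids is None:
        return None
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, image, board)
    if ch_ids is None or len(ch_ids) < 4:
        return None
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, K, dist, None, None)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation of the board frame expressed in the camera frame
    return R
```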
Once the gripper is normal to the surface of the ChAruco, as shown in Figure 10, the CNN model is used again to detect the position of the target in pixels. Using the previously introduced method, the same mapping is done from (u, v) pixel coordinates into $(X_C, Y_C, Z_C)$ camera coordinates, where the $Z_C$ component is extracted from the ChAruco board. The linear transformation from camera coordinates into world coordinates is now used, as shown below:
$$P_{W} = T^{W}_{C}\, P_{C}$$
$$P_{W} = T^{W}_{G}\, T^{G}_{C}\, P_{C}$$
$P_{C}$ represents the camera coordinates of the target, $P_{W}$ the world coordinates of the target, and $T^{G}_{C}$ is the homogeneous transformation from camera coordinates to end-effector coordinates, which is known after solving the hand-eye calibration problem.
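A minimal sketch of this final chain is shown below, assuming all transforms are available as 4x4 homogeneous NumPy matrices (the gripper-to-world pose from the controller and the camera-to-gripper transform from the hand-eye calibration); the names are illustrative.

```python
import numpy as np

def camera_point_to_world(p_cam_xyz, T_gripper_to_world, T_cam_to_gripper):
    """Apply P_W = T_G^W  T_C^G  P_C to a target point expressed in camera coordinates."""
    p_cam = np.append(np.asarray(p_cam_xyz, dtype=float), 1.0)   # homogeneous point
    p_world = T_gripper_to_world @ T_cam_to_gripper @ p_cam
    return p_world[:3]
```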
Figure 10. Orientation of the gripper normal to the surface and reposition towards the object of interest.
Since the orientation and coordinates of the object are now known, the object
can be grasped and moved without much trouble.
The computer and the iRX6 digital servo controller (robot controller) are linked by Ethernet communication. The computer runs a Python program that implements the object detection model and the 2D pose estimation algorithm and interacts with external devices such as the fixed and hand-eye cameras, the 6-DOF robot manipulator, and the gripper. The iRX6 receives commands or messages from the computer to drive the manipulator towards the target with the desired pose to perform the grasp. Figure 11 shows the overall interface between the computer and the robot controller; a minimal sketch of such a command exchange follows.
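The sketch below sends a Cartesian target pose to the controller over a TCP socket. The IP address, port, and message format are assumptions made for illustration only; they are not the documented protocol of the iRX6 controller.

```python
import socket

CONTROLLER_IP = "192.168.0.10"   # hypothetical controller address
CONTROLLER_PORT = 5000           # hypothetical command port

def send_target_pose(x, y, z, roll, pitch, yaw):
    """Send a Cartesian target pose (mm / degrees) to the robot controller and wait for a reply."""
    message = f"MOVE {x:.1f} {y:.1f} {z:.1f} {roll:.2f} {pitch:.2f} {yaw:.2f}\n"
    with socket.create_connection((CONTROLLER_IP, CONTROLLER_PORT), timeout=5.0) as sock:
        sock.sendall(message.encode("ascii"))
        reply = sock.recv(1024)  # controller acknowledgement
    return reply.decode("ascii", errors="replace")
```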
As seen in Figure 11, the manipulator initially awaits instructions from the computer. Once the image from the fixed camera is processed and the object is detected, its robot coordinates are acquired and sent to the robot servo controller to approach the target. Immediately after, the ChAruco board pose is detected from the in-hand camera image and used to orient the end-effector. Then, the 3D position of the object is detected, with the depth acquired by solving the PnP problem using landmarks from the ChAruco board. Since the pose of the object is now fully defined, the manipulator is able to grasp the target and place it at the desired position. Finally, the robot arm returns to its initial position, ending the process.
4. Conclusion
The use of the ChAruco board alongside CNN object detection proved feasible for performing pick-and-place tasks. However, the sequence of motions is not smooth enough to compete with methods that use depth-sensing devices, such as the one in [13].
In the near future we hope to update the method presented in this project so that it requires less intrusive landmarks near the target, enhancing its versatility and range of applications while retaining its performance and low cost.
References
[4] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You only look once:
Unified, real-time object detection," 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[10] Z. Zhang, "A flexible new technique for camera calibration," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11,
2000.
[11] G. An, S. Lee, M.-W. Seo, K. Yun, W.-S. Cheong and S.-J. Kang,
"Charuco board-based omnidirectional camera calibration method,"
Electronics, vol. 7, no. 12, 2018.
[12] T. A. Myhre, "Robot camera calibration," [Online]. Available:
https://www.torsteinmyhre.name/snippets/robcam_calibration.html.
[13] T.-T. Le, T.-S. Le, Y.-R. Chen, J. Vidal and C.-Y. Lin, "6D pose estimation
with combined deep learning and 3D vision techniques for a fast and accurate
object grasping," Robotics and Autonomous Systems, vol. 141, 2021.