
National Taiwan University of Science and Technology
Mechanical Engineering
Special Project Report

Student Numbers: F11003106, F11003115, F11003114, F11003108

Robot Arm Omnidirectional Autonomous Pick and Place Task using 2D Vision

Names: Diego Correa
Enrique Miranda
Gonzalo Miltos
Mauricio Cristaldo
Advisor: Chyi-Yeu Lin

01/10/2023
1. Introduction

Through the development of Artificial Intelligence, industrial robotic arms have become increasingly capable of performing tasks ranging from the simplest to the most demanding, in environments from the safest to the most complex.

6D pose estimation for pick-and-place tasks has been a vital research interest in robotics since its conception. An accurate and fast algorithm for this job is still to be found. Moreover, most vision systems depend on a depth perception device that is costly, and the technology is still developing.

This research aims to support the development of image-based visual servoing techniques for industrial and service robots by creating a practical framework for pseudo-6D pose estimation in a pick-and-place algorithm using eye-in-hand and eye-to-hand configurations. The main idea developed in this work is to place a ChAruco board alongside the object to be picked so that the in-hand camera can detect the object's orientation and depth, while Convolutional Neural Networks (CNNs) combined with homography provide its position. If the object to be picked and the ChAruco board are at fixed angles to each other, this is equivalent to an in-hand 6D pose estimation and allows for accurate and fast grasping.
2. Method

Figure 1 illustrates the proposed algorithm for the task. The architecture is composed of two main phases. In the first stage, the fixed-camera image is analyzed in search of the object of interest in the workspace using a pre-trained CNN. The neural network returns a set of possible targets with their confidence scores and bounding-box corner coordinates. The center coordinates of the highest-confidence target are converted into real-world robot Cartesian coordinates, which are subsequently sent to the manipulator through Ethernet communication. The second phase starts as the end effector is brought close to the object, having the in-hand camera detect the orientation of the ChAruco board on which the object stands. This orientation is converted to roll-pitch-yaw angles and is used to position the end effector normal to the ChAruco board surface. Finally, the ChAruco itself is used to acquire the depth of the target, and the object position is detected by a CNN model, allowing the robot's gripper to grasp the target.

Figure 1. Proposed algorithm architecture.


2.1. Object Detection Problem

For the first stage of the algorithm, the approach to the target, object detection techniques are employed. Object detection is the task of detecting instances of objects of a specific class within an image or video [1]. It locates objects that exist in an image and encloses them inside bounding boxes with their corresponding types or class labels attached.

Algorithms for object detection combine two tasks: image classification and object localization.

Image classification algorithms predict the class or type of an object in the image based on a predefined set of classes on which the algorithm was previously trained. For example, given an image with a single object as input, as seen in Figure 2, the output generated will be a class or label of the corresponding object and the probability of the prediction.

Object localization algorithms enclose an object in the image within a bounding box.
Again, we have an image with one or more objects as input. However, this time the output
will be the location of the bounding boxes using their position, height, and width. The
differences between them can be appreciated in the figure below.

Figure 2. Differences between image classification, object localization, and object detection,
respectively.

The problem of detecting and localizing the object can be solved using object detection algorithms such as R-CNN [2], Fast R-CNN [3], or YOLO [4]. In the present work, a variation of the YOLO network is employed to perform the aforementioned task. YOLO stands for You Only Look Once and is one of the most popular models used in object detection and computer vision. This algorithm uses a neural-network-based approach to make predictions on the input images, achieving high accuracy at greater speed than other approaches.

2.2. YOLO mechanism

The many components of object detection are combined into one neural network by YOLO. The network predicts each bounding box using features from the entire image. Additionally, it simultaneously predicts all bounding boxes for an image across all classes. This implies that the network considers the entire image and all its objects when making decisions. The YOLO design maintains excellent average precision while enabling end-to-end training and real-time speeds. The input image is divided into an S x S grid by the system. A grid cell detects an object if the object's center falls within that grid cell. Each grid cell predicts B bounding boxes and their corresponding confidence scores. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the predicted box is. The confidence score should be zero if there is no object present in that cell. Otherwise, the desired confidence score is given by the intersection over union (IOU) between the predicted box and the ground truth (Figure 3). A simplified diagram of the overall process can be appreciated in Figure 4 [4].

Figure 3. IOU definition.


Figure 4. Simplified process of detection of objects by YOLO model. Image from [4].

Each bounding box is composed of five predictions: x, y, w, h, and confidence. The coordinates of the center of the box relative to the bounds of the grid cell are represented by "x" and "y". The width and height, denoted as "w" and "h" respectively, are predicted relative to the whole image. Finally, the confidence prediction represents the Intersection Over Union (IOU) between the predicted box and any ground truth box. Each grid cell also predicts C conditional class probabilities. These probabilities are conditioned on the grid cell containing an object. The network is only capable of predicting one set of class probabilities per grid cell, regardless of the number of boxes B [4].
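As a brief illustration of the IOU metric described above, the following minimal Python sketch computes the intersection over union of two boxes given as (x1, y1, x2, y2) corner coordinates; the function name and box format are our own illustrative choices, not part of YOLO itself.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Intersection area (zero if the boxes do not overlap)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union = sum of the two areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Example: a predicted box that partially overlaps the ground truth
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```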
2.3. Creation of the computer mouse model
The target chosen to perform the pick and place task is a computer mouse. The process
of creating a model was performed using YOLOv5, which is one of the most recent
versions of the YOLO family [5].
The procedure began with the collection of the data to form the dataset. More than 150 pictures of the object were taken. Data augmentation was then performed using the Roboflow augmentation tool [6], yielding a dataset of 364 images. The next step was labeling the bounding boxes of the class; Make Sense AI [7] was the tool used to accomplish this task (Figure 5).
Figure 5. Screen capture of Make Sense AI tool.

After labeling, the dataset was divided into 324 images for training and 40 images for validation. The training was carried out using the YOLOv5 custom training notebook available for Google Colab [5]. The performance of the trained model is measured by mAP, or mean Average Precision. mAP is equal to the average of the Average Precision metric across all classes in a model. mAP can be used to compare both different models on the same task and different versions of the same model. mAP is measured between 0 and 1 [8]. The following chart summarizes the results of our model.

Figure 6. Chart with mAP score of the training.
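For reference, a comparable training run can be launched from the YOLOv5 custom training notebook with the standard train.py script; the dataset file, image size, batch size, epoch count, and run name below are illustrative placeholders rather than the exact settings used for this model.

```python
# Notebook cell run inside a clone of the ultralytics/yolov5 repository (e.g. on Google Colab).
# dataset.yaml, image size, batch size, epochs, and run name are illustrative placeholders.
!python train.py \
    --img 640 \
    --batch 16 \
    --epochs 100 \
    --data dataset.yaml \
    --weights yolov5s.pt \
    --name mouse_detector
```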


2.4. The model in action

The model was loaded using the method described in [9]. This method retrieves:

• ID for the object.


• Upper left and lower right corners of bounding boxes (in pixels).
• Confidence number.
• Class number.
• Class name.

With the coordinates obtained from the model, we were able to draw the bounding boxes of the objects and roughly find the position of the centroid in pixels. The position of the centroid is then transformed to real-world coordinates with a technique described later in this paper. An overall visual representation of the data obtained can be seen in the figure below.

Figure 7. The model in action.
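The sketch below shows how a custom YOLOv5 model can be loaded through PyTorch Hub, following the approach referenced in [9], and how the detection fields listed above can be read back; the weight-file name, test image, and confidence threshold are illustrative assumptions.

```python
import cv2
import torch

# Load the custom-trained YOLOv5 model through PyTorch Hub (approach referenced in [9]).
# 'best.pt' and 'workspace.jpg' are placeholders for the trained weights and a fixed-camera image.
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.5                               # confidence threshold (illustrative value)

frame = cv2.imread('workspace.jpg')
results = model(frame[:, :, ::-1])             # convert BGR to RGB before inference

# Each row contains: xmin, ymin, xmax, ymax, confidence, class, name
detections = results.pandas().xyxy[0]
if not detections.empty:
    best = detections.sort_values('confidence', ascending=False).iloc[0]
    u = (best.xmin + best.xmax) / 2            # bounding-box centroid in pixels
    v = (best.ymin + best.ymax) / 2
    print(best['name'], best.confidence, (u, v))
```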

2.5. 2D pose estimation problem

Once the center of the object of interest is detected in the first stage, the pixel coordinates given by the fixed camera need to be converted into real-world measurements in robot coordinates. The projection from 3D points in the world to 2D points in the image plane of a camera can be represented as:
$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & p_x & 0 \\ 0 & f_y & p_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_W \\ Y_W \\ Z_W \\ 1 \end{bmatrix}
\tag{1}
$$

where $u$ and $v$ are the pixel coordinates given by the camera, and $f_x$, $f_y$, $p_x$, and $p_y$ are the focal lengths of the camera along the x- and y-axes and the principal point coordinates along the x- and y-axes, respectively. All the parameters inside this matrix are called the intrinsic camera parameters and are known from a previous camera calibration using the method explained in [10]. $\begin{bmatrix} R_W^C & t_W^C \\ 0 & 1 \end{bmatrix}$ represents the extrinsic camera parameters, a linear transformation from world coordinates to camera coordinates; a variation of the method presented in [11] is used in this work to acquire this transformation. Finally, $X_W$, $Y_W$, and $Z_W$ represent world coordinates. Since the camera is calibrated and its pose and height above the table are fixed, the equation has only 2 degrees of freedom: only $X_W$ and $Y_W$ can vary as the object moves around the table. Thus, the process of mapping a $(u, v)$ pixel coordinate to a real-world $(X_W, Y_W, Z_W)$ coordinate on the table is straightforward; the result of the transformation can be seen in Figure 7.
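A minimal sketch of this mapping is given below, assuming a calibrated camera with known intrinsics and a known world-to-camera transform; since the table height fixes $Z_W$, the ray through the pixel is simply intersected with the table plane. The function and variable names are our own and are not the calibration results of this project.

```python
import numpy as np

def pixel_to_world(u, v, K, R_wc, t_wc, z_world=0.0):
    """Map a pixel (u, v) to world coordinates on the table plane Z_W = z_world.

    K          : 3x3 intrinsic matrix [[fx, 0, px], [0, fy, py], [0, 0, 1]]
    R_wc, t_wc : rotation and translation from the world frame to the camera frame
    """
    # Viewing-ray direction in camera coordinates for the given pixel
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Express the camera center and the ray in world coordinates
    R_cw = R_wc.T
    cam_center_w = -R_cw @ t_wc
    ray_w = R_cw @ ray_cam

    # Intersect the ray with the table plane Z_W = z_world
    s = (z_world - cam_center_w[2]) / ray_w[2]
    return cam_center_w + s * ray_w
```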

2.6. Hand-eye calibration problem

The purpose of the hand-eye calibration problem is to find the transformation between the coordinate system of the in-hand camera and the robot coordinate system. The essence is to solve the problem $AX = XB$ (see Figure 8 for reference), where $X$ is the transformation from the camera coordinate system to the robot coordinate system. As shown in the formulas below, the transformation can be solved using the robot pose transformation and the camera pose transformation from the target.

$$T_g^{W(1)}\, T_c^{g}\, T_t^{c(1)} = T_g^{W(2)}\, T_c^{g}\, T_t^{c(2)}$$

$$\left(T_g^{W(2)}\right)^{-1} T_g^{W(1)}\, T_c^{g} = T_c^{g}\, T_t^{c(2)} \left(T_t^{c(1)}\right)^{-1} \tag{2}$$

Then let $A = \left(T_g^{W(2)}\right)^{-1} T_g^{W(1)}$, $B = T_t^{c(2)} \left(T_t^{c(1)}\right)^{-1}$, and $X = T_c^{g}$; finally:

$$AX = XB$$

According to this mathematical model, solving for the transformation from the camera coordinate system to the robot coordinate system requires us to establish an end-effector frame, which is done by calculating the Tool Center Point (TCP) of the end effector. A target is also required; for this experiment we use a ChAruco board as the target because it is easy to detect, combining the flexibility of an ArUco marker with the precision of a normal checkerboard commonly used for calibration.

Figure 8. The eye-in-hand problem. Image adapted from [12].
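In practice, the AX = XB problem can be solved with OpenCV's hand-eye calibration routine once several pairs of robot and camera poses have been collected; the sketch below assumes that lists of gripper-to-base and target-to-camera rotations and translations are already available, and it is not the exact code used in this project.

```python
import cv2
import numpy as np

# R_gripper2base / t_gripper2base : end-effector poses in the robot base frame
# R_target2cam   / t_target2cam   : ChAruco board poses in the in-hand camera frame
# (both lists are assumed to have been collected over several robot configurations)
R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
    R_gripper2base, t_gripper2base,
    R_target2cam, t_target2cam,
    method=cv2.CALIB_HAND_EYE_TSAI)

# X: homogeneous transformation from the in-hand camera frame to the gripper frame
X = np.eye(4)
X[:3, :3] = R_cam2gripper
X[:3, 3] = t_cam2gripper.ravel()
print(X)
```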

2.7. 6D pose estimation and object grasping

Once the manipulator brings the hand-eye camera close to the target, the ChAruco board below the target is detected and its pose is estimated by solving the Perspective-n-Point (PnP) problem. Since the object of interest can be at any point on the ChAruco surface, and its frame orientation is at fixed angles to the ChAruco, we are only interested in the surface orientation of the ChAruco; the object position is acquired with our trained CNN model. Once the ChAruco orientation is obtained by solving the PnP problem, we have the rotation $R_C^T$ between the hand-eye camera frame and the target frame. We transform this rotation to relate the gripper (end-effector) frame to the target frame through the expression $R_G^T = R_C^T\, R_G^C$. Since we cannot move the robot in end-effector frames, we apply the so-called similarity transform, which converts a given linear transformation expressed in the camera frame into the same linear transformation expressed in the world (robot) frame; the similarity transform is expressed in (3). Once the rotation matrix that positions the gripper normal to the surface of the ChAruco is obtained, we parametrize it with the roll-pitch-yaw representation.

$$R_W = \left(R_W^G\right)^{-1} R_C\, R_W^G \tag{3}$$
Figure 9. Linear transformations between frames.
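A sketch of how the ChAruco orientation can be obtained with OpenCV's ArUco module and converted to roll-pitch-yaw angles is shown below; the board geometry and dictionary are illustrative assumptions, and the function names follow the legacy cv2.aruco API (newer OpenCV versions rename some of these calls).

```python
import cv2
import numpy as np
from scipy.spatial.transform import Rotation

# Illustrative board definition: 5x7 squares, 4 cm squares, 3 cm markers
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
board = cv2.aruco.CharucoBoard_create(5, 7, 0.04, 0.03, aruco_dict)

def charuco_rpy(image, K, dist):
    """Return roll-pitch-yaw (degrees) of the ChAruco board in the camera frame, or None."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        return None
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, board)
    rvec, tvec = np.zeros(3), np.zeros(3)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, K, dist, rvec, tvec)
    if not ok:
        return None
    R_ct, _ = cv2.Rodrigues(rvec)                 # board orientation from the PnP solution
    return Rotation.from_matrix(R_ct).as_euler('xyz', degrees=True)
```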

Once the gripper is normal to the surface of the ChAruco, as shown in Figure 10, the CNN model is used again to detect the position of the target in pixels. Using the previously introduced method, the same mapping is done from $(u, v)$ pixel coordinates into $(X_C, Y_C, Z_C)$ camera coordinates, where the $Z_C$ component is extracted from the ChAruco board. The linear transformation from camera coordinates into world coordinates is then applied, as shown below:

$$P_W = T_C^W\, P_C$$

$$P_W = T_G^W\, T_C^G\, P_C$$

$P_C$ represents the camera coordinates of the target, $P_W$ the world coordinates of the target, and $T_C^G$ is the homogeneous transformation from camera coordinates to end-effector coordinates, which is known after solving the hand-eye calibration problem.
Figure 10. Orientation of the gripper normal to the surface and repositioning towards the object of interest.

Since the orientation and coordinates of the object are now known, the object
can be grasped and moved without much trouble.
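A short sketch of this final chain of transformations is given below, assuming 4x4 homogeneous matrices for the gripper pose in the world frame (read back from the robot controller) and for the camera-to-gripper transform found during hand-eye calibration; all names are illustrative.

```python
import numpy as np

def camera_point_to_world(p_cam, T_world_gripper, T_gripper_cam):
    """Compute P_W = T_G^W * T_C^G * P_C for a 3D point expressed in camera coordinates."""
    p_h = np.append(p_cam, 1.0)                      # homogeneous coordinates
    p_world = T_world_gripper @ T_gripper_cam @ p_h
    return p_world[:3]

# Example: a target seen 20 cm in front of the in-hand camera
# p_w = camera_point_to_world(np.array([0.0, 0.0, 0.20]), T_wg, T_gc)
```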

2.8. Computer–Robot Interface

The computer and the iRX6 digital servo controller (robot controller) are linked by Ethernet communication. The computer runs a Python program that implements the object detection model and the 2D pose estimation algorithm, and interacts with external devices such as the fixed and hand-eye cameras, the 6-DOF robot manipulator, and the gripper. The iRX6 receives commands or messages from the computer to move the manipulator towards the target with the desired pose and perform the grasp. Figure 11 shows the overall interface between the computer and the robot controller.

Figure 11. Computer-Robot Interface Schematic diagram.
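As an illustration, target poses might be sent to the controller over a plain TCP socket as in the sketch below; the IP address, port, and message format are placeholders, since the actual iRX6 command protocol is not described here.

```python
import socket

ROBOT_IP = "192.168.0.10"   # placeholder address of the iRX6 controller
ROBOT_PORT = 5000           # placeholder port

def send_target_pose(x, y, z, roll, pitch, yaw):
    """Send a target pose as a comma-separated ASCII message (assumed format)."""
    message = f"{x:.2f},{y:.2f},{z:.2f},{roll:.2f},{pitch:.2f},{yaw:.2f}\n"
    with socket.create_connection((ROBOT_IP, ROBOT_PORT), timeout=2.0) as sock:
        sock.sendall(message.encode("ascii"))
        reply = sock.recv(1024)                      # acknowledgment from the controller
    return reply.decode("ascii", errors="replace")
```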


3. Results

Figure 12. The procedure's phases in order.

As seen in the figure above, the manipulator initially awaits instructions from the computer. Once the image from the fixed camera is processed and the object is detected, its robot coordinates are acquired and sent to the robot servo controller to approach the target. Immediately after, the ChAruco board pose is detected from the in-hand camera image and used to orient the end effector. Then, the 3D position of the object is determined, with the depth acquired by solving the PnP problem using landmarks from the ChAruco board. With the pose of the object defined, the manipulator is able to grasp the target and place it at the desired position. Finally, the robot arm returns to its initial position, ending the process.
4. Conclusion

The employment of the ChAruco board alongside CNN object detection proved feasible for performing pick-and-place tasks. However, the sequence of motions is not smooth enough to compete with methods that use depth-sensor devices, such as [13].

In the near future we hope to update the method presented in this project so that it requires less intrusive landmarks near the target, enhancing its versatility and range of applications while retaining its performance and low cost.
References

[1] E. Zvornicanin, "What is YOLO Algorithm?," 4 November 2022. [Online]. Available: https://www.baeldung.com/cs/yolo-algorithm. [Accessed 15 December 2022].

[2] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[3] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

[4] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, real-time object detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[5] G. Jocher, YOLOv5 by Ultralytics, 2020.

[6] Roboflow, "Image Augmentation," [Online]. Available: https://docs.roboflow.com/image-transformations/image-augmentation.

[7] Make Sense AI, [Online]. Available: https://www.makesense.ai.

[8] J. Solawetz, "Mean Average Precision (mAP) in object detection," 25 November 2022. [Online]. Available: https://blog.roboflow.com/mean-average-precision/.

[9] Ultralytics, "Load YOLOv5 from PyTorch Hub ⭐ · issue #36 · ultralytics/yolov5," [Online]. Available: https://github.com/ultralytics/yolov5/issues/36.

[10] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000.

[11] G. An, S. Lee, M.-W. Seo, K. Yun, W.-S. Cheong and S.-J. Kang, "Charuco board-based omnidirectional camera calibration method," Electronics, vol. 7, no. 12, 2018.

[12] T. A. Myhre, "Robot camera calibration," [Online]. Available: https://www.torsteinmyhre.name/snippets/robcam_calibration.html.

[13] T.-T. Le, T.-S. Le, Y.-R. Chen, J. Vidal and C.-Y. Lin, "6D pose estimation with combined deep learning and 3D vision techniques for a fast and accurate object grasping," Robotics and Autonomous Systems, vol. 141, 2021.

[14] J. Solawetz, "What is YOLOv5? A guide for beginners," 29 June 2020. [Online]. Available: https://blog.roboflow.com/yolov5-improvements-and-evaluation/.
