
https://doi.org/10.20965/jrm.2017.p0275


Review:

Current Status and Future Trends on Robot Vision Technology


Manabu Hashimoto∗, Yukiyasu Domae∗∗ , and Shun’ichi Kaneko∗∗∗

∗Chukyo University
101-2 Yagoto-Honmachi, Showa-ku, Nagoya, Aichi 466-8666, Japan
E-mail: [email protected]
∗∗ Mitsubishi Electric Corporation

8-1-1 Tsukaguchi, Hon-machi, Amagasaki, Hyogo 661-8661, Japan


E-mail: [email protected]
∗∗∗ Hokkaido University

Kita-14, Nishi-9, Kita-ku, Sapporo 060-0814, Japan


E-mail: [email protected]
[Received January 31, 2017; accepted February 16, 2017]

This paper reviews the current status and future trends in robot vision technology. Centering on the core technology of 3-dimensional (3D) object recognition, we describe 3D sensors used to acquire point cloud data and the representative data structures. From the viewpoint of practical robot vision, we review the performance requirements and research trends of important technologies in 3D local features and the reference frames for model-based object recognition developed to address these requirements. Regarding the latest development examples of robot vision technology, we introduce the important technologies according to purpose such as high accuracy or ease-of-use. Then, we describe, as an application example for a new area, a study of general-object recognition based on the concept of affordance. In the area of practical factory applications, we present examples of system development in areas attracting recent attention, including the recognition of parts in cluttered piles and classification of randomly stacked products. Finally, we offer our views on the future prospects of and trends in robot vision.

Keywords: robot vision, object recognition, 3D feature, industrial robot system, future trends

1. 3D Object Recognition for Robot Vision

1.1. Overview of Industrial Robot Systems

A typical application of object recognition in robot vision is bin picking, as illustrated in Fig. 1, where the position and pose of target objects in a cluttered pile are recognized and grasped (picked) by a robot hand.

Fig. 1. 3D bin-picking system.

The range data (depth data) acquired from a 3-dimensional (3D) sensor of a group of target objects piled randomly in front of the robot is called a point cloud or point group data. Based on this data, the position and pose parameters (rotation matrix R and translation vector t) of individual objects are computed. This data is transmitted to the robot for grasping the target object. As indicated in the figure, there are two major issues in object recognition. The first concerns the 3D measurement required to convert the 3D object into the most accurate data possible. The second concerns the data processing necessary to properly recognize such properties as the position, pose, and type of target object from the data.

The latter can be divided into the "pick and place" task, where the main objective is the grasping and picking by a robot such as in the transfer or sorting of objects on belt conveyors, and the subsequent "pick and insert" task, where the grasped object is precisely mounted (inserted) to another part. Whereas the required accuracy differs with the type of task, the object's pose and position must be computed in both types of task.

There are two functions required of robot vision to perform these tasks: the function to detect the object to be grasped from data consisting of a photograph (or image) of some unknown scene and to recognize its approximate position and pose (coarse recognition), and the function to utilize this result to perform a higher-level recognition (precision alignment). Coarse recognition is important in "pick and place," whereas coarse recognition and precise


© Fuji Technology Press Ltd. Creative Commons CC BY-ND: This is an Open Access article distributed under the terms of
the Creative Commons Attribution-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nd/4.0/).

alignment are both important in "pick and insert." Because precise alignment, for which the iterative closest point (ICP) method [1], which minimizes mismatches between the 3D model and the input data, and its extensions [2–8] are frequently used, is a well-established technology, studies on coarse recognition are more numerous
today. While the distance images or 3D points (point
cloud data) obtained by a range finder provide the ba-
sic input data in object recognition, there are many cases
where monochromatic or color images captured from a
coaxial or near-coaxial viewpoint can be obtained, where
both types of information are integrated and used [9].

1.2. Major 3D Measurement Sensors


Recent years have seen the rapid increase of highly
practical technologies and optical sensor products used to
measure the 3D shape of target objects. The sensors can
be classified into two types in terms of the optical axis:
those based on coaxial measurements, based on the use
of a single optical axis, and those based on triangulation,
based on the use of two or more optical axes. In terms of the energy transfer from the target object to the sensor, they can be divided into passive sensors, which employ only the incident light coming from the target object, and active sensors, which project a light beam to the target object and use the returned light from the object. Thus, four types are possible based on a combination of these types.

Although stereo vision, which is representative of the passive triangulation type, has frequently been used in the past, the use of active sensors has increased in recent years owing to advances in the technology (called "structured light") to generate various light projection patterns. In particular, triangulation sensors that project structured light such as a random-dot pattern are used frequently. This is also because of the wide availability of high-resolution image sensors. Meanwhile, the development of high temporal-resolution optical detectors and the technology to project modulated light has led to the increased practical development of time-of-flight (TOF) sensors, which are based on a type of active coaxial measurement where the issue of insufficient spatial resolution, which had previously presented a bottleneck, is resolved.

The majority of these sensors output point cloud data described in terms of 3D xyz coordinates. In practice, point cloud data is limited to points on an object's surface that have a normal pointing towards the sensor, although it is frequently called 3D data for convenience. Fig. 2 is an example format of the point cloud data used with many sensors. A header that specifies the image size and other items is placed at the head of the data structure, followed by the point cloud data consisting of the xyz coordinate values.

Fig. 2. Example of structure of 3D point cloud data (PCD).
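To make the data structure concrete, the following is a minimal sketch of reading such a file, assuming a simple ASCII layout similar to the PCD format in Fig. 2 (a short header followed by one x y z triplet per line); the field handling is illustrative and not the exact format of any particular sensor.

```python
import numpy as np

def load_ascii_point_cloud(path):
    """Read a simple ASCII point cloud: a small header (e.g., WIDTH/HEIGHT lines),
    then one 'x y z' triplet per line. Layout is illustrative only."""
    header, points = {}, []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens or tokens[0].startswith("#"):
                continue
            if tokens[0].isalpha():          # header line such as "WIDTH 640"
                header[tokens[0].upper()] = tokens[1:]
            else:                            # data line: x y z
                points.append([float(v) for v in tokens[:3]])
    return header, np.asarray(points, dtype=np.float64)

# Usage: header, xyz = load_ascii_point_cloud("scene.txt")
# xyz is an (N, 3) array of surface points seen from the sensor side.
```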
1.3. Current Status of 3D Object Recognition Algorithms

Object recognition technology can be classified into two types: appearance-based and model-based recognition, as illustrated in Fig. 3.

Fig. 3. Appearance-based and model-based object recognition.

Appearance-based recognition is based on information regarding the appearance of the target object. The basic concept is to prepare a database consisting of a large number of images of model objects captured from various angles along with the associated line-of-sight information; then, the target object's position and pose are estimated by determining the image closest to the unknown input data. Because the storage of vast amounts of image data and an inefficient search present practical bottlenecks, methods such as the "assembly plan from observation" (APO) [10], which structures the appearance, or the treatment of images as manifolds parametrically expressed in an eigenspace [11–13], have been developed.
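As a hedged illustration of the appearance-based idea (not the APO or parametric eigenspace implementations cited above), the sketch below builds a small eigenspace from flattened model views with PCA and returns the stored view closest to a query image; array shapes and function names are our own assumptions for illustration.

```python
import numpy as np

def build_eigenspace(view_images, n_components=16):
    """view_images: (num_views, H*W) flattened model views with known poses."""
    mean = view_images.mean(axis=0)
    centered = view_images - mean
    # PCA via SVD; rows of vt are the leading eigen-images
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coords = centered @ basis.T          # each stored view projected into the eigenspace
    return mean, basis, coords

def match_appearance(query_image, mean, basis, coords):
    """Return the index of the stored view (and hence its pose) nearest to the query."""
    q = (query_image - mean) @ basis.T
    return int(np.argmin(np.linalg.norm(coords - q, axis=1)))
```

The nearest stored view directly yields an approximate pose because each database image is rendered or captured from a known viewpoint.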
Model-based recognition is based on the use of 3D object models, where the target object is recognized based on an evaluation of the consistency (concordance) between the model and (parts of) the input scene. Although this requires the capture of 3D input scenes, studies on model-based recognition have steadily increased in number with the development of spatial-code range finders in the 1980s and the release of a simple-to-use sensor, the Kinect sensor, from Microsoft in 2010.

2. Performance Required in Practical Robot Vision

2.1. Role of Robot Vision

The main role of robot vision consists of acquiring and comprehending information of a target object or external surroundings, which is information necessary for planning a robot's action. In this section, we consider the assembly process in factory automation (FA). Many industrial robots are in operation in the FA field. In the past,




they were applied to simple tasks such as conveyance or


welding. In recent years, however, system development
where emphasis is placed on advanced object handling
has progressed, such as cell-production robots [14, 15]
designed for multiproduct variable-quantity production.
One main application example is the assembly process of
electric and electronic products. The assembly process in-
cludes parts supply, which consists of handling objects in
a cluttered pile, to mounting or inserting a part on a prod-
uct, which requires high-level alignment, thus requiring
multiple high-level object manipulation processes.
The first step of object manipulation in the assembly
process is the grasping of an object with an indeterminate
pose. When parts contained in a bag are placed in the parts
bin, the parts are piled in a disorderly manner (Fig. 4). It is necessary to recognize the object's position and pose by robot vision because the part's pose is indeterminate.

Fig. 4. Example of parts supplied in cluttered pile.

After the object has been grasped, the task is divided into "pick and place" and "pick and insert." The latter involves positioning the grasped object to a specific pose. Mounting the part to a product and its orderly placement on a part shelf (kitting) are both included in this category.
In “pick and place,” the emphasis is placed on grasp-
ing the object. In some cases, multiple products must be
recognized and sorted. This requires that the robot vision
perform coarse recognition (estimation of the position and
pose of the target object, estimation of the grasping posi-
tion) and identify the object type. In “pick and insert,” it
is necessary to change the pose of the grasped object and
ensure the final mounting accuracy. Robot vision requires
precision alignment to create an action plan to change the
object's pose.

Fig. 5. Human operator teaching industrial robot.

2.2. Functions and Performance Required in Robot Vision

The performance and functions that are important in practice are discussed below in terms of precision, speed, environmental robustness, and ease of use.

(1) Precision
High assembly precision is required in FA. For example, precision in the sub-millimeter to micrometer order can be required in the assembly of electric or electronic parts. The factors that determine precision in robot vision are the resolution and error characteristics. These depend on both hardware, e.g., camera or projector, and software, e.g., measurement algorithm and image-processing algorithm. Moreover, because the visual field and working distance have a tradeoff relation with precision, careful system design is necessary. There are cases where this is resolved by the parallel use of multiple sensors.

(2) Speed
Because high speed and real-time operability are required in a robot that replaces human workers, high speed is important in robot vision as well. The algorithm's processing time can become excessively lengthy if the pose estimation involves a complex process. Thus, it is important to produce an algorithm appropriate for its objective. Moreover, achieving constant task time and tact time is important in FA that involves repetitive tasks.

(3) Robustness Against Environmental Disturbance
It is important for the robot to be able to perform in a stable manner regardless of the environment. Robot vision is affected by environmental light and temperature. Because it is particularly affected by the environmental light, algorithm design is important. Furthermore, the effects of noise, vibrations, dust, water droplets, and other factors must also be controlled depending on the application environment. This is frequently addressed by hardware design.

(4) Ease of Use
It is also important to consider the operations of non-expert workers when designing a robot vision system. For example, the adjustment of parameters in image processing is frequently difficult to understand for non-experts. Thus, an important issue in practice is to reduce the number of parameters, make parameter adjustments easy to perform, and automate the adjustment process. Fig. 5 illustrates a human operator teaching movements to an




industrial robot. It is important to simplify this task.


Furthermore, "robot-human collaboration" has become a frequently
used keyword in recent years in the field of industrial
robots. To work compatibly alongside human workers,
ease of use is important in addition to securing adequate
safety and basic performance; this makes it important for
the robot to be capable of a high degree of automatic task
planning as it performs fine sensing of the target object,
human workers, task, and surrounding environment.

3. Trends and Major Technologies in Model-Based Object Recognition
3.1. Outline of Object Recognition Based on 3D Features
The role of 3D features in object recognition is to im-
part an identity such as the local shape to keypoints de-
tected among the 3D points. Using these identities, it
is possible to produce indices of correct matching when
comparing keypoints. Whereas this makes it preferable to
use a feature that strongly identifies that point, it is also
necessary to note that the feature must be described sta-
bly in both the model and input data. This performance
item is referred to as repeatability, which is an important
performance factor in feature design. Whereas many ap-
proaches to produce 3D features have been proposed, they
can be divided into those that describe the features near
keypoints and those that describe the relation between the
positions or normals of two or three points.

3.2. Major 3D Features

The description of features surrounding the keypoint is normally based on 3D points within a certain region (description region) centered on the keypoint and their associated information. The description can consist either of computed information stored in raw form or statistical values represented by histograms. Furthermore, the description region is sometimes divided in advance into multiple subregions to preserve the distribution of the features computed for each subregion. This is equivalent to having the features contain the rough position information.

A representative example of a feature in this category is the Signature of Histograms of OrienTations (SHOT) [16] proposed by Tombari et al. As indicated in Fig. 6, a spherical region surrounding a keypoint is divided into 32 regions, each of which is represented by a histogram of inner products expressing the relation between the normal vectors of points in that region and the keypoint normal, to produce a 352-dimensional feature.

Fig. 6. SHOT features [16].
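The following is a much-simplified sketch in the spirit of this family of descriptors, not the 32-region, 352-dimensional SHOT of [16]: it splits the support sphere into eight octants and histograms the dot products between point normals and the keypoint normal; the region partitioning, bin count, and normalization are our simplifications.

```python
import numpy as np

def shot_like_descriptor(points, normals, keypoint, keypoint_normal,
                         radius=0.05, bins=8):
    """Simplified SHOT-style descriptor: 8 octant subregions x 'bins' cosine bins.
    points/normals are (N, 3); the real SHOT uses 32 regions and 352 dimensions."""
    rel = points - keypoint
    dist = np.linalg.norm(rel, axis=1)
    mask = (dist < radius) & (dist > 1e-9)
    rel, nrm = rel[mask], normals[mask]
    # octant index from the signs of the local coordinates (stand-in for SHOT's grid)
    octant = (rel[:, 0] > 0).astype(int) * 4 + (rel[:, 1] > 0) * 2 + (rel[:, 2] > 0)
    cos = np.clip(nrm @ keypoint_normal, -1.0, 1.0)
    desc = np.zeros((8, bins))
    for o in range(8):
        hist, _ = np.histogram(cos[octant == o], bins=bins, range=(-1.0, 1.0))
        desc[o] = hist
    total = desc.sum()
    return (desc / total).ravel() if total > 0 else desc.ravel()
```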
Features that relate multiple points describe the geometric relations between two or three points. Thus, the two major design elements are the selection of sampled points on the model surface to be combined and the selection of geometric parameters to use for that combination. A representative example is the point pair feature (PPF) [17] proposed by Drost et al., illustrated in Fig. 7.

Fig. 7. Point pair features [17].

As indicated in Fig. 7(a), two points (point pair) are sampled from among all 3D points on the object model. Then, four parameters, i.e., the distance (F1) between the two points, the angles between the line segment connecting the two points and the normals of the two points (F2, F3), and the angle (F4) between the normals at the two points, are defined as the four-dimensional feature F.
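To make the four parameters concrete, here is a minimal sketch of computing F = (F1, F2, F3, F4) for one oriented point pair following the definition above; the quantization and hash-table voting of [17] are omitted, and the function name is ours.

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """F1: distance; F2/F3: angles between the connecting segment and each normal;
    F4: angle between the two normals (normals assumed to be unit length)."""
    d = p2 - p1
    f1 = np.linalg.norm(d)
    if f1 < 1e-12:
        return None                      # degenerate pair
    d_unit = d / f1
    angle = lambda a, b: np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    return np.array([f1, angle(d_unit, n1), angle(d_unit, n2), angle(n1, n2)])
```

Quantized versions of such features are typically used as keys into a lookup table so that matched model pairs can vote for a rigid-body transformation, as described next.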




As indicated in Fig. 7(b), the transformation parameters between point pairs are determined after they are matched. When the parameters are voted to the voting space, a parameter with a high vote count indicates a rigid-body transformation parameter that is common to many point pairs; hence, even if a certain number of misrecognitions are produced, it remains possible to determine the correct position and pose of the model. An extended PPF method has been proposed by Choi et al. [18], who introduced an improved method to sample the point pairs.

Another feature type that belongs to the same category is the vector pair feature [19] displayed in Fig. 8. In this method, three 3D points, which is the minimum set of data to identify a 3D pose, are represented by two space vectors that share a common initial point; the three end points are given various descriptors. Misrecognitions are eliminated by analyzing the occurrence probability of the vector pairs in the model and by selecting those that are unique.

Fig. 8. Vector pair features [19].

3.3. Major Local Reference Frames

A local reference frame (LRF) refers to the coordinate system set up at each keypoint, as indicated in Fig. 9. The LRF's most important role is in the definition of features. The various features described in the previous section are expressed numerically based on the LRF, such that LRF stability is directly related to the stability of the features. Because the LRF expresses the geometric relation of two matched keypoints, it can also be used to estimate the object's pose. The LRF is typically a rectangular 3D coordinate system. The first axis (z-axis) is in many cases the normal vector of the local surface surrounding the keypoint under consideration; this can be determined in a relatively stable manner. The second axis (x-axis) is a vector normal to this. The third axis (y-axis) is computed as the vector product of the first and second axes. Thus, the setting of the second axis (x-axis) is the most important practical issue when setting an LRF.

Fig. 9. Local reference frame (LRF).

The most basic LRF is obtained by Mian's method [20], where the covariance matrix is computed from the 3D coordinates of the points surrounding the keypoint, and the eigenvectors are used to build the LRF. In this method, the eigenvectors obtained from the covariance matrix of the coordinate data of points within a spherical region with radius r are directly used as the LRF. For example, for a keypoint sampled from among the points lying in a quasi-planar region, the third principal eigenvector is equivalent to the normal vector. Because the eigenvectors form an orthogonal basis, it is natural to employ them as the LRF. Tombari et al. [16] improved this method by computing the eigenvectors from a weighted covariance matrix, using weights that decrease with the distance from the keypoint, and thus drastically improved the reproducibility. In rotational projection statistics (RoPS-LRF) [21], the LRF is defined to absorb the effect of the density differences between the matched point groups.
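A minimal sketch of this eigenvector-based LRF construction follows: it computes a distance-weighted covariance of the neighborhood, in the manner of the refinement by Tombari et al. [16], and returns the eigenvectors sorted by decreasing eigenvalue; the sign disambiguation used in practice is omitted, and the parameter names are ours.

```python
import numpy as np

def local_reference_frame(points, keypoint, radius):
    """Weighted-covariance LRF: rows of the returned 3x3 matrix are the x, y, z axes
    (largest to smallest eigenvalue); sign disambiguation is omitted for brevity."""
    rel = points - keypoint
    dist = np.linalg.norm(rel, axis=1)
    mask = dist < radius
    rel, dist = rel[mask], dist[mask]
    w = radius - dist                         # weights decrease with distance from keypoint
    cov = (rel * w[:, None]).T @ rel / w.sum()
    eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
    order = eigvals.argsort()[::-1]           # descending: x, y, then z (normal-like) axis
    return eigvecs[:, order].T
```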
3.4. Examples of Recent Studies

An example of a study on 3D features is the Shell Histograms and Occupancy from Radial Transform (SHORT) [22] proposed by Takei et al. The use of a support sphere surrounding the keypoint is the same as in other approaches; however, the interior is removed to produce a shell structure and the 3D points in the shell are counted to compute their space occupancy ratio (Fig. 10(a)).

Fig. 10. SHORT features [22].

Because this index has the tendency to be high on planar surfaces and low in shape discontinuities, it can be used to detect keypoints. Furthermore, by setting multiple support spheres (shell regions) with different radii, as in Fig. 10(b), the directions (from the keypoint) of the points in each shell layer can be accumulated as histograms to produce features. Thus, SHORT produces features that indicate shape characteristics without the explicit computation of normal vectors, in both keypoint detection and the computation of its features.
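A small sketch of the shell-occupancy idea is shown below, assuming occupancy is measured as the fraction of coarse direction cells in the shell that contain at least one point; the exact occupancy definition and cell layout of SHORT [22] differ, so this is purely an illustration of why the index is high on planar regions and low at shape discontinuities.

```python
import numpy as np

def shell_occupancy(points, keypoint, r_inner, r_outer, n_cells=26):
    """Fraction of direction cells in the shell [r_inner, r_outer) that contain a point.
    Cells here are a coarse quantization of point directions (illustrative only)."""
    rel = points - keypoint
    dist = np.linalg.norm(rel, axis=1)
    in_shell = (dist >= r_inner) & (dist < r_outer)
    if not np.any(in_shell):
        return 0.0
    dirs = rel[in_shell] / dist[in_shell, None]
    # quantize each direction onto a coarse 3x3x3 sign grid (center cell excluded)
    cells = set(map(tuple, np.sign(np.round(dirs)).astype(int)))
    cells.discard((0, 0, 0))
    return len(cells) / float(n_cells)
```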
Akizuki et al. proposed a stable LRF called the dominant projected normal (DPN)-LRF [23]. The concept is displayed in Fig. 11. To address the density differences between point groups to be matched, the axis directions are computed by considering the object's original shape present among the measured points, similar to RoPS-LRF [21]. Furthermore, to improve robustness against partial occlusions, the dominant direction is computed by analyzing the distribution of normal directions.

Fig. 11. DPN-LRF [23].

4. Latest Development Examples in Robot Vision

Practical examples of robot vision are introduced below.




4.1. Accuracy: Increasing Accuracy of ICP

We introduce a development example of a highly efficient and accurate method to measure complex 3D shapes that include steep surfaces [24]. This is an extension of M-ICP [25], which robustly matches 3D shapes with outlying data (unexpected noise or non-overlapping data). In such shapes, the measurement pitch on the flat planar region (for instance, a plane normal to the sensor's optical axis) is small and hence can produce good measurements; however, on the steep surface, this is larger (the distance between measurement points increases) and produces a low measurement accuracy. There is a built-in tendency in ICP to place emphasis on the degree of correspondence of the latter. Furthermore, defective shapes tend to appear more frequently on these steep surfaces. The approach introduced in this study reduces the relative weights of such ill-conditioned data when evaluating the correspondence. A correspondence evaluation value is designed by introducing weighting coefficients sensitive to the gradients of the depth measurements arranged in the format of a range image. Furthermore, an initial stage is introduced where only data near the 3D edges of complex shapes are used to increase the processing speed. Fig. 12 illustrates the actual defect shape and Fig. 13 is a graph of the defect region determined by the correspondence processing (cross points indicate the defect on the steep surface).

Fig. 12. Example of mold defect.

Fig. 13. Detected defect region.
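To illustrate the weighting idea, the following is a hedged sketch, not the exact evaluation value of [24]: one point-to-point ICP iteration in which each correspondence is down-weighted by the local depth gradient of the range image, so that steep, ill-conditioned regions contribute less to the estimated rigid transform. The function names, the callbacks, and the weighting form 1/(1+|gradient|) are our assumptions.

```python
import numpy as np

def weighted_icp_step(src, dst_map, depth, pixel_of):
    """One ICP update. src: (N,3) model points; dst_map(p) returns the nearest scene
    point for p; depth: range image; pixel_of(q) maps a 3D point to its (row, col)."""
    gy, gx = np.gradient(depth)
    pairs, weights = [], []
    for p in src:
        q = dst_map(p)                       # nearest-neighbour correspondence
        r, c = pixel_of(q)
        g = np.hypot(gx[r, c], gy[r, c])     # local steepness of the range image
        pairs.append((p, q))
        weights.append(1.0 / (1.0 + g))      # steep (ill-conditioned) data weighs less
    P = np.array([p for p, _ in pairs]); Q = np.array([q for _, q in pairs])
    w = np.array(weights); w /= w.sum()
    mp, mq = (w[:, None] * P).sum(0), (w[:, None] * Q).sum(0)
    H = ((P - mp) * w[:, None]).T @ (Q - mq)  # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                  # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mq - R @ mp
    return R, t
```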
4.2. Accuracy: Coarse-to-Fine Vision Strategy to Achieve Both Wide Visual Field and Accuracy

The use of multiple sensors is a valid method to resolve the tradeoff between wide visual field and accuracy. In this section, we introduce a 3D measurement and recognition method used for cables with connectors [26]. Because the cable part is flexible, a vision system with a wide visual field is required to measure the pose of a cable-with-connector and manipulate it with a robot; high manipulation accuracy is necessary for the assembly process. The proposed method achieves high-accuracy 3D measurement using structured light with a camera and projector. However, alone this would limit the visual field. Thus, a sensor based on structured light is mounted on the industrial robot to provide it with a hand-eye. A wide visual field is ensured with the robot's motion and the camera's motion stereo. The developed sensing method employs a coarse-to-fine strategy, where the motion stereo provides coarse recognition and the structured light fine recognition. An example of measurement results is presented in Fig. 14. Figs. 14(a) and (b), respectively, display the measured and estimated coarse shape of the cable part acquired by the motion stereo. The accuracy is on the millimeter order, which is coarse for object handling, yet sufficient to estimate the position of the connector part and determine the next measurement viewpoint. Fig. 14(c) displays the results of the pose measurement of the connector part based on the structured light. The measurement viewpoint is determined based on (b). In this manner, high-accuracy measurement of the connector is performed efficiently. The proposed approach thus makes it possible to recognize the pose of the cable connector efficiently and at an accuracy suitable for the assembly process.

Fig. 14. Parallel use of two sensing methods for coarse-to-fine recognition: (a) coarse 3D shape of cable-with-connector acquired by motion stereo, (b) pose of cable-with-connector estimated from results of (a), and (c) estimated results of connector pose based on structured light after viewpoint has been determined from estimated position of (b).

4.3. Ease of Use: Hand-Eye Calibration with Minimum Number of Viewpoints

Calibration between the robot-coordinate and vision-coordinate systems is a prerequisite adjustment of robot vision. Whereas markers with known shapes are typically used, the ease of actual operation differs significantly depending on how the transformation matrix between the coordinate systems is computed. In the case of a hand-eye, the information to compute the transformation matrix between coordinate systems is obtained by "measuring a known marker using the vision and 'prodding' the marker with the robot." However, the teaching process to

280 Journal of Robotics and Mechatronics Vol.29 No.2, 2017



prod the marker can be difficult in environments such as that in Fig. 14. Moreover, accuracy can be an issue if this is performed manually. Therefore, in a hand-eye, the transformation matrix is computed by "measuring the marker several times from different viewpoints using the vision system and acquiring the corresponding robot coordinate positions." Yet, it is frequently difficult to increase the viewpoints used for teaching in narrow environments. Thus, it is important to reduce the number of viewpoints [27]. In the proposed approach, the minimum number of viewpoints is derived solely from linear algebraic computation, without optimization, and used for calibration. In comparison to the method of "prodding the marker," improved performance is achieved in both accuracy and work time.
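The minimum-viewpoint derivation of [27] is not reproduced here; as a hedged illustration of the underlying computation, the sketch below estimates the rigid transform between the vision and robot coordinate systems from a few corresponding marker positions expressed in both frames (a standard least-squares/SVD fit); at least three non-collinear observations are assumed.

```python
import numpy as np

def fit_rigid_transform(marker_in_vision, marker_in_robot):
    """Least-squares R, t with robot = R @ vision + t, from (N,3) correspondences
    of the same marker positions in both coordinate systems (N >= 3, non-collinear)."""
    P, Q = np.asarray(marker_in_vision), np.asarray(marker_in_robot)
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mq - R @ mp
    return R, t
```

This is the same point-set registration machinery sketched for ICP above, applied once across coordinate systems; the contribution of [27] lies in how few marker observations are needed and how they are chosen.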
formation originating from the shape, which is expected
to realize a function, is considered an attribute of the part
4.4. New Area: Affordance-Based General Object and called an “affordance label” in this study. Examples
Recognition for the class “cup” are displayed in Fig. 16.
Versatility to address different commodities is required From the viewpoint of general object recognition, it
in life-support robots for use at home or in public facili- suffices to extract features common to the objects in a
ties. This is a problem in general object recognition that given class; there is no requirement for such features to
cannot be addressed with only 2D image-based classifi- be connected to the concept of affordance. However, us-
cation because the part to be managed must ultimately be ing affordance, the objects will be divided into parts upon
identified or the pose recognized for the robot hand to ma- completion of the classification and important informa-
nipulate the object. tion regarding robot manipulation can be obtained.
In this study, the authors acknowledge the fact that Because daily commodities are designed on the
many commodities such as cups and hammers have func- premise of being used by humans, in the case of the class
tions (e.g., store liquid, be grasped by the hand) to suit “cup” for example, there are certain inherent limitations




Fig. 16. Example of common affordance labels existing in objects of class "cup."

Because daily commodities are designed on the premise of being used by humans, in the case of the class "cup" for example, there are certain inherent limitations in the size of the part that holds the liquid or that of the hand-grasped part. In this study, therefore, the distribution of the occupation ratio of parts that have been assigned affordance labels, described previously, is defined as the "affordance feature" and considered a common feature for that class. The correlation of the affordance feature is high between objects in the same class but low between objects in different classes. Examples of affordance features for the classes "cup" and "spoon" are displayed in Fig. 17.

Fig. 17. Example of proposed affordance feature for class "cup" and class "spoon."
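A minimal sketch of this idea follows, under our own assumptions about data layout: each segmented object is represented by an array of per-point affordance labels, the affordance feature is the normalized label-occupancy histogram, and the class is taken from the most correlated dictionary entry; the label names and the correlation measure are illustrative, not the authors' exact choices.

```python
import numpy as np

AFFORDANCE_LABELS = ["contain", "grasp", "pound", "scoop", "other"]  # illustrative set

def affordance_feature(point_labels):
    """Distribution of the occupation ratio of affordance labels within one segment."""
    counts = np.array([np.sum(point_labels == lab) for lab in AFFORDANCE_LABELS], float)
    return counts / max(counts.sum(), 1.0)

def classify_by_affordance(feature, class_dictionary):
    """class_dictionary: {class_name: reference affordance feature}. Returns the class
    whose reference feature correlates best with the observed one."""
    def corr(a, b):
        a, b = a - a.mean(), b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return (a @ b) / denom if denom > 0 else -1.0
    return max(class_dictionary, key=lambda name: corr(feature, class_dictionary[name]))

# Usage with hypothetical per-point labels of one segment:
# labels = np.array(["contain"] * 120 + ["grasp"] * 40 + ["other"] * 10)
# f = affordance_feature(labels)
# cls = classify_by_affordance(f, {"cup": np.array([0.7, 0.25, 0.0, 0.0, 0.05])})
```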
The flow of object recognition using affordance features is illustrated in Fig. 18. In this method, a 3D scene reconfigured using an approach such as Kinect Fusion is used as the input. First, each object in the input scene is segmented and an affordance label is assigned to each point. Label classification is performed using Random Forests. Next, the distribution of the occupation ratios of the affordance labels in each segment is computed as its affordance feature, which is then compared to the affordance features (dictionary data) for various classes prepared in advance to estimate the class name of the object in the scene.

Fig. 18. Proposed general object recognition algorithm using the affordance feature.

The extraction results of the affordance labels for the four classes, cup, hammer, spoon, and spatula, are displayed in Fig. 19. Based on the input data, it can be observed that the "contain" and "grasp" labels have been extracted at the proper respective positions for the case of "cup," and similarly for the cases of "hammer," "spoon," and "spatula."

Fig. 19. Result of extraction of affordance labels for cup, spatula, hammer, and spoon.

Figure 20 displays an example of recognition results of three types of common objects (commodities) using the proposed method.

Fig. 20. Result of object recognition using the proposed method.

5. Latest Development Example of Robot System with Vision

5.1. Robot Cell for Assembly Including Flexible Objects

The manipulation of flexible objects is an important issue in robots. In this section, we introduce a robot




cell [28] that automatically assembles servo amps. The overall view of the system is presented in Fig. 21. The cell contains two robots, one equipped with the robot vision described in Section 4.2. The other robot mounts a kitted (i.e., orderly placed) cable-with-connector to the product. It also mounts a separate rigid-body part, which has also been kitted.

Fig. 21. Robot-cell assembling products that include flexible objects.

The pose of each cable-with-connector, which has been supplied in a cluttered pile, is recognized using the robot vision described in Section 4.2. The recognized cable-with-connector is picked from the supply table and placed on the kitting table. Once the part has been kitted (i.e., laid out orderly), the indeterminate pose of the flexible objects is resolved. Therefore, conventional industrial robots, based on teaching, can be used to achieve assembly work in a rational manner.

5.2. Bin-Picking System for Parts with Various Shapes

Whereas there have been many research and development examples of the supply and bin picking of piled parts, it has been considered difficult to automate the manipulation of diverse parts. In this section, we introduce a robot cell [29] that addresses a piled supply of diverse parts. The overall view of the robot cell is provided in Fig. 22. A feature of this cell is that the part supply has been divided into two processes. One is the process of picking the part from a cluttered pile. The other is the process of re-grasping the picked part to the desired pose. When one attempts to perform the processes of recognizing an object's pose, picking it from a cluttered pile, and kitting it at the desired pose in a single step, the complexity of recognition processing, the limited range of the robot's movement, and high-level manipulation present a bottleneck, especially for processing diverse parts. A systematic solution to this is to divide the process into two parts. Of the four robots in Fig. 22, the left-most robot undertakes the first process of picking the part. It is equipped with a hand-eye consisting of a 3D sensor. An algorithm specifically designed for the picking process is used for recognition. The graspable position of the part is searched by matching the model of the cross-section shape of the robot hand and the range image acquired by the 3D sensor [30]. This makes it possible to pick multiple parts with different shapes supplied separately using a single robot. The three robots of the later stage re-grasp the part. The first robot places the part on the flat table. Two-dimensional vision installed on the ceiling recognizes the part's pose. The re-grasping motion with the shortest path according to the part's pose is realized based on a shape model of the part and a model of the robot hand. This makes it possible to realize, for the first time, a high-speed general-purpose multiple-parts supply. Although the experimental system is a pipeline design consisting of four robots, it is possible to realize a similar process with a single robot. Comparisons with conventional robot systems and part feeders are provided in detail in [29].

Fig. 22. Robot-cell supplying various types of mechanical parts.
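The following is a hedged sketch of the idea of matching a hand cross-section model against the range image, in the spirit of the graspability evaluation in [30] but not its exact formulation: a binary mask of the gripper's contact regions and a mask of its collision regions are convolved over a thresholded depth map, and positions where the contact area is filled while the collision area is free score highly. The mask shapes and scoring rule are our assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def graspability_map(depth, grasp_depth, contact_mask, collision_mask):
    """Score each image position for grasping with a parallel gripper.
    depth: range image; grasp_depth: depth at which the gripper closes;
    contact_mask / collision_mask: binary templates of where the object must /
    must not be at that depth (a simplified stand-in for the models in [30])."""
    occupied = (depth < grasp_depth).astype(float)       # object present above grasp depth
    contact = convolve2d(occupied, contact_mask, mode="same")
    collision = convolve2d(occupied, collision_mask, mode="same")
    # high contact coverage and low collision coverage => graspable position
    return contact / contact_mask.sum() - collision / max(collision_mask.sum(), 1)

# Usage: take the argmax of graspability_map(...) as the candidate grasp position.
```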
Fig. 23. Intelligent robot classifying randomly stacked items in bin: (a) illustration of robot and setup and (b) actual robot system.

5.3. Robot that Classifies Randomly Stacked Items

With the introduction of robotic automation to distribution warehouses, which has increased significantly in recent years, the task of picking after classifying multiple types of products is important. Fig. 23 presents a robot




system that classifies and picks different products that have been randomly stacked on shelves. Classification of the product is important in this process. This problem was selected in the Amazon Picking Challenge, an international competition on the use of robots for tasks in distribution warehouses held during the 2015 IEEE International Conference on Robotics and Automation (ICRA). Fig. 24 displays a schematic of the object classification algorithm used by the authors in the 2015 competition. By combining a method to recognize the object's pose based on color features with a method to recognize the object's position and pose based on shape features using point clouds, we realized a recognition method that can address different types of products. This system was used to successfully classify and pick 28 types of products. This competition is held annually; recognition technology is continually progressing.

Fig. 24. Object classification algorithm which is used in the system shown in Fig. 23.

5.4. Picking Robot that Utilizes Deep Learning

Although the majority of the teams in the 2015 Amazon Picking Challenge employed feature-based object recognition [31, 32] similar to that illustrated in Fig. 24, the majority of the teams in 2016 employed object recognition based on deep learning, with improved recognition performance. Numerous application examples of deep learning are beginning to appear in robot vision as well. This is also true for the detection of the grasping point as described in Section 5.2. Since Jiang et al. proposed a method of detecting the grasping point based on the use of deep learning in 2014 [33], several learning-based approaches to grasping have been proposed [34]. Deep learning realizes end-to-end learning from the image to the robot's motion and has the potential to significantly improve the ease of use of robot-vision systems.

6. Future Prospects of Robot Vision Technologies

The progress of robot vision technology is supported by advances in both measurement/sensing technology to produce accurate data of the target object and algorithms to analyze the acquired information and estimate not only the object's position and pose but also the product type or the robot's approach position. The development speed is accelerating owing to the availability of high-performance machine learning algorithms as represented by deep learning, and the changing environment such as the accumulation of vast digitalized knowledge of objects and object recognition in the enormous cloud environment.

For example, based on the reinforcement-learning framework, Levine et al. repeated 800,000 pickings with 14 robot arms to detect the grasping point and realize general-purpose picking with a high success rate [35]. In [36], the authors went beyond a search of the grasping point in the image and succeeded in teaching the robot motion (torque commands for the axes) for a required task by providing it with images of task success and failure. Thus, they presented a framework for robot learning of a particular task based on "learning by imitation." Although there remain practical issues with respect to accuracy and speed, high expectations are being placed on this as a technology that may dramatically improve the ease of use of robot vision from the user's viewpoint.

The increasing speed of the vision system is another important trend. Technologies are under development for high-speed image sensor elements and the high-speed image processing methods and architectures based on these elements [37, 38]. High expectations are being placed on their expansion to new applications such as the recognition of high-speed moving objects and high-speed control of high-speed robots.

References:
[1] P. J. Besl and N. D. McKay, "A Method for Registration of 3-D Shapes," IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol.14, No.2, pp. 239-256, 1992.
[2] S. Granger and X. Pennec, "Multi-scale EM-ICP: A Fast and Robust Approach for Surface Registration," European Conf. on Computer Vision, Vol.2353, pp. 418-432, 2002.
[3] D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek, "The Trimmed Iterative Closest Point Algorithm," Proc. Int. Conf. on Pattern Recognition, Vol.3, pp. 545-548, 2002.
[4] T. Zinßer, J. Schmidt, and H. Niemann, "A Refined ICP Algorithm for Robust 3-D Correspondence Estimation," Proc. Int. Conf. on Image Processing, Vol.3, pp. II-695-8, 2003.
[5] S. Kaneko, T. Kondo, and A. Miyamoto, "Robust Matching of 3D Contours using Iterative Closest Point Algorithm Improved by M-estimation," Pattern Recognition, Vol.36, Issue 9, pp. 2041-2047, 2003.
[6] A. W. Fitzgibbon, "Robust Registration of 2D and 3D Point Sets," Image and Vision Computing, Vol.21, pp. 1145-1153, 2003.
[7] J. M. Phillips, R. Liu, and C. Tomasi, "Outlier Robust ICP for Minimizing Fractional RMSD," Int. Conf. on 3-D Digital Imaging and Modeling, pp. 427-434, 2007.
[8] A. Nuchter, K. Lingemann, and J. Hertzberg, "Cached K-d Tree Search for ICP Algorithms," Int. Conf. on 3-D Digital Imaging and Modeling, pp. 419-426, 2007.
[9] K. Tateno, D. Kotake, and S. Uchiyama, "A Model Fitting Method Using Intensity and Range Images for Bin-Picking Applications," IEICE Trans. Inf. & Syst., Vol.J94-D, No.8, pp. 1410-1422, 2011 (in Japanese).
[10] K. Ikeuchi and S. B. Kang, "Assembly Plan from Observation," AAAI Technical Report FS-93-04, pp. 115-119, 1993.
[11] H. Murase and S. K. Nayar, "3D Object Recognition from Appearance – Parametric Eigenspace Method –," IEICE Trans. Inf. & Syst., Vol.J77-D-II, No.11, pp. 2179-2187, 1994 (in Japanese).
[12] S. Ando, Y. Kusachi, A. Suzuki, and K. Arakawa, "Pose Estimation of 3D Object Using Support Vector Regression," IEICE Trans. Inf. & Syst., Vol.J89-D, No.8, pp. 1840-1847, 2006 (in Japanese).
[13] Y. Shibata and M. Hashimoto, "An Extended Method of the Parametric Eigenspace Method by Automatic Background Elimination," Proc. Korea-Japan Joint Workshop on Frontiers of Computer Vision, pp. 246-249, 2013.
[14] H. Yonezawa, H. Koichi et al., "Long-term operational experience with a robot cell production system controlled by low carbon-footprint Senju (thousand-handed) Kannon Model robots and an approach to improving operating efficiency," Proc. of Automation Science and Engineering, pp. 291-298, 2011.
[15] H. Do, T. Choi et al., "Automation of cell production system for cellular phones using dual-arm robots," J. of Advanced Manufacturing Technology, Vol.83, No.5, pp. 1349-1360, 2016.
[16] F. Tombari, S. Salti, and L. D. Stefano, "Unique Signatures of Histograms for Local Surface Description," European Conf. on Computer Vision, pp. 356-369, 2010.
[17] B. Drost, M. Ulrich, N. Navab, and S. Ilic, "Model Globally, Match Locally: Efficient and Robust 3D Object Recognition," IEEE Computer Vision and Pattern Recognition, pp. 998-1005, 2010.
[18] C. Choi, Y. Taguchi, O. Tuzel, M. Liu, and S. Ramalingam, "Voting-Based Pose Estimation for Robotic Assembly Using a 3D Sensor," IEEE Int. Conf. on Robotics and Automation, pp. 1724-1731, 2012.
[19] S. Akizuki and M. Hashimoto, "High-speed and Reliable Object Recognition Using Distinctive 3-D Vector-Pairs in a Range Image," Int. Symposium on Optomechatronic Technologies (ISOT), pp. 1-6, 2012.
[20] A. Mian, M. Bennamoun, and R. Owens, "On the Repeatability and Quality of Keypoints for Local Feature-based 3D Object Retrieval from Cluttered Scenes," Int. J. of Computer Vision, Vol.89, Issue 2-3, pp. 348-361, 2010.
[21] Y. Guo, F. Sohel, M. Bennamoun, M. Lu, and J. Wan, "Rotational Projection Statistics for 3D Local Surface Description and Object Recognition," Int. J. of Computer Vision, Vol.105, Issue 1, pp. 63-86, 2013.
[22] S. Takei, S. Akizuki, and M. Hashimoto, "SHORT: A Fast 3D Feature Description based on Estimating Occupancy in Spherical Shell Regions," Int. Conf. on Image and Vision Computing New Zealand (IVCNZ), 2015.
[23] S. Akizuki and M. Hashimoto, "DPN-LRF: A Local Reference Frame for Robustly Handling Density Differences and Partial Occlusions," Int. Symposium on Visual Computing (ISVC), LNCS 9474, Part I, pp. 878-887, 2015.
[24] H. Kayaba, S. Kaneko, H. Takauji, M. Toda, K. Kuno, and H. Suganuma, "Robust Matching of Dot Cloud Data Based on Model Shape Evaluation Oriented to 3D Defect Recognition," IEICE Trans. D, Vol.J95-D, No.1, pp. 97-110, 2012.
[25] S. Kaneko, T. Kondo, and A. Miyamoto, "Robust matching of 3D contours using iterative closest point algorithm improved by M-estimation," Pattern Recognition, Vol.36, pp. 2041-2047, 2003.
[26] Y. Domae, H. Okuda, Y. Kitaaki, Y. Kimura, H. Takauji, K. Sumi, and S. Kaneko, "3-D Sensing for Flexible Linear Object Alignment in Robot Cell Production System," J. of Robotics and Mechatronics, Vol.22, No.1, pp. 100-111, 2010.
[27] Y. Domae, S. Kawato et al., "Self-calibration of Hand-eye Coordinate Systems by Five Observations of an Uncalibrated Mark," IEEJ Trans. on Electronics, Information and Systems, Vol.132, No.6, pp. 968-974, 2011 (in Japanese).
[28] R. Haraguchi, Y. Domae et al., "Development of Production Robot System that can Assemble Products with Cable and Connector," J. of Robotics and Mechatronics, Vol.23, No.6, pp. 939-950, 2011.
[29] A. Noda, Y. Domae et al., "Bin-picking System for General Objects," J. of the Robotics Society of Japan, Vol.33, No.5, pp. 387-394, 2015 (in Japanese).
[30] Y. Domae, H. Okuda et al., "Fast graspability evaluation on single depth maps for bin picking with general grippers," Proc. of ICRA, pp. 1997-2004, 2014.
[31] N. Correll, K. E. Bekris et al., "Lessons from the Amazon Picking Challenge," arXiv:1601.05484, 2016.
[32] R. Jonschkowski, C. Eppner et al., "Probabilistic Multi-Class Segmentation for the Amazon Picking Challenge," http://dx.doi.org/10.14279/depositonce-5051, 2016.
[33] I. Lenz, H. Lee et al., "Deep Learning for Detecting Robotic Grasps," Proc. of ICRA, pp. 1957-1964, 2016.
[34] L. Pinto and A. Gupta, "Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours," arXiv:1509.06825, 2015.
[35] S. Levine, P. Pastor et al., "Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection," arXiv:1603.02199, 2016.
[36] C. Finn, X. Y. Tan et al., "Deep Spatial Autoencoders for Visuomotor Learning," arXiv:1509.06113, 2016.
[37] T. Komuro, Y. Senjo, K. Sogen, S. Kagami, and M. Ishikawa, "Real-Time Shape Recognition Using a Pixel-Parallel Processor," J. of Robotics and Mechatronics, Vol.17, No.4, pp. 410-419, 2005.
[38] T. Senoo, Y. Yamakawa, Y. Watanabe, H. Oku, and M. Ishikawa, "High-Speed Vision and its Application Systems," J. of Robotics and Mechatronics, Vol.26, No.3, pp. 287-301, 2014.

Name:
Manabu Hashimoto

Affiliation:
School of Engineering, Chukyo University

Address:
101-2 Yagoto-Honmachi, Showa-ku, Nagoya, Aichi 466-8666, Japan
Brief Biographical History:
1985 Graduated from Osaka University
1987 Graduated from Graduate School, Osaka University
1987- Joined Mitsubishi Electric Corporation
2008- Joined Chukyo University
Main Works:
• S. Akizuki, M. Iizuka, and M. Hashimoto, "'Affordance'-focused Features for Generic Object Recognition," ECCV 2nd Int. Workshop on Recovering 6D Object Pose, 2016.
• S. Akizuki and M. Hashimoto, "Stable Position and Pose Estimation of Industrial Parts using Evaluation of Observability of 3D Vector Pairs," J. of Robotics and Mechatronics, 2015.
Membership in Academic Societies:
• The Institute of Electrical and Electronics Engineers (IEEE)
• The Robotics Society of Japan (RSJ)




Name:
Yukiyasu Domae

Affiliation:
Principal Researcher, Advanced Technology
R&D Center, Mitsubishi Electric Corporation

Address:
8-1-1 Tsukaguchi, Hon-machi, Amagasaki, Hyogo 661-8661, Japan
Brief Biographical History:
2008- Joined Mitsubishi Electric Corp.
2012 Received Ph.D. in Information Science from Hokkaido University
2015- Joined National Institute of Advanced Industrial Science and Technology (AIST)
Main Works:
• “Fast Graspability Evaluation on Single Depth Maps for Bin Picking
with General Grippers,” Proc. of ICRA, pp. 1997-2004, 2014.
• “Development of Production Robot System that can Assemble Products
with Cable and Connector,” J. of Robotics and Mechatronics (JRM),
Vol.23, No.6, pp. 939-950, 2011.
Membership in Academic Societies:
• The Robotics Society of Japan (RSJ)
• The Japan Society for Precision Engineering (JSPE)
• The Institute of Electronics, Information and Communication Engineers
(IEICE)
• Information Processing Society of Japan (IPSJ)
• The Institute of Electrical and Electronics Engineers (IEEE)

Name:
Shun’ichi Kaneko

Affiliation:
Hokkaido University, Graduate School of Information Science and Technology

Address:
Kita-14, Nishi-9, Kita-ku, Sapporo 060-0814, Japan
Brief Biographical History:
1991-1996 Associate Professor, Tokyo University of Agriculture and
Technology
1996-2004 Full Professor, Hokkaido University
Main Works:
• “Robust image registration by increment sign correlation,” Pattern
Recognition, Pergamon Press, 2002.
Membership in Academic Societies:
• The Japan Society for Precision Engineering (JSPE)


