Abstract

We present a new method for computing the dimensions of boxes from single perspective projection images in real time. Given a picture of a box, acquired with a camera whose intrinsic parameters are known, the dimensions of the box are computed from the extracted box silhouette and the projection of two parallel laser beams on one of its visible faces. We also present a statistical model for background removal that works with a moving camera, and an efficient voting scheme for identifying approximately collinear segments in the context of a Hough Transform. We demonstrate the proposed approach and algorithms by building a prototype of a scanner for computing box dimensions and using it to automatically compute the dimensions of real boxes. The paper also presents some statistics over measurements obtained with our scanner prototype.

Figure 1. Scanner prototype: (left) its operation; (right) camera's view with recovered edges and vertices.
1 Introduction
The ability to measure the dimensions of three-dimensional objects directly from images has many practical applications, including quality control, surveillance, analysis of forensic records, storage management and cost estimation. Unfortunately, unless some information relating distances measured in image space to distances measured in 3D is available, the problem of making measurements directly on images is not well defined. This is a result of the inherent ambiguity of perspective projection caused by the loss of depth information.

This paper presents a method for computing box dimensions from single perspective projection images in a completely automatic way. This can be an invaluable tool for companies that manipulate boxes in their day-to-day operations, such as couriers, airlines and warehouses. The approach uses information extracted from the silhouette of the target boxes and can be applied when at least two of their faces are visible, even when the target box is partially occluded by other objects in the scene (Figure 1). We eliminate the inherent ambiguity associated with perspective images by projecting two parallel laser beams, apart from each other by a known distance and perpendicular to the camera's image plane, onto one of the visible faces of the box. We demonstrate this technique by building a scanner prototype for computing box dimensions and using it to compute the dimensions of boxes in real time (Figure 1).

The main contributions of this paper include:

• An algorithm for computing the dimensions of boxes in a completely automatic way in real time (Section 3);

• An algorithm for extracting box silhouettes in the presence of partial occlusion of the box edges (Section 3);

• A statistical model for detecting background pixels under different lighting conditions, for use with a moving camera (Section 4);

• An efficient voting scheme for identifying approximately collinear line segments using a Hough Transform (Section 5).

2 Related Work
Many optical devices have been created for making measurements in the real world. Those based on active techniques project some kind of energy onto the surfaces of the target objects and analyze the reflected energy. Examples of active techniques include optical triangulation [1] and laser range finding [18] to capture the shapes of objects at proper scale [13], and ultrasound to measure distances [6]. In contrast, passive techniques rely only on the use of cameras for extracting the three-dimensional structure of a scene and are primarily based on the use of stereo [14]. In order to achieve metric reconstruction [9], both optical triangulation and stereo-based systems require careful calibration. For optical triangulation, several images of the target object with a superimposed moving pattern are usually required for more accurate reconstruction.

Labeling schemes for trihedral junctions [3, 11] have been used to estimate the spatial orientation of polyhedral objects from images. These techniques tend to be computationally expensive when too many junctions are identified. Additional information from the shading of the objects can be used to improve the labeling process [21]. Silhouettes have been used in both computer vision and computer graphics for object shape extraction [12, 17]. These techniques require precise camera calibration and use silhouettes obtained from multiple images to define a set of cones whose intersections approximate the shapes of the objects.

Criminisi et al. [4] presented a technique for making 3D affine measurements from a single perspective image. They show how to compute distances between planes parallel to a reference one. If the distance from some scene element to the reference plane is known, it is possible to compute the distances between scene points and the reference plane. The technique requires user interaction and cannot be used for computing dimensions automatically. Photogrammetrists have also made measurements based on single images. However, these techniques can only be applied to planar objects and require user intervention.

In a work closely related to ours, Lu [16] described a method for finding the dimensions of boxes from single gray-scale images. In order to simplify the task, Lu assumes that the images are acquired using parallel orthographic projection and that three faces of the box are visible simultaneously. The computed dimensions are approximately correct only up to a scaling factor. Also, special care is required to distinguish the actual box edges from lines in the box texture, which prevents the method from performing in real time.

Our approach computes the dimensions of boxes from single perspective projection images, producing metric reconstructions in real time and in a completely automatic way. The method can be applied to boxes with arbitrary textures and can be used when only two faces of the box are visible, even when the edges of the target box are partially occluded by other objects in the scene.

3 Computing Box Dimensions

We model boxes as parallelepipeds, although real boxes can present many imperfections (e.g., bent edges and corners, asymmetries, etc.). The dimensions of a parallelepiped can be computed from the 3D coordinates of four of its non-coplanar vertices. Conceptually, the 3D coordinates of the vertices of a box can be obtained by intersecting rays, defined by the camera's center and the projections of the box vertices on the camera's image plane, with the planes containing the actual faces of the box in 3D. Thus, before we can compute the dimensions of a given box (Section 3.3), we need to find the projections of the vertices on the image (Section 3.1), and then find the equations of the planes containing the box faces in 3D (Section 3.2).

In the following derivations, we assume that the origin of the image coordinate system is at the center of the image, with the X-axis growing to the right and the Y-axis growing down, and that the imaged boxes have three visible faces. The case involving only two visible faces is similar. Also, we assume that the images used for computing the dimensions were obtained through linear projection (i.e., using a pinhole camera). Although images obtained with real cameras contain some amount of radial distortion, such distortions can be compensated for with the use of simple warping procedures [9].

3.1 Finding the Projections of the Vertices

The projections of the vertices can be obtained as the corners of the box silhouette. Although edge detection techniques [2] could be used to find the box silhouette, these algorithms tend to be sensitive to the presence of other high-frequency content in the image. In order to minimize the occurrence of spurious edges and support the use of boxes with arbitrary textures, we perform silhouette detection using a model for the background pixels. Since the images are acquired using a handheld camera, proper modeling of the background pixels is required and will be discussed in detail in Section 4.

However, as shown in Figure 2 (a, b and c), a naive approach that just models the background and applies simple image processing operations, like background removal and high-pass filtering, does not properly identify the silhouette pixels of the target box (selected by the user by pointing the laser beams onto one of its faces). This is because the scene may contain other objects whose silhouettes possibly overlap with that of the target box. Also, the occurrence of some misclassified pixels (see Figure 2, c) may lead to the detection of spurious edges. Thus, a suitable method was developed to deal with these problems. The steps of our algorithm are shown in Figure 2 (a, d, e, f and g).
Figure 2. Identifying the target box silhouette. Naive approach: (b) background segmentation, followed by (c) high-pass filter (note the spurious "edge" pixels). Proposed approach: (d) contouring of the foreground region, (e) contour segmentation, (f) grouping candidate segments for the target box silhouette, and (g) recovery of supporting lines for silhouette edges and vertices.
The target box silhouette is obtained starting from one of the laser dots, finding a silhouette pixel and using a contour-following procedure [8]. The seed silhouette pixel for the contour-following procedure is found by stepping from the laser dot within the target foreground region and checking whether the current pixel matches the background model. In order to be a valid silhouette, both laser dots need to fall inside the contoured region. Notice that this procedure produces a much cleaner set of border pixels (Figure 2, d) compared to the results shown in Figure 2 (c). But the resulting silhouette may include overlapping objects, and we need to identify which border pixels belong to the target box. To facilitate the handling of the border pixels, the contour is subdivided into its most perceptually significant straight line segments [15] (Figure 2, e). Then, the segments resulting from the clipping of a foreground object against the limits of the frame (e.g., segments h and i in Figure 2, e) are discarded. Since a box silhouette defines a convex polygon, the remaining segments whose two endpoints are not visible from both laser dots can also be discarded. This test is performed using a 2D BSP-tree [7]. In the example of Figure 2, only segments c, d, e, o, p and q pass this test.

Still, there is no guarantee that all the remaining segments belong to the target box silhouette. In order to restrict the number of possible combinations, the remaining chains of segments defining convex fragments are grouped (e.g., groups A and B in Figure 2, f). We then try to find the largest combination of groups into valid portions of the silhouette. In order to be considered a valid combination, the groups must satisfy the following validation rules: (i) they must characterize a convex polygon; (ii) the silhouette must have six edges (the silhouette of a parallelepiped with at least two visible faces); (iii) the laser dots must be on the same box face; and (iv) the computed lengths for pairs of parallel edges in 3D must be approximately the same. If more than one combination of groups passes the validation tests, the system discards this ambiguous data and starts processing a new frame (our system is capable of processing frames at a rate of about 34 fps).
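To make the validation concrete, here is a minimal Python/NumPy sketch of rules (i) and (ii), together with a convex point-in-polygon test as a cheap 2D stand-in for rule (iii). The function names and array conventions are ours, not the paper's; rule (iv) requires the 3D machinery of Sections 3.2 and 3.3 and is omitted here.

```python
import numpy as np

def cross_z(a, b):
    """z-component of the 2D cross product, applied row-wise."""
    return a[..., 0] * b[..., 1] - a[..., 1] * b[..., 0]

def is_convex_hexagon(poly):
    """Rules (i) and (ii): the candidate must be a convex polygon with
    exactly six edges. poly: (n, 2) array of vertices in order."""
    if len(poly) != 6:
        return False
    e = np.roll(poly, -1, axis=0) - poly            # edge vectors
    turns = cross_z(e, np.roll(e, -1, axis=0))      # consecutive turn signs
    return bool(np.all(turns > 0) or np.all(turns < 0))

def contains_point(poly, p):
    """Convex point-in-polygon test; a cheap 2D stand-in for rule (iii),
    requiring both laser dots to fall inside the candidate silhouette."""
    e = np.roll(poly, -1, axis=0) - poly
    s = cross_z(e, np.asarray(p, float) - poly)
    return bool(np.all(s >= 0) or np.all(s <= 0))
```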
Once the box silhouette is known, the projections of the six vertices are obtained by intersecting pairs of adjacent supporting lines for the silhouette edges (Figure 2, g). Section 5 discusses how to obtain those supporting lines.

3.2 Computing the Plane Equations

The set of all parallel lines in 3D sharing the same direction intersect at a point at infinity, whose image under perspective projection is called a vanishing point ω. The line defined by all vanishing points from all sets of parallel lines on a plane Π is called the vanishing line λ of Π. The normal vector to Π in a given camera's coordinate system can be obtained by multiplying the transpose of the camera's intrinsic-parameter matrix by the coefficients of λ [9]. Since the resulting vector is not necessarily a unit vector, it needs to be normalized. Equations (1) and (2) show the relationship among the vanishing points ωi, vanishing lines λi and the supporting lines ej for the edges that coincide with the imaged silhouette of a parallelepiped with three visible faces. The supporting lines are ordered clockwise.
$$\omega_i = e_i \times e_{i+3} \quad (1)$$

$$\lambda_i = \omega_i \times \omega_{(i+1) \bmod 3} \quad (2)$$

where 0 ≤ i ≤ 2, 0 ≤ j ≤ 5, λi = (aλi, bλi, cλi)^T and × is the cross product operator. The normal NΠi to plane Πi is then given by

$$N_{\Pi_i} = \frac{R K^T \lambda_i}{\left\| R K^T \lambda_i \right\|} \quad (3)$$

where NΠi = (AΠi, BΠi, CΠi), 0 ≤ i ≤ 2. K is the matrix that models the intrinsic camera parameters [9] and R is a reflection matrix (Equation 4) used to make the Y-axis of the image coordinate system grow in the up direction:

$$K = \begin{bmatrix} f/s_x & \gamma & o_x \\ 0 & f/s_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \qquad R = \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad (4)$$

In Equation (4), f is the focal length, and sx and sy are the dimensions of the pixel in centimeters. γ, ox and oy represent the skew and the coordinates of the principal point, respectively.
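As an illustration of Equations (1)-(4), the following Python/NumPy sketch maps the six supporting lines, given as homogeneous line coefficients ordered clockwise, to the three face normals. The function names are ours; the supporting lines themselves come from the voting scheme of Section 5.

```python
import numpy as np

def intrinsics(f, sx, sy, gamma, ox, oy):
    """K from Equation (4); f is the focal length, sx and sy the pixel
    dimensions, gamma the skew and (ox, oy) the principal point."""
    return np.array([[f / sx, gamma,  ox],
                     [0.0,    f / sy, oy],
                     [0.0,    0.0,    1.0]])

R = np.diag([1.0, -1.0, 1.0])   # reflection matrix of Equation (4)

def face_normals(e, K):
    """e: (6, 3) array of supporting lines in homogeneous coordinates,
    ordered clockwise. Returns the three unit normals of Eq. (3)."""
    normals = []
    for i in range(3):
        w_i  = np.cross(e[i], e[i + 3])                     # Eq. (1)
        w_i1 = np.cross(e[(i + 1) % 3], e[(i + 1) % 3 + 3])
        lam  = np.cross(w_i, w_i1)                          # Eq. (2)
        n = R @ K.T @ lam                                   # Eq. (3), unnormalized
        normals.append(n / np.linalg.norm(n))
    return np.array(normals)    # rows are (A, B, C) for each visible face
```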
Figure 3. Top view of a scene. Two laser beams apart in 3D by dlb project onto one box face at points P0 and P1, whose distance in 3D is dld. α is the angle between −L and NXZ.

Once we have NΠi, finding DΠi, the fourth coefficient of the plane equation, is equivalent to solving the projective ambiguity and requires the introduction of one more constraint. Thus, consider the situation depicted in 2D in Figure 3 (right), where two laser beams, parallel to each other and to the camera's XZ plane, are projected onto one of the faces of the box. Let the 3D coordinates of the laser dots, defined with respect to the camera coordinate system, be P0 = (XP0, YP0, ZP0)^T and P1 = (XP1, YP1, ZP1)^T, respectively (Figure 3, left). Since P0 and P1 are on the same plane Π, one can write

$$A_\Pi X_{P_0} + B_\Pi Y_{P_0} + C_\Pi Z_{P_0} = A_\Pi X_{P_1} + B_\Pi Y_{P_1} + C_\Pi Z_{P_1} \quad (5)$$

Using the linear projection model and given pi = (xpi, ypi, 1)^T, the homogeneous coordinates of the pixel associated with the projection of point Pi, one can reproject pi on the plane Z = 1 (in 3D) using

$$p'_i = R K^{-1} p_i \quad (6)$$

and express the 3D coordinates of the laser dots on the face of the box as

$$X_{P_i} = x'_{p_i} Z_{P_i}, \qquad Y_{P_i} = y'_{p_i} Z_{P_i} \qquad \text{and} \qquad Z_{P_i} \quad (7)$$

where p'_i = (x'pi, y'pi, 1)^T. Substituting the expressions for XP0, YP0, XP1 and YP1 (Equation 7) in Equation (5) and solving for ZP0, we obtain

$$Z_{P_0} = k Z_{P_1} \quad (8)$$

where

$$k = \frac{A_\Pi x'_{p_1} + B_\Pi y'_{p_1} + C_\Pi}{A_\Pi x'_{p_0} + B_\Pi y'_{p_0} + C_\Pi} \quad (9)$$

Now, let dlb and dld be the distances, in 3D, between the two parallel laser beams and between the two laser dots projected onto one of the faces of the box, respectively (Figure 3). Section 6 discusses how to find the laser dots on the image. dld can be directly computed from NΠ, the normal vector of the face onto which the dots project, and the known distance dlb:

$$d_{ld} = \frac{d_{lb}}{\cos(\alpha)} = \frac{d_{lb}}{-(N_{XZ} \cdot L)} \quad (10)$$

where α is the angle between −L and NXZ, the normalized projection of NΠ onto the plane defined by the two laser beams. By construction, such a plane is parallel to the camera's XZ plane. Therefore, NXZ is obtained by dropping the Y coordinate of NΠ and normalizing the resulting vector. L = (0, 0, 1)^T is the vector representing the laser beam direction. dld can also be expressed as the Euclidean distance between the two laser dots in 3D:

$$d_{ld}^2 = (X_{P_1} - X_{P_0})^2 + (Y_{P_1} - Y_{P_0})^2 + (Z_{P_1} - Z_{P_0})^2 \quad (11)$$

Substituting Equations (7), (8) and (10) into (11) and solving for ZP1, one gets

$$Z_{P_1} = \sqrt{\frac{d_{ld}^2}{a k^2 - 2 b k + c}} \quad (12)$$

where a = (x'p0)² + (y'p0)² + 1, b = x'p0 x'p1 + y'p0 y'p1 + 1 and c = (x'p1)² + (y'p1)² + 1. (Indeed, substituting (7) and (8) into (11) gives d²ld = Z²P1 (a k² − 2 b k + c).) Given ZP1, the 3D coordinates of P1 can be computed as

$$P_1 = (X_{P_1}, Y_{P_1}, Z_{P_1})^T = (x'_{p_1} Z_{P_1},\; y'_{p_1} Z_{P_1},\; Z_{P_1})^T \quad (13)$$
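The chain from Equation (6) to Equation (13) can be summarized in a few lines. This sketch (our naming) assumes the two laser dot pixels, the unit normal of the face they hit, and the beam separation dlb are already known; detecting the dots is the subject of Section 6.

```python
import numpy as np

R = np.diag([1.0, -1.0, 1.0])          # reflection matrix of Equation (4)

def laser_dot_positions(p0, p1, N, K, d_lb):
    """Recover the 3D laser dot positions P0, P1 (Eqs. 6-13).
    p0, p1: pixel coordinates (x, y) of the two dots; N = (A, B, C) is the
    unit normal of the face they hit; d_lb is the beam separation."""
    Kinv = np.linalg.inv(K)
    pp0 = R @ Kinv @ np.array([p0[0], p0[1], 1.0])    # Eq. (6): p'_0
    pp1 = R @ Kinv @ np.array([p1[0], p1[1], 1.0])    # Eq. (6): p'_1
    A, B, C = N
    k = (A * pp1[0] + B * pp1[1] + C) / (A * pp0[0] + B * pp0[1] + C)  # Eq. (9)
    # Eq. (10): N_XZ drops the Y component of N and renormalizes; with
    # L = (0, 0, 1), the dot product N_XZ . L is just the Z entry of N_XZ.
    d_ld = d_lb / -(C / np.hypot(A, C))
    a = pp0[0] ** 2 + pp0[1] ** 2 + 1.0
    b = pp0[0] * pp1[0] + pp0[1] * pp1[1] + 1.0
    c = pp1[0] ** 2 + pp1[1] ** 2 + 1.0
    z1 = np.sqrt(d_ld ** 2 / (a * k ** 2 - 2.0 * b * k + c))           # Eq. (12)
    return pp0 * (k * z1), pp1 * z1                   # Eqs. (7), (8) and (13)
```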
The projective ambiguity can finally be removed by computing the DΠ coefficient for the plane equation of the face containing the two dots:

$$D_\Pi = -(A_\Pi X_{P_1} + B_\Pi Y_{P_1} + C_\Pi Z_{P_1}) \quad (14)$$

3.3 Computing the Box Dimensions

Having computed the plane equation, one can recover the 3D coordinates of the vertices of that face. For each such vertex v on the image, we compute v' using Equation (6). We then compute its corresponding ZV coordinate by substituting Equation (7) into the plane equation for the face. Given ZV, both the XV and YV coordinates are computed using Equation (7). Since all visible faces of the box share some vertices with each other, the D coefficients for the other faces of the box can also be obtained, allowing the recovery of the 3D coordinates of all vertices on the box silhouette, from which the dimensions are computed.

Although not required for computing the dimensions of the box, the 3D coordinates of the inner vertex (see Figure 1, right) can also be computed. Its 2D coordinates can be obtained as the intersection of three lines (Figure 1, right). Each such line is defined by a vanishing point and the silhouette vertex falling in between the two box edges used to compute that vanishing point. Since it is unlikely that these three lines will intersect exactly at one point, we approximate the intersection using least squares. Given the inner vertex's 2D coordinates, its corresponding 3D coordinates can be computed using the same algorithm used to compute the 3D coordinates of the other vertices.
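The vertex recovery just described amounts to intersecting a viewing ray with a face plane. A minimal sketch, with our naming, follows; plane_offset implements Equation (14) and can be reused to chain the D coefficients across faces through shared vertices.

```python
import numpy as np

R = np.diag([1.0, -1.0, 1.0])   # reflection matrix of Equation (4)

def vertex_3d(v, N, D, K):
    """Back-project image vertex v = (x, y) onto the plane N.X + D = 0.
    Eq. (6) gives the ray through v; substituting Eq. (7) into the plane
    equation fixes Z_V, and Eq. (7) then gives X_V and Y_V."""
    vp = R @ np.linalg.inv(K) @ np.array([v[0], v[1], 1.0])   # Eq. (6)
    z = -D / float(N @ vp)      # A x' Z + B y' Z + C Z + D = 0, solved for Z
    return vp * z               # (X_V, Y_V, Z_V)

def plane_offset(P, N):
    """Eq. (14): D coefficient of a face from a known 3D point on it,
    e.g., a laser dot or a shared, already-recovered vertex."""
    return -float(N @ P)
```

An edge length is then simply the Euclidean distance between two adjacent recovered vertices, e.g., np.linalg.norm(vertex_3d(va, N, D, K) - vertex_3d(vb, N, D, K)).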
The geometry of a real box is somewhat different from that of a parallelepiped because of imperfections introduced during the construction process and handling. For instance, bent edges, different sizes for two parallel edges of the same face, lack of parallelism between faces expected to be parallel, and warped corners are not unlikely to be found in practice. Such inconsistencies lead to errors in the orientation of the silhouette edges, which are cascaded into the computation of the box dimensions.

In order to estimate the inherent inaccuracies of the proposed algorithm, we implemented a simulator that performs the exact same computations, but on images generated …

4 Modeling the Background

In the first step, we compute E, the average color of all pixels in images Ii, and the eigenvalues and eigenvectors associated with the colors of those pixels. E and the eigenvector associated with the highest eigenvalue define an axis in the RGB color space, called the chromaticity axis. The chromaticity distortion d of a given color C can be computed as the distance from C to the chromaticity axis.

After discarding the pixels whose projections on the chromaticity axis have at least one saturated channel (they lead to misclassification of bright foreground pixels), we divide the chromaticity axis into m slices (Figure 4). For each slice, we compute d̄j and σd̄j, the mean and the standard deviation, respectively, of the chromaticity distortion of the pixels in the slice. Then, we compute a threshold dTj for the maximum acceptable slice chromaticity distortion considering a confidence level of 99% as dTj = d̄j + 2.33 σd̄j.

Finally, the coefficients of the polynomial background model are computed by fitting a curve through the dT values at the centers of the slices. Once the coefficients have been computed, the dT values are discarded and the tests are performed against the polynomial. Figure 4 illustrates the case of a color Ck being tested against the background color model. C'k is the projection of Ck on the chromaticity axis. In this example, as the distance between Ck and the chromaticity axis is bigger than the threshold defined by the polynomial, Ck will be classified as foreground.
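A sketch of this background model follows. The slice count m and the degree of the fitted polynomial are our assumptions (neither is fixed in the text recovered here), as are the function names; colors are assumed to be 8-bit RGB.

```python
import numpy as np

def fit_background_model(colors, m=32, degree=3):
    """Fit the chromaticity-axis background model sketched above.
    colors: (n, 3) float array of RGB background samples in [0, 255].
    m (slice count) and the polynomial degree are our assumptions."""
    E = colors.mean(axis=0)
    # Principal direction of the color distribution = chromaticity axis.
    _, _, Vt = np.linalg.svd(colors - E, full_matrices=False)
    axis = Vt[0]
    t = (colors - E) @ axis                 # position along the axis
    proj = E + np.outer(t, axis)            # projections C' onto the axis
    ok = np.all((proj > 0.0) & (proj < 255.0), axis=1)  # drop saturated projections
    t, d = t[ok], np.linalg.norm(colors[ok] - proj[ok], axis=1)
    edges = np.linspace(t.min(), t.max(), m + 1)
    centers, d_T = [], []
    for j in range(m):                      # per-slice thresholds d_Tj
        sel = (t >= edges[j]) & (t <= edges[j + 1])
        if sel.sum() < 2:
            continue
        centers.append(0.5 * (edges[j] + edges[j + 1]))
        d_T.append(d[sel].mean() + 2.33 * d[sel].std())  # 99% confidence level
    coeffs = np.polyfit(centers, d_T, degree)            # polynomial model
    return E, axis, coeffs

def is_foreground(C, E, axis, coeffs):
    """Test a single RGB color against the fitted model."""
    t = float((np.asarray(C, float) - E) @ axis)
    d = np.linalg.norm(np.asarray(C, float) - (E + t * axis))
    return d > np.polyval(coeffs, t)
```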
Table 1. Confidence intervals for the measurements for the box in Figure 6 (top right).

CI(γ)      dim 1            dim 2            dim 3
CI(80%)    [27.99, 29.60]   [21.83, 22.60]   [15.12, 15.82]
CI(90%)    [27.76, 29.83]   [21.72, 22.72]   [15.02, 15.92]
CI(95%)    [27.55, 30.03]   [21.62, 22.81]   [14.93, 16.01]
CI(99%)    [27.39, 30.20]   [21.54, 22.89]   [14.86, 16.08]

Another estimate of the error can be expressed as the relative error σ/x̄. Table 2 shows data about the dimensions of five other boxes and the relative errors in the measurements obtained with our scanner prototype. The errors were computed with respect to the average of the computed dimensions over a set of 20 measurements. In these experiments, the operator tried to keep the device still while the samples were collected. The relative errors varied from 0.20% up to 11.20%, which is in accordance with the accuracy predicted by the experiment summarized in Table 1.

Table 2. Relative errors for five real boxes.

Real Size (cm)                 Relative Error (%)
dim 1   dim 2   dim 3    n     dim 1   dim 2   dim 3
35.5    32.0    26.9     68    1.44    2.68    1.40
28.6    24.1    16.8     61    6.81    0.80    11.20
36.0    25.2    13.8     50    5.19    10.15   9.43
29.9    29.1    22.9     69    0.20    5.47    10.22
48.2    45.5    28.3     20    3.98    4.55    0.91
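For reference, a small sketch of the statistics behind Tables 1 and 2: the relative error σ/x̄ and a γ-level confidence interval over repeated measurements of a single dimension. We assume a Student-t interval here; the paper does not spell out its interval construction in the text recovered above.

```python
import numpy as np
from scipy import stats

def measurement_stats(samples, gamma=0.95):
    """Relative error (sigma / mean) and a gamma-level confidence interval
    for repeated measurements of one box dimension, in cm."""
    x = np.asarray(samples, dtype=float)
    n, mean, s = len(x), x.mean(), x.std(ddof=1)
    half = stats.t.ppf(0.5 + gamma / 2.0, df=n - 1) * s / np.sqrt(n)
    return s / mean, (mean - half, mean + half)

# Example: rel, ci = measurement_stats([28.6, 28.9, 28.4, 28.8], gamma=0.95)
```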
8 Conclusions and Future Work

We have presented a completely automatic approach for computing the dimensions of boxes from single perspective projection images in real time. The approach uses information extracted from the silhouette of the target box and removes the projective ambiguity with the use of two parallel laser beams. We demonstrated the effectiveness of the proposed techniques by building a prototype scanner and using it to compute the dimensions of several real boxes, even when the edges of the target box are partially occluded by other objects and under different lighting conditions. We also presented a statistical discussion of the measurements made with our scanner prototype.

We have also introduced an algorithm for extracting box silhouettes in the presence of partially occluded edges, an efficient voting scheme for grouping approximately collinear segments using a Hough Transform, and a statistical model for the background that works with a moving camera. Our algorithm for computing box dimensions can still be used in applications requiring heterogeneous backgrounds. In these situations, background detection can be performed using a technique like the one described in [10]. In this case, the camera should remain static while the boxes are moved on some conveyor belt.

We believe that these ideas may lead to optimizations of several procedures that are currently based on manual measurements of box dimensions. We are currently exploring ways of using arbitrary backgrounds, the use of a single laser beam, and analyzing the error propagation through the various stages of the algorithm.

References

[1] P. Besl. Active Optical Range Imaging Sensors. In J. Sanz (editor), Advances in Machine Vision, pages 1–63. Springer-Verlag, 1988.
[2] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, November 1986.
[3] M. B. Clowes. On seeing things. Artificial Intelligence, 2:79–116, 1971.
[4] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology. In Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV-99), volume I, pages 434–441, Kerkyra, Greece, September 20-27, 1999. IEEE.
[5] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1), 1972.
[6] F. Figueroa and A. Mahajan. A robust method to determine the coordinates of a wave source for 3-D position sensing. ASME Journal of Dynamic Systems, Measurements and Control, 116:505–511, September 1994.
[7] H. Fuchs et al. On visible surface generation by a priori tree structures. In Proc. of SIGGRAPH 1980, pages 124–133.
[8] J. Gauch. KUIM, image processing system. http://www.ittc.ku.edu/~jgauch/research, 2003.
[9] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[10] T. Horprasert, D. Harwood, and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. In Proc. of the 7th IEEE ICCV-99, FRAME-RATE Workshop, 1999.
[11] D. A. Huffman. Impossible objects as nonsense sentences. In Machine Intelligence 6, pages 295–324. Edinburgh University Press, 1971.
[12] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. on PAMI, 16(2):150–162, February 1994.
[13] M. Levoy et al. The digital Michelangelo project: 3D scanning of large statues. In Proc. of SIGGRAPH 2000, pages 131–144, 2000.
[14] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, September 1981.
[15] D. G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:355–395, March 1987.
[16] K. Lu. Box dimension finding from a single gray-scale image. Master's thesis, SUNY Stony Brook, New York, 2000.
[17] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan. Image-based visual hulls. In Proc. of SIGGRAPH 2000, pages 369–374, 2000.
[18] L. Nyland et al. The impact of dense range data on computer graphics. In Proc. of Multi-View Modeling and Analysis Workshop (MVIEW99) (part of IEEE CVPR99), 1999.
[19] Point Grey Research Inc. Dragonfly IEEE-1394 Digital Camera System, 2.0.1.10 edition, 2002.
[20] P. Vlahos. Composite color photography. U.S. Patent 3,158,477, 1964.
[21] D. L. Waltz. Generating semantic descriptions from drawings of scenes with shadows. Technical Report MAC-TR-271, MIT, Cambridge, MA, USA, 1972.