Head Pose Determination from One Image Using a Generic Model
Ikuko Shimizu¹,³   Zhengyou Zhang²,³   Shigeru Akamatsu³   Koichiro Deguchi¹
¹ Faculty of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan
² INRIA, 2004 route des Lucioles, BP 93, F-06902 Sophia-Antipolis Cedex, France
³ ATR HIP, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
e-mail: [email protected]
    \lambda \tilde{x} = P \tilde{X}, \quad \text{or simply} \quad \tilde{x} \simeq P \tilde{X},    (1)

where \lambda is an arbitrary scale factor, P is a 3 \times 4 matrix called the perspective projection matrix, \tilde{X} = (X, Y, Z, 1)^t, and \tilde{x} = (u, v, 1)^t. The matrix P can be decomposed as

    P = A T.    (2)

The matrix A maps the coordinates of a 3D point to the image coordinates and can be written as

    A = \begin{pmatrix} \alpha_u & 0 & u_0 & 0 \\ 0 & \alpha_v & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}.    (3)

\alpha_u and \alpha_v are the products of the focal length with the horizontal and vertical scale factors, respectively. u_0 and v_0 are the coordinates of the principal point of the camera, i.e., the intersection of the optical axis with the image plane. For simplicity of computation, both u_0 and v_0 are assumed to be 0 in our case, because the principal point is usually at the center of the image.

The matrix T denotes the positional relationship between the world coordinate system and the camera coordinate system and can be written as

    T = \begin{pmatrix} R & t \\ 0^t & 1 \end{pmatrix}.    (4)

The projection x^{W_l}_j(P) of a model point X^{W_l}_j is then given by

    x^{W_l}_j(P) = (a_1/a_3, \; a_2/a_3)^t \quad \text{with} \quad (a_1, a_2, a_3)^t = P \tilde{X}^{W_l}_j.    (5)

3. Generic Model of a Human Head

We use a generic model of the human head which is able to take account of the shape differences between individuals and the changes of facial expression. This section explains this generic model.

The generic model is constructed from the results of measuring the heads of many people. To do so, we need a method for sampling points consistently for all faces; that is, we need to know which point on one face corresponds to a given point on another face. Many methods have been proposed for such a purpose, and we use the resampling method [4] developed in our laboratory. This method uses several feature points (such as the corners of the eyes, the vertex of the nose, and so on) as reference points. Using these reference points, the shape of a face is segmented into several regions, and each region is then resampled. We choose the sample points using this method.

3.2. Edge Extraction in the Model

As mentioned earlier, we use two types of edges: stable edges and variable edges. Stable edges are extracted beforehand from the 2D image taken at the same time as the acquisition of the 3D data of a head; they are the contours of the eyes, lips, and eyebrows. We obtain their corresponding curves on the head by back-projecting them onto the 3D model. Variable edges, which are occluding contours and depend on the head pose and the camera parameters, are extracted whenever these parameters change. Figure 1 shows an example of images of the generic model with stable and variable edges. The stable edges (i.e., the eyes and lips) do not change under a change of the pose, while the variable edge (i.e., the contour of the face) changes whenever the pose changes.

4. Distance Between Curves

The distance defined in this section is minimized in the ICC method, which we will present in a later section, and it is also used for finding corresponding curves. The squared distance between a 2D curve C^I_k in the image and the projection of a curve C^W_l on the 3D model is defined by

    d(C^I_k, C^W_l(P)) = \frac{1}{N^I_k} \sum_{x^{I_k}_i \in C^I_k} \; \min_{x^{W_l}_j \in C^W_l(P)} d_m(x^{I_k}_i, x^{W_l}_j(P)),    (6)
where N^I_k is the number of points in C^I_k and d_m(x^{I_k}_i, x^{W_l}_j(P)) is the squared Mahalanobis distance:

    d_m(x^{I_k}_i, x^{W_l}_j(P)) = (x^{I_k}_i - x^{W_l}_j(P))^t \, M^{kl}_{ij} \, (x^{I_k}_i - x^{W_l}_j(P)),    (7)

    M^{kl}_{ij} = \left( \frac{\partial x^{W_l}_j(P)}{\partial X^{W_l}_j} \, V[X^{W_l}_j] \, \left( \frac{\partial x^{W_l}_j(P)}{\partial X^{W_l}_j} \right)^t \right)^{-1}.    (8)

It is possible to give other definitions of the distance between curves. Ours is based on the following assumptions:

- When the edges in the image corresponding to edges on the 3D model are found, the projected model curve contains the image curve.
- The generic model is sampled at a higher resolution than the image.
- The variance V[X] of each point can be different and anisotropic.

5. Finding Corresponding Curves by Relaxation

In this section, we explain the method for finding correspondences between 3D model curves and 2D image curves. This is done by matching 2D image curves C^I_k against model curves C^W_l(P_o) projected by P_o, where P_o is an arbitrary projection.

We assume that all of the eyes and lips are seen in the image. Therefore, the edges of a 2D image are expected to include the stable edges. However, they also include noisy edges caused by illumination, measurement error, and so on. Consequently, there are some correspondence ambiguities. We use relaxation techniques to resolve these ambiguities.

First, we find candidates for corresponding curves using the similarity of curvature. Curvature is not preserved under projection. However, because we assume the pose estimate P_o is reasonable, the curvature of the same curve should be similar. After finding candidates, we resolve the ambiguities by the relaxation method.

5.1. Finding Candidates for Corresponding Curves

Both the image edges and the projected model edges are segmented into equi-curvature curves. Candidates for corresponding pairs are found by evaluating the similarity of curvature.

The similarity of curvature s(k, l) is defined as

    s(k, l) = \frac{1}{1 + |c(C^I_k) - c(C^W_l(P_o))|},    (9)

where c(C) is the curvature of curve C.

s(k, l) has the following properties: (i) when two curves have exactly the same curvature, s(k, l) equals 1, and (ii) as the difference in curvature between two curves becomes larger, s(k, l) becomes smaller.

If the value of s(k, l) is higher than a threshold, the pair of curves (C^I_k, C^W_l(P_o)) is selected as a candidate pair.

5.2. Calculating the Strength of Match

If (C^I_k, C^W_l(P_o)) is a correct pair, many of the remaining model curves C^W_m have corresponding image curves C^I_n such that the position of C^I_n relative to C^W_m(P_o) is similar to that of C^I_k relative to C^W_l(P_o). We define the strength of match SM for the pair (C^I_k, C^W_l(P_o)) in a way similar to the one used for point pairs in [10].

5.3. Updating Corresponding Pairs of Curves

The strategy we use for updating corresponding pairs is called the "some-winners-take-all" strategy [10]. Consider the corresponding pairs having the highest strength of match for both the image and the model. These pairs are called potential matches and denoted by {P_i}. For {P_i}, two tables, T_SM and T_UA, are constructed.

T_SM saves the matching strength of each {P_i}, sorted in decreasing order. T_UA saves the value of U_A, which describes unambiguity and is defined as

    U_A = 1 - SM(2)/SM(1),    (10)

where SM(1) is the SM of {P_i} and SM(2) is the SM of the second-best candidate among the pairs which include the curves forming {P_i}. T_UA is also sorted in decreasing order.

The pairs are selected as "correct" matches if they are among the first q (> 50) percent of pairs in T_SM and the first q percent of pairs in T_UA. Using this method, the pairs which are well matched and unambiguous are selected.

6. Rough Estimation of a Head Pose

In this section, we explain the method for roughly estimating the head pose and the camera parameters, which are used as the initial guess in the refinement process.

To roughly estimate the head pose and the camera parameters, co-planar conics are used. Because the eyes and mouth are approximately on a single plane, the 3D stable edges of the model, such as the edges of the eyes and lips, are projected onto that plane.

We use the intersections and bi-tangent lines of the co-planar conics because they are preserved under projection [2]. At least one pair of co-planar conics is needed to determine all the parameters, but with a single pair two possibilities still remain in our case: the correct one and the upside-down one. Therefore, we use three pairs of conics: left eye and right eye, left eye and lips, and right eye and lips.

6.1. Projection to the Face Plane

The edge points of the eyes and lips are almost on one plane, called the face plane. Consider a coordinate system in which the face plane coincides with z = 0; we call such a coordinate system the plane coordinate system.

The 3D coordinates of a point projected to the face plane, expressed in the world coordinate system, and the coordinates (x_p, y_p, 0)^t of the same point in the plane coordinate system are related by

    \tilde{X} = \begin{pmatrix} R_p & t_p \\ 0^t & 1 \end{pmatrix} \begin{pmatrix} x_p \\ y_p \\ 0 \\ 1 \end{pmatrix} = T_p \begin{pmatrix} x_p \\ y_p \\ 0 \\ 1 \end{pmatrix},    (11)
where T_p denotes the positional relationship between the world coordinate system and the plane coordinate system.

From equations (1) and (11), we have

    \tilde{x} = H \begin{pmatrix} x_p \\ y_p \\ 1 \end{pmatrix},    (12)

where H is a 3 \times 3 matrix, given by

    H = \begin{pmatrix} \alpha_u r_{11} & \alpha_u r_{12} & \alpha_u t_1 \\ \alpha_v r_{21} & \alpha_v r_{22} & \alpha_v t_2 \\ r_{31} & r_{32} & t_3 \end{pmatrix},    (13)

where r_{ij} is the (i, j)-th component of R' = R R_p and t_i is the i-th component of t' = R t_p + t.

6.2. Intersection and Bi-tangent of Co-planar Conics

A conic in a 2D space is the set of points x that satisfy

    \tilde{x}^t Q \tilde{x} = 0,    (14)

where Q is a 3 \times 3 symmetric matrix. We fit a conic to the edge points of the right eye, the left eye, and the lips by the gradient-weighted least-squares fitting described in [9].

The intersection \tilde{m} of two conics Q_1 and Q_2 satisfies the simultaneous equations

    \tilde{m}^t Q_1 \tilde{m} = 0 \quad \text{and} \quad \tilde{m}^t Q_2 \tilde{m} = 0.    (15)

Denoting a bi-tangent line of the two conics Q_1 and Q_2 by \tilde{l}^t \tilde{x} = 0, \tilde{l} satisfies the simultaneous equations [2]

    \tilde{l}^t Q_1^{-1} \tilde{l} = 0 \quad \text{and} \quad \tilde{l}^t Q_2^{-1} \tilde{l} = 0.    (16)

\tilde{m} and \tilde{l} are obtained by solving quartic equations analytically.

6.3. Combinations of the Correspondence

There are no real intersections for these pairs of conics; the solutions of the quartic equation are therefore two complex-conjugate pairs. In the complex case, there are eight possibilities for corresponding the four intersection points of the image to the four points of the model, because conjugate pairs project to conjugate pairs under a real projection [2]. On the other hand, because all of the bi-tangent lines are real in this case, there are only four possibilities of correspondence. Therefore, there are 8 × 4 = 32 possible combinations for each pair of conics. When we use three pairs of conics, the number of all possible combinations is 32³ (= 32768).

We reduce the number of combinations. Because for a single pair of conics only two possibilities remain in our case (the true one and the upside-down one), we select two combinations for each pair of conics. Then, using these combinations for the three pairs of conics, all possible values of H are calculated by the linear least squares described in appendix A. The number of values of H to evaluate is thus much reduced, to 32 + 32 + 32 + 2³ (= 104).

We select the best one among all possible values of H by evaluating H; the method for evaluation is described in appendix B. From equation (13), the unknown parameters are obtained from the components of H (see appendix C). This is the initial guess for the refinement process.

7. Refinement of the Head Pose by the ICC (Iterative Closest Curve) Method

In this section, we explain the method for refining the head pose and the camera parameters using the initial guess obtained by the method described in the previous section. We employ the ICC method, which minimizes the distance between corresponding curves.

We use the correspondences of two types of edges in this process: stable ones and variable ones. The correspondences of the stable edge curves have been established by the method described in section 5. The variable edges of the generic model, e.g. the contour of the face, must be re-extracted whenever the parameters are updated, because these curves vary whenever the parameters change; however, the correspondence of the contour of the face is known.

Once the correspondences of the curves are established, the squared Mahalanobis distance between corresponding curves is minimized. We minimize the value of the function J:

    J = \sum_{k} d(C^I_k, C^{W_{l(k)}}(P)) + d(C^I_o, C^W_o(P))    (17)
      = \sum_{k} \frac{1}{N^I_k} \sum_{x^{I_k}_i \in C^I_k} \min_{x^{W_l}_j \in C^W_l(P)} d_m(x^{I_k}_i, x^{W_l}_j(P)) + \frac{1}{N^I_o} \sum_{x^{I_o}_i \in C^I_o} \min_{x^{W_o}_j \in C^W_o(P)} d_m(x^{I_o}_i, x^{W_o}_j(P)).    (18)

We minimize J to find P by iterating these two steps:

- For each image point x^{I_k}_i of each corresponding curve pair (C^I_k, C^W_l(P)), find the point x^{W_l}_j which minimizes d_m(x^{I_k}_i, x^{W_l}_j).
- Update P to minimize J by the Levenberg-Marquardt algorithm.

P includes the head pose and the camera parameters of equation (2). Instead of each component of P, we directly estimate eight parameters, i.e., three rotation parameters, three translation parameters, and two camera parameters.

7.1. Non-linear Minimization of the Distance between Curves

From equation (2), P is decomposed into a perspective projection and a rigid displacement. Non-linear minimization under the constraints of a rotation matrix is complicated. Therefore, we rewrite the rigid displacement part by using a 3D vector q as
    T \tilde{X} = R X + t    (19)
              = X + \frac{2}{1 + q^t q} \bigl( q \times X - (q \times X) \times q \bigr) + t.    (20)

The direction of q is equal to the rotation axis, and the norm of q is equal to tan(θ/2), where θ is the rotation angle. Using this equation, because the three components of q are independent, the minimization becomes much simpler.

Figure 2. (a) Extracted edges in an image of one woman's face and (b) edge curves of the eyes, lips, and eyebrows extracted through the correspondence between the model and the image.

Figure 3. Edges and conics of the eyes and lips and the result of the rough estimation using conics. (a) Edges of a woman's face and co-planar conics. (b) The result of the rough estimation using the conics of (a). The conics of the image are plotted in black and the projections of the model conics are plotted in red.
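For concreteness, the q-parametrization of equations (19)–(20) can be checked numerically. The sketch below is not code from the paper; it is a minimal NumPy illustration (function name of our own choosing) that applies the rotation formula of equation (20) directly.

```python
import numpy as np

def rotate_with_q(q, X):
    """Rotation of eq. (20): R X = X + 2/(1 + q.q) * (q x X - (q x X) x q),
    where q points along the rotation axis and |q| = tan(theta/2)."""
    q = np.asarray(q, dtype=float)
    X = np.asarray(X, dtype=float)
    qxX = np.cross(q, X)
    return X + 2.0 / (1.0 + q @ q) * (qxX - np.cross(qxX, q))

# 90-degree rotation about the z-axis: q = tan(45 deg) * z_axis = (0, 0, 1)
q = [0.0, 0.0, np.tan(np.pi / 4)]
print(rotate_with_q(q, [1.0, 0.0, 0.0]))  # -> approximately [0, 1, 0]
```

Because the three components of q are unconstrained, a non-linear optimizer can update them freely, which is exactly the simplification the text refers to.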
8. Experimental Result

We show in this section some preliminary results with the proposed technique.

Figure 1 shows the model edges constructed from measurements of 36 women's heads, all with neutral facial expression. Figure 2(a) shows the edges of an image of one woman; these edges are extracted by the method described in [6].

Figure 2(b) shows the extracted stable edge curves, i.e., the contours of the eyes, lips, and eyebrows. These edges are extracted by establishing the correspondence between model edges and image edges, as described in section 5.

Figure 3(a) shows the co-planar conics fitted to the contours of the eyes and lips in the image shown in figure 2(a). Figure 3(b) shows the result of the rough estimation; the conics of the model are plotted in red and the conics of the image in black.

The head pose and camera parameters of the image shown in Fig. 2(a) were then estimated. Figure 4 shows the projection of the generic model with the estimated parameters. The pose of the head shown in Fig. 2(a) and that in Fig. 4 are almost the same.

Figure 4. The result of the head pose estimation using ICC.

9. Conclusion

Head pose determination is very important for many applications such as human-computer interfaces and video conferencing. In this paper, we have proposed a new method for accurately estimating the head pose from only one image. To deal with the shape variation of heads among individuals and with different facial expressions, we use a generic 3D model of the human head, built through statistical analysis of range data of many heads. In particular, we use a set of 3D curves to model the contours of the eyes, lips, and eyebrows.

We have proposed the iterative closest curve (ICC) matching method, which estimates the pose directly by iteratively minimizing the squared Mahalanobis distance between the projected model curves and the corresponding curves in the image. The curve correspondence is established by a relaxation technique. Because a curve contains much richer information than a point, curve correspondences can be established more robustly and with less ambiguity; therefore, pose estimation based on ICC is believed to be more accurate than that based on the well-known ICP.

Furthermore, our technique does not assume that the internal parameters of the camera are known. This provides more flexibility in practice because an uncalibrated camera can be used. The unknown parameters are recovered by our technique thanks to the generic 3D model.

Preliminary experimental results show that (i) an accurate head pose can be estimated by our method using the generic model and (ii) the generic model can deal with the shape differences between individuals. The accuracy of the pose estimation depends strongly on whether the image curves can be successfully extracted. More experiments need to be carried out for different facial expressions and for cluttered backgrounds.

We believe that the ICC method is useful not only for 3D-2D pose estimation but also for 2D-2D or 3D-3D pose estimation.
Acknowledgment:
We thank K. Isono for his help in the presentation of experimental data.

References

[2] J.L.Mundy. Relative Motion and Pose from Arbitrary Plane Curves. IVC, 10(4):250–262, 1992.
[3] D.J.Beymer. Face Recognition Under Varying Pose. In CVPR94, pages 756–761, 1994.
[4] K.Isono and S.Akamatsu. A Representation for 3D Faces with Better Feature Correspondence for Image Generation using PCA. Technical Report HIP96-17, IEICE, 1996.
[5] P.J.Besl and N.D.McKay. A Method for Registration of 3-D Shapes. IEEE Trans. PAMI, 14(2):239–256, 1992.
[6] R.Deriche. Using Canny's Criteria to Derive a Recursively Implemented Optimal Edge Detector. IJCV, 1(2):167–187, 1987.
[7] T.S.Jebara and A.Pentland. Parametrized Structure from Motion for 3D Adaptive Feedback Tracking of Faces. In CVPR97, pages 144–150, 1997.
[8] Z.Zhang. Iterative Point Matching for Registration of Free-Form Curves and Surfaces. IJCV, 13(2):119–152, 1994.
[9] Z.Zhang. Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting. IVC, 15:59–76, 1997.
[10] Z.Zhang, R.Deriche, O.Faugeras and Q.T.Luong. A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry. AI Journal, 78:87–119, 1995.

A. Linear Estimation of H

Assume the image point (x, y) and the object point (x_p, y_p) are a corresponding pair. We rewrite the components of H as

    H = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{pmatrix}.    (21)

By eliminating the scale factor in equation (12), we get

    a x_p + b y_p + c - g x_p x - h y_p x = x,    (22)
    d x_p + e y_p + f - g x_p y - h y_p y = y.    (23)

From equations (22) and (23), the components of H are calculated by the linear least-squares algorithm.

B. Eliminating Ambiguous Solutions for H

We select the best correspondence combination as the one which minimizes the following criterion function. If H is correct, a conic Q_P on the face plane and the corresponding image conic Q_I satisfy

    Q_P = \lambda^2 H^t Q_I H.    (24)

Using this relation, the criterion function e(H) is defined as [2]:

    e(H) = (I_3(Q^P_{m1}, Q^P_1) - 3)^2 + (I_4(Q^P_{m1}, Q^P_1) - 3)^2 + (I_3(Q^P_{m2}, Q^P_2) - 3)^2 + (I_4(Q^P_{m2}, Q^P_2) - 3)^2,    (27)

where

    I_3(A, B) = \mathrm{trace}\bigl[ ((1/\det A) A)^{-1} (1/\det B) B \bigr], \quad I_4(A, B) = \mathrm{trace}\bigl[ ((1/\det B) B)^{-1} (1/\det A) A \bigr].    (28)

C. Decomposition of H

From equation (13), the head pose and the camera parameters are determined from the components of H. Because the r_{ij} in equation (13) are the components of a rotation matrix, we have

    r_{11}^2 + r_{21}^2 + r_{31}^2 = 1,    (29)
    r_{12}^2 + r_{22}^2 + r_{32}^2 = 1,    (30)
    r_{11} r_{12} + r_{21} r_{22} + r_{31} r_{32} = 0.    (31)

We use h_{ij} to denote the (i, j)-th component of H. From equations (13) and (31), we have

    h_{11} h_{12} / \alpha_u^2 + h_{21} h_{22} / \alpha_v^2 + h_{31} h_{32} = 0.    (32)

Because H is estimated only up to a scale factor \lambda (appendix A fixes h_{33} = 1), equations (29) and (30) give

    \lambda^2 (h_{11}^2 / \alpha_u^2 + h_{21}^2 / \alpha_v^2 + h_{31}^2) = 1,    (33)
    \lambda^2 (h_{12}^2 / \alpha_u^2 + h_{22}^2 / \alpha_v^2 + h_{32}^2) = 1.    (34)

Then, by eliminating \lambda^2, we have

    (h_{11}^2 - h_{12}^2) / \alpha_u^2 + (h_{21}^2 - h_{22}^2) / \alpha_v^2 + h_{31}^2 - h_{32}^2 = 0.    (35)

Let \beta_u = 1 / \alpha_u^2 and \beta_v = 1 / \alpha_v^2. Solving the linear system formed by equations (32) and (35), we have

    \beta_u = \frac{-h_{31} h_{32} (h_{21}^2 - h_{22}^2) + h_{21} h_{22} (h_{31}^2 - h_{32}^2)}{d},    (36)
    \beta_v = \frac{h_{31} h_{32} (h_{11}^2 - h_{12}^2) - h_{11} h_{12} (h_{31}^2 - h_{32}^2)}{d},    (37)

where

    d = h_{11} h_{12} (h_{21}^2 - h_{22}^2) - h_{21} h_{22} (h_{11}^2 - h_{12}^2).    (38)

Once \alpha_u and \alpha_v are estimated, we can compute \lambda using equation (33) or (34). The pose parameters are then given by

    r_{11} = \lambda h_{11} / \alpha_u, \quad r_{21} = \lambda h_{21} / \alpha_v, \quad r_{31} = \lambda h_{31},    (39)
    r_{12} = \lambda h_{12} / \alpha_u, \quad r_{22} = \lambda h_{22} / \alpha_v, \quad r_{32} = \lambda h_{32},    (40)
    t_1 = \lambda h_{13} / \alpha_u, \quad t_2 = \lambda h_{23} / \alpha_v, \quad t_3 = \lambda h_{33}.    (41)

r_{i3} (i = 1, ..., 3) can then be easily computed using the orthogonality of the rotation matrix.
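As a numerical check of this decomposition, the sketch below (our own illustration, not code from the paper) builds H from known parameters via equation (13), normalizes it so that h33 = 1, and recovers the parameters with the relations of appendix C; the signs of equations (36)-(38) are taken here so that the round trip closes.

```python
import numpy as np

def decompose_H(H):
    """Recover alpha_u, alpha_v, lambda, R and t from H (appendix C relations)."""
    h = H
    # eqs (36)-(38): beta_u = 1/alpha_u^2, beta_v = 1/alpha_v^2
    d = h[0,0]*h[0,1]*(h[1,0]**2 - h[1,1]**2) - h[1,0]*h[1,1]*(h[0,0]**2 - h[0,1]**2)
    beta_u = (-h[2,0]*h[2,1]*(h[1,0]**2 - h[1,1]**2)
              + h[1,0]*h[1,1]*(h[2,0]**2 - h[2,1]**2)) / d
    beta_v = ( h[2,0]*h[2,1]*(h[0,0]**2 - h[0,1]**2)
              - h[0,0]*h[0,1]*(h[2,0]**2 - h[2,1]**2)) / d
    au, av = 1.0/np.sqrt(beta_u), 1.0/np.sqrt(beta_v)
    # eq (33): scale factor lambda
    lam = 1.0/np.sqrt(h[0,0]**2/au**2 + h[1,0]**2/av**2 + h[2,0]**2)
    # eqs (39)-(41): first two columns of R and the translation
    r1 = lam*np.array([h[0,0]/au, h[1,0]/av, h[2,0]])
    r2 = lam*np.array([h[0,1]/au, h[1,1]/av, h[2,1]])
    t  = lam*np.array([h[0,2]/au, h[1,2]/av, h[2,2]])
    R  = np.column_stack([r1, r2, np.cross(r1, r2)])  # r_i3 from orthogonality
    return au, av, lam, R, t

def rot(axis, a):
    c, s = np.cos(a), np.sin(a)
    m = {'x': [[1, 0, 0], [0, c, -s], [0, s, c]],
         'y': [[c, 0, s], [0, 1, 0], [-s, 0, c]],
         'z': [[c, -s, 0], [s, c, 0], [0, 0, 1]]}[axis]
    return np.array(m)

# synthetic ground truth: a generic rotation, translation and camera scales
R_true = rot('z', 0.3) @ rot('y', -0.4) @ rot('x', 0.2)
t_true = np.array([0.1, -0.2, 2.0])
au_true, av_true = 800.0, 750.0
H = np.array([au_true*np.array([R_true[0,0], R_true[0,1], t_true[0]]),
              av_true*np.array([R_true[1,0], R_true[1,1], t_true[1]]),
              np.array([R_true[2,0], R_true[2,1], t_true[2]])])
H /= H[2, 2]   # H is only known up to scale; lambda absorbs the normalization

au, av, lam, R, t = decompose_H(H)
```

The check assumes a pose with t3 > 0 (head in front of the camera), so the positive square roots in equations (33), (36) and (37) pick the physical solution.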