
Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning

Jiahao Xia¹, Weiwei Qu², Wenjian Huang², Jianguo Zhang*², Xi Wang³, Min Xu*¹
¹ Faculty of Engineering and IT, University of Technology Sydney
² Dept. of Comp. Sci. and Eng., Southern University of Science and Technology
³ CalmCar
[email protected], [email protected], {huangwj, zhangjg}@sustech.edu.cn, [email protected], [email protected]

Abstract

Heatmap regression methods have dominated the face alignment area in recent years, yet they ignore the inherent relation between different landmarks. In this paper, we propose a Sparse Local Patch Transformer (SLPT) for learning this inherent relation. The SLPT generates the representation of each single landmark from a local patch and aggregates the representations by an adaptive inherent relation based on the attention mechanism. The subpixel coordinate of each landmark is predicted independently based on the aggregated feature. Moreover, a coarse-to-fine framework is further introduced to incorporate with the SLPT, which enables the initial landmarks to gradually converge to the target facial landmarks using fine-grained features from dynamically resized local patches. Extensive experiments carried out on three popular benchmarks, including WFLW, 300W and COFW, demonstrate that the proposed method works at the state-of-the-art level with much less computational complexity by learning the inherent relation between facial landmarks. The code is available at the project website¹.

Figure 1. The proposed coarse-to-fine framework leverages sparse local patches for robust face alignment. The sparse local patches are cropped according to the landmarks from the previous stage and fed into the same SLPT to predict the facial landmarks. Moreover, the patch size narrows down as the stage increases, enabling the local features to evolve into a pyramidal form.
1. Introduction

Face alignment aims to locate a group of pre-defined facial landmarks in images. Robust face alignment based on deep learning has attracted increasing attention in recent years, and it is a fundamental algorithm in many face-related applications such as face reenactment [40], face swapping [21] and driver fatigue detection [1]. Despite recent progress, it remains a challenging problem, especially for images with heavy occlusion, profile view and illumination variation.

The inherent relation between facial landmarks plays an important role in face alignment since the human face has a regular structure. Although heatmap regression methods have achieved impressive performance [7, 18, 33-35] in recent years, they still ignore the inherent relation because convolutional neural network (CNN) kernels focus locally and thus fail to capture the relations of landmarks farther away in a global manner. In particular, they take the pixel coordinate with the highest intensity of the output heatmap as the optimal landmark, which inevitably introduces a quantization error, especially for the commonly downsampled heatmap. Coordinate regression methods [9, 10, 12, 24, 36, 37, 42] have an innate potential to learn the relation since they regress the coordinates directly from the global feature via fully-connected (FC) layers. Nevertheless, a coherent relation should be learned together with local appearance, while coordinate regression methods lose the local feature by projecting the global feature into FC layers.

To address the aforementioned problems, we propose a Sparse Local Patch Transformer (SLPT). Instead of predicting the coordinates from the full feature map like DETR [5], the SLPT first generates the representation of each landmark from a local patch. Then, a series of learnable queries, called landmark queries, are used to aggregate the representations. Based on the cross-attention mechanism of the transformer, the SLPT learns an adaptive adjacency matrix in each layer. Finally, the subpixel coordinate of each landmark in its corresponding patch is predicted independently by an MLP. Due to the use of sparse local patches, the number of input tokens decreases significantly compared to other vision transformers [5, 11].

To further improve the performance, a coarse-to-fine framework is introduced to incorporate with the SLPT, as shown in Fig. 1. Similar to cascaded shape regression methods [13, 17, 44], the proposed framework optimizes a group of initial landmarks towards the target landmarks over several stages. The local patches in each stage are cropped based on the initial landmarks or the landmarks predicted in the former stage, and the patch size for a specific stage is 1/2 of its former stage. As a result, the local patches evolve in a pyramidal form and get closer to the target landmarks, yielding fine-grained local features.

To verify the effectiveness of the SLPT and the proposed framework, we carry out experiments on three popular benchmarks, WFLW [36], 300W [28] and COFW [4]. The results show that the proposed method significantly outperforms other state-of-the-art methods in terms of diverse metrics with much lower computational complexity. Moreover, we also visualize the attention maps of the SLPT and the inner product matrix of the landmark queries to demonstrate that the SLPT can learn the inherent relation of facial landmarks.

The main contributions of this work can be summarized as:

• We introduce a novel transformer, the Sparse Local Patch Transformer, to explore the inherent relation between facial landmarks based on the attention mechanism. The adaptive inherent relation learned by the SLPT enables the model to achieve SOTA performance with much less computational complexity.

• We introduce a coarse-to-fine framework to incorporate with the SLPT, which enables the local patches to evolve in a pyramidal form and get closer to the target landmarks for fine-grained features.

• Extensive experiments are conducted on three popular benchmarks, WFLW, 300W and COFW. The results illustrate that the proposed method learns the inherent relation of facial landmarks by the attention mechanism and works at the SOTA level.

* Corresponding Author
1 https://github.com/Jiahao-UTS/SLPT-master

2. Related Work

In the early stage of face alignment, the mainstream methods [4, 6, 13, 24, 27, 31, 39, 44] regressed facial landmarks directly from local features with classical machine learning algorithms such as random forest. With the development of CNNs, CNN-based face alignment methods have achieved impressive performance. They can be roughly divided into two categories: heatmap regression methods and coordinate regression methods.

2.1. Coordinate Regression Method

Coordinate regression methods [12, 37, 41, 42] regress the coordinates of landmarks directly from the feature map via FC layers. To further improve robustness, diverse cascaded networks [17, 30] and recurrent networks [38] have been proposed to achieve face alignment in multiple stages. Although coordinate regression methods have an innate potential to learn the inherent relation, they commonly require a huge number of samples for training. To address the problem, Qian et al. [26] and Dong et al. [9] expand the number of training samples by style transfer; Browatzki et al. [3] and Dong et al. [10] leverage unlabeled datasets to train the model. In recent years, state-of-the-art works have employed the structure information of the face as prior knowledge for better performance. Lin et al. [24] and Li et al. [22] model the interaction between landmarks by a graph convolutional network (GCN). However, the adjacency matrix of a GCN is fixed during inference and cannot adjust case by case. Learning an adaptive inherent relation is crucial for robust face alignment. Unfortunately, there is no work yet on this topic, and we propose a method to fill this gap.

2.2. Heatmap Regression Method

Heatmap regression methods [7, 25, 29, 34] output an intermediate heatmap for each landmark and take the pixel with the highest intensity as the optimal output. Therefore, they suffer from quantization errors since the heatmap is commonly much smaller than the input image. To eliminate the error, Kumar et al. [18] estimate the uncertainty of predicted landmark locations; Lan et al. [19] adopt an additional decimal heatmap for subpixel estimation; Huang et al. [15] further regress the coordinate from an anisotropic attention mask generated from heatmaps. Moreover, heatmap regression methods also ignore the relation between landmarks. To construct the relation between neighboring points, Wu et al. [36] and Wang et al. [35] take advantage of facial boundaries as prior knowledge; Zou et al. [47] cluster landmarks with a graph model to provide structural constraints. However, these methods still cannot explicitly model an inherent relation between landmarks over a long distance.

The recently proposed vision transformer [11] enables a model to attend to areas over a long distance. Besides, the attention mechanism in the transformer can generate an adaptive global attention for different tasks, such as object detection [5, 46] and human pose estimation [23], and in principle, we envision that it can also learn an adaptive inherent relation for face alignment. In this paper, we demonstrate the capability of the SLPT for learning the relation.

Figure 2. An overview of the SLPT. The SLPT crops local patches from the feature map according to the facial landmarks from the previous stage. Each patch is then embedded into a vector that can be viewed as the representation of the corresponding landmark. Subsequently, the representations are supplemented with the structure encoding to obtain their relative positions in a regular face. A fixed number of landmark queries are then input into the decoder, attending to the vectors to learn the inherent relation between landmarks. Finally, the outputs are fed into a shared MLP to estimate the position of each facial landmark independently. The rightmost images demonstrate the adaptive inherent relation for different samples (unoccluded, mouth occluded, eye occluded): each point is connected to the point with the highest cross-attention weight in the first inherent relation layer.

3. Method

3.1. Sparse Local Patch Transformer

As shown in Fig. 2, the Sparse Local Patch Transformer (SLPT) consists of three parts: the patch embedding & structure encoding, the inherent relation layers and the prediction heads.

Patch embedding & structure encoding: ViT [11] divides an image or a feature map I ∈ R^{H_I × W_I × C} into a grid of (H_I / P_h) × (W_I / P_w) patches of size P_h × P_w and maps each patch into a d-dimension vector as the input. Different from ViT, for each landmark, the SLPT crops a local patch of fixed size (P_h, P_w) from the feature map as its supporting patch, whose center is located at the landmark. Then, the patches are resized to K × K by linear interpolation and mapped into a series of vectors by a CNN layer. Hence, each vector can be viewed as the representation of the corresponding landmark. Besides, to retain the relative position of landmarks in a regular face shape (structure information), we supplement the representations with a series of learnable parameters called structure encodings. As shown in Fig. 3, the SLPT learns to encode the distance between landmarks within the regular facial structure in the similarity of the encodings: each encoding has a high similarity with the encodings of neighboring landmarks (e.g. left eye and right eye).
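To make this step concrete, here is a minimal PyTorch sketch (ours, not the authors' released code) of cropping a local patch centred on each landmark and resampling it to K × K with bilinear interpolation in a single grid_sample call; the function and argument names, and the use of normalised coordinates, are our assumptions:

```python
import torch
import torch.nn.functional as F

def crop_local_patches(feature_map, landmarks, patch_size, k=7):
    """Crop a (P_h, P_w) patch centred on each landmark and resample it
    to K x K, sketching the patch embedding input. Landmark centres and
    patch extents are given in grid_sample's normalised [-1, 1] units.

    feature_map : (1, C, H, W) tensor
    landmarks   : (N, 2) tensor of (x, y) patch centres
    patch_size  : (p_w, p_h) extent of a patch in normalised units
    """
    n = landmarks.shape[0]
    p_w, p_h = patch_size
    # K x K sampling offsets spanning one patch, centred on zero
    ys = torch.linspace(-p_h / 2, p_h / 2, k)
    xs = torch.linspace(-p_w / 2, p_w / 2, k)
    dy, dx = torch.meshgrid(ys, xs, indexing="ij")
    offsets = torch.stack((dx, dy), dim=-1)          # (K, K, 2), x first
    grid = landmarks.view(n, 1, 1, 2) + offsets      # one grid per landmark
    feats = feature_map.expand(n, -1, -1, -1)        # share the feature map
    # bilinear resampling implements both the crop and the K x K resize
    return F.grid_sample(feats, grid, align_corners=False)  # (N, C, K, K)
```

The resulting (N, C, K, K) stack can then be mapped to the N landmark representations by the CNN layer described above.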


Inherent relation layer: Inspired by the Transformer [32], we propose inherent relation layers to model the relation between landmarks. Each layer consists of three blocks, a multi-head self-attention (MSA) block, a multi-head cross-attention (MCA) block and a multilayer perceptron (MLP) block, and an additional LayerNorm (LN) is applied before every block. Based on the self-attention mechanism in the MSA block, the information of the queries interacts adaptively for learning a query-query inherent relation. Supposing the l-th MSA block has H heads, the input T^l and the landmark queries Q with dimension C_I are divided into H sequences equally (T^l is a zero matrix in the 1st layer). The self-attention weight of the h-th head A_h is calculated by:

    \bm{A}_h = \mathrm{softmax}\left( \frac{ \left(\bm{T}_h^{l} + \bm{Q}_h\right)\bm{W}_h^{q} \left( \left(\bm{T}_h^{l} + \bm{Q}_h\right)\bm{W}_h^{k} \right)^{T} }{ \sqrt{C_h} } \right),    (1)

where W_h^q and W_h^k ∈ R^{C_h × C_h} are the learnable parameters of two linear layers. T_h^l ∈ R^{N × C_h} and Q_h ∈ R^{N × C_h} are the input and the landmark queries of the h-th head respectively, with dimension C_h = C_I / H. Then, the MSA block can be formulated as:

    \mathrm{MSA}(\bm{T}^{l}) = \left[ \bm{A}_1\bm{T}_1^{l}\bm{W}_1^{v}; \ldots; \bm{A}_H\bm{T}_H^{l}\bm{W}_H^{v} \right] \bm{W}_P,    (2)

where W_h^v ∈ R^{C_h × C_h} and W_P ∈ R^{C_I × C_I} are also the learnable parameters of linear layers.

Figure 3. Cosine similarity of the structure encodings of the SLPT learned from a dataset with 98 landmark annotations. High cosine similarities are observed for points which are close in the regular face structure.

The MCA block aggregates the representations of facial landmarks based on the cross-attention mechanism for learning an adaptive representation-query relation. As shown in the rightmost images of Fig. 2, by taking advantage of the cross attention, each landmark can employ neighboring landmarks for coherent prediction, and an occluded landmark can be predicted according to the representations of visible landmarks. Similar to the MSA block, the MCA block also has H heads, and the attention weight of the h-th head A'_h can be calculated by:

    \bm{A}_h^{\prime} = \mathrm{softmax}\left( \frac{ \left(\bm{T}_h^{\prime l} + \bm{Q}_h\right)\bm{W}_h^{\prime q} \left( \left(\bm{R}_h + \bm{P}_h\right)\bm{W}_h^{\prime k} \right)^{T} }{ \sqrt{C_h} } \right),    (3)

where W_h^{\prime q} and W_h^{\prime k} ∈ R^{C_h × C_h} are the learnable parameters of two linear layers in the h-th head. T_h^{\prime l} ∈ R^{N × C_h} is the input of the l-th MCA block; P_h ∈ R^{N × C_h} is the structure encodings; R_h ∈ R^{N × C_h} is the landmark representations. The MCA block can be formulated as:

    \mathrm{MCA}(\bm{T}^{\prime l}) = \left[ \bm{A}_1^{\prime}\bm{T}_1^{\prime l}\bm{W}_1^{\prime v}; \ldots; \bm{A}_H^{\prime}\bm{T}_H^{\prime l}\bm{W}_H^{\prime v} \right] \bm{W}_P^{\prime},    (4)

where W_h^{\prime v} ∈ R^{C_h × C_h} and W_P^{\prime} ∈ R^{C_I × C_I} are also the learnable parameters of linear layers in the MCA block.
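Assembled from Eqs. (1)-(4), one inherent relation layer can be sketched in PyTorch as below. This is our reading of the layer, not the released implementation: the landmark queries and structure encodings are added (not concatenated) to the attention inputs, and we take the MCA values from the patch representations, the usual cross-attention choice; the hyperparameter names are ours.

```python
import torch.nn as nn

class InherentRelationLayer(nn.Module):
    """One inherent relation layer: pre-LN MSA, MCA and MLP blocks."""

    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, t, q, r, p):
        # t: layer input T^l, q: landmark queries Q,
        # r: patch representations R, p: structure encodings P; all (B, N, C)
        x = self.norm1(t)
        t = t + self.msa(x + q, x + q, x)[0]   # Eqs. (1)-(2): query-query
        x = self.norm2(t)
        t = t + self.mca(x + q, r + p, r)[0]   # Eqs. (3)-(4): representation-query
        return t + self.mlp(self.norm3(t))     # MLP block with residual
```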
Supposing N pre-defined landmarks are predicted, the computational complexity of the MCA block that employs sparse local patches, Ω(S), and of the one that employs the full feature map, Ω(F), is:

    \Omega(S) = 4HNC_h^2 + 2HN^2C_h,    (5)

    \Omega(F) = \left( 2N + 2\frac{W_I H_I}{P_w P_h} \right) HC_h^2 + 2NH\frac{W_I H_I}{P_w P_h}C_h.    (6)

Compared to using the full feature map, the number of representations decreases from (H_I / P_h) × (W_I / P_w) to N (with the same input size, (H_I / P_h) × (W_I / P_w) is 16 × 16 in the related framework [5]), which decreases the computational complexity significantly. For a 29-landmark dataset [4], Ω(S) is only 1/5 of Ω(F) (H = 8 and C_h = 32 in the experiment).
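As a quick sanity check of Eqs. (5)-(6), the snippet below plugs in the values quoted in this section (H = 8, C_h = 32, N = 29 and 16 × 16 tokens for the full-feature-map baseline); the exact ratio depends on these assumed values:

```python
# Values quoted in the text: 8 heads, head dimension 32, 29 landmarks,
# and 16 x 16 = 256 tokens for the full-feature-map variant.
H, C_h, N, tokens = 8, 32, 29, 16 * 16

omega_s = 4 * H * N * C_h ** 2 + 2 * H * N ** 2 * C_h                     # Eq. (5)
omega_f = (2 * N + 2 * tokens) * H * C_h ** 2 + 2 * N * H * tokens * C_h  # Eq. (6)

print(omega_s, omega_f, omega_s / omega_f)
# -> 1380864 8470528 ~0.16, i.e. Omega(S) is well below 1/5 of Omega(F)
```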
Prediction head: the prediction head consists of a LayerNorm to normalize the input and an MLP layer to predict the result. The output of the inherent relation layer is the local position of the landmark with respect to its supporting patch. Based on the local position (t_x^i, t_y^i) on the i-th patch, the global coordinate (x^i, y^i) of the i-th landmark can be calculated by:

    x^i = x_{lt}^i + w^i t_x^i,
    y^i = y_{lt}^i + h^i t_y^i,    (7)

where (x_{lt}^i, y_{lt}^i) is the left-top corner and (w^i, h^i) is the size of the supporting patch.
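Eq. (7) is a simple affine mapping from patch coordinates back to image coordinates; a minimal numpy sketch (the names are ours):

```python
import numpy as np

def decode_landmarks(local_preds, patch_left_top, patch_sizes):
    """Eq. (7): map local predictions (t_x, t_y) on each supporting patch
    back to global image coordinates.

    local_preds    : (N, 2) array of (t_x, t_y) from the prediction head
    patch_left_top : (N, 2) array of patch left-top corners (x_lt, y_lt)
    patch_sizes    : (N, 2) array of patch sizes (w, h)
    """
    return patch_left_top + patch_sizes * local_preds
```

For example, a local prediction of (0.5, 0.5) decodes to the centre of its supporting patch.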
3.2. Coarse-to-fine Locating

To further improve the performance and robustness of the SLPT, we introduce a coarse-to-fine framework, trained end-to-end, to incorporate with the SLPT. The pseudo-code in Algorithm 1 shows the training pipeline of the framework. It enables a group of initial facial landmarks S_0, calculated from the mean face of the training set, to converge to the target facial landmarks gradually over several stages. Each stage takes the previous landmarks as centers to crop a series of patches. Then, the patches are resized to a fixed size K × K and fed into the SLPT to predict the local points on the supporting patches. The large patch size in the initial stage gives the SLPT a large receptive field that prevents the patches from deviating from the target landmarks. Then, the patch size in each following stage is 1/2 of its former stage, which enables the local patches to extract fine-grained features and evolve into a pyramidal form. By taking advantage of the pyramidal form, we observe a significant improvement for the SLPT (see Section 4.5).

Algorithm 1 Training pipeline of the coarse-to-fine framework
Require: Training image I, initial landmarks S_0, backbone network B, SLPT T, loss function L, ground truth S_gt, stage number N_stage
 1: while the training epoch is less than a specific number do
 2:     Forward B for the feature map by F = B(I);
 3:     Initialize the local patch size (P_w, P_h) ← (W/4, H/4);
 4:     for i ← 1 to N_stage do
 5:         Crop local patches P from F according to the former landmarks S_{i-1};
 6:         Resize the patches from (P_w, P_h) to K × K;
 7:         Forward T for landmarks by S_i = T(P);
 8:         Reduce the patch size (P_w, P_h) by half;
 9:     end for
10:     Minimize L(S_gt, S_1, S_2, ..., S_{N_stage})
11: end while
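Algorithm 1 translates almost line for line into Python. The sketch below shows the resulting forward pass; crop_and_resize stands in for the grid_sample patch helper sketched in Section 3.1 and, like the other names here, is ours rather than the authors' API:

```python
def coarse_to_fine_forward(image, backbone, slpt, s0, n_stage=3, k=7):
    """Run the SLPT over progressively smaller local patches (Algorithm 1).

    s0 is the mean-face initialisation; each stage re-crops patches around
    the previous landmarks and halves the patch size, so the patches form
    a pyramid. Training minimises the Sec. 3.3 loss over all stage outputs.
    """
    feat = backbone(image)                      # F = B(I)
    h, w = feat.shape[-2:]
    patch_size = (w // 4, h // 4)               # (P_w, P_h) <- (W/4, H/4)
    landmarks, outputs = s0, []
    for _ in range(n_stage):
        patches = crop_and_resize(feat, landmarks, patch_size, k)  # steps 5-6
        landmarks = slpt(patches)               # S_i = T(P), step 7
        outputs.append(landmarks)
        patch_size = (patch_size[0] // 2, patch_size[1] // 2)      # step 8
    return outputs                              # supervise every stage
```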
Method          NME(%)↓   FR_0.1(%)↓   AUC_0.1↑
LAB [36]        5.27      7.56         0.532
SAN [9]         5.22      6.32         0.535
Coord⋆ [34]     4.76      5.04         0.549
DETR† [5]       4.71      5.00         0.552
Heatmap⋆ [34]   4.60      4.64         0.524
AVS+SAN [26]    4.39      4.08         0.591
LUVLi [18]      4.37      3.12         0.557
AWing [35]      4.36      2.84         0.572
SDFL⋆ [24]      4.35      2.72         0.576
SDL⋆ [22]       4.21      3.04         0.589
HIH [19]        4.18      2.84         0.597
ADNet [15]      4.14      2.72         0.602
SLPT‡           4.20      3.04         0.588
SLPT†           4.14      2.76         0.595

Table 1. Performance comparison of the SLPT and state-of-the-art methods on WFLW. The normalization factor is inter-ocular and the threshold for FR is set to 0.1. Key: [Best, Second Best, ⋆ = HRNetW18C, † = HRNetW18C-lite, ‡ = ResNet34]

                Inter-Ocular NME(%)↓
Method          Common   Challenging   Fullset
SAN [9]         3.34     6.60          3.98
Coord⋆ [34]     3.05     5.39          3.51
LAB [36]        2.98     5.19          3.49
DeCaFA [7]      2.93     5.26          3.39
HIH [19]        2.93     5.00          3.33
Heatmap⋆ [34]   2.87     5.15          3.32
SDFL⋆ [24]      2.88     4.93          3.28
HG-HSLE [47]    2.85     5.03          3.28
LUVLi [18]      2.76     5.16          3.23
AWing [35]      2.72     4.53          3.07
SDL⋆ [22]       2.62     4.77          3.04
ADNet [15]      2.53     4.58          2.93
SLPT‡           2.78     4.93          3.20
SLPT†           2.75     4.90          3.17

Table 2. Performance comparison of the SLPT and state-of-the-art methods on the 300W common subset, challenging subset and fullset. Key: [Best, Second Best, ⋆ = HRNetW18C, † = HRNetW18C-lite, ‡ = ResNet34]

3.3. Loss Function

We employ the normalized L2 loss to provide supervision for the stages of the coarse-to-fine framework. Moreover, similar to other works [25, 29], providing additional supervision for the intermediate outputs during training is also helpful. Therefore, we feed the intermediate output of each inherent relation layer into a shared prediction head. The loss function is written as:

    L = \frac{1}{SDN} \sum_{i=1}^{S} \sum_{j=1}^{D} \sum_{k=1}^{N} \frac{ \left\| \left( x_{gt}^{k}, y_{gt}^{k} \right) - \left( x^{ijk}, y^{ijk} \right) \right\|_2 }{d},    (8)

where S and D indicate the number of coarse-to-fine stages and inherent relation layers respectively. (x_{gt}^{k}, y_{gt}^{k}) is the labeled coordinate of the k-th point; (x^{ijk}, y^{ijk}) is the coordinate of the k-th point predicted by the j-th inherent relation layer in the i-th stage. d is the distance between the outer eye corners, which acts as a normalization factor.
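In code, Eq. (8) is a mean of inter-ocular-normalised L2 errors over all stages, layers and landmarks; a minimal PyTorch sketch, where the nesting of preds is our assumption:

```python
import torch

def slpt_loss(preds, gt, d):
    """Normalised L2 loss of Eq. (8).

    preds : list (over S stages) of lists (over D inherent relation
            layers) of (N, 2) predicted landmark tensors
    gt    : (N, 2) ground-truth landmarks
    d     : distance between the outer eye corners (normalisation)
    """
    errors = [torch.linalg.norm(p - gt, dim=-1) / d   # per-landmark L2 / d
              for stage in preds for p in stage]
    return torch.stack(errors).mean()                 # 1/(S*D*N) triple sum
```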
4. Experiment

4.1. Datasets

Experiments are conducted on three popular benchmarks, including WFLW [36], 300W [28] and COFW [4].

WFLW is a very challenging dataset that consists of 10,000 images, 7,500 for training and 2,500 for testing. It provides 98 manually annotated landmarks and rich attribute labels, such as profile face, heavy occlusion, make-up and illumination.

300W is the most commonly used dataset; it includes 3,148 images for training and 689 images for testing. The training set consists of the fullset of AFW [45] and the training subsets of HELEN [20] and LFPW [2]. The test set is further divided into a challenging subset that includes 135 images (the IBUG fullset [28]) and a common subset that consists of 554 images (the test subsets of HELEN and LFPW). Each image in 300W is annotated with 68 facial landmarks.

COFW mainly consists of samples with heavy occlusion and profile faces. The training set includes 1,345 images, and each image is provided with 29 annotated landmarks. The test set has two variants: one presents 29 annotated landmarks per face image (COFW), and the other is provided with 68 annotated landmarks per face image (COFW68 [14]). Both contain 507 images. We employ the COFW68 set for cross-dataset validation.

4.2. Evaluation Metrics

Referring to other related work [18, 24, 35], we evaluate the proposed method with the standard metrics, Normalized Mean Error (NME), Failure Rate (FR) and Area Under Curve (AUC). NME is defined as:

    \mathrm{NME}\left( \bm{S}, \bm{S}_{gt} \right) = \frac{1}{N} \sum_{i=1}^{N} \frac{ \left\| \bm{p}^{i} - \bm{p}_{gt}^{i} \right\|_2 }{d} \times 100\%,    (9)

where S and S_gt denote the predicted and annotated coordinates of the landmarks respectively, p^i and p_gt^i indicate the coordinates of the i-th landmark in S and S_gt, N is the number of landmarks, and d is the reference distance that normalizes the error. d can be the distance between the outer eye corners (inter-ocular) or the distance between the pupil centers (inter-pupil). FR indicates the percentage of images in the test set whose NME is higher than a certain threshold. AUC is calculated based on the Cumulative Error Distribution (CED) curve, which gives the fraction of test images whose NME(%) is less than or equal to the value on the horizontal axis; AUC is the area under the CED curve from zero to the threshold used for FR.
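All three metrics derive from the per-image NME of Eq. (9); a compact numpy sketch (NME is kept as a fraction here; multiply by 100 for the percentages reported in the tables):

```python
import numpy as np

def nme(pred, gt, d):
    """Eq. (9): mean point-to-point error normalised by d."""
    return np.linalg.norm(pred - gt, axis=-1).mean() / d

def failure_rate(nmes, thresh=0.1):
    """FR: fraction of test images whose NME exceeds the threshold."""
    return (np.asarray(nmes) > thresh).mean()

def auc(nmes, thresh=0.1, steps=1000):
    """AUC: area under the CED curve from zero to the FR threshold."""
    xs = np.linspace(0.0, thresh, steps)
    ced = [(np.asarray(nmes) <= x).mean() for x in xs]   # CED curve
    return np.trapz(ced, xs) / thresh                    # normalised area
```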
Figure 4. Visualization of the ground truth and the face alignment results of the SLPT, a heatmap regression method (HRNetW18C) and a coordinate regression method (HRNetW18C) on faces with blur, heavy occlusion and profile view.

                Inter-Ocular           Inter-Pupil
Method          NME(%)↓   FR(%)↓      NME(%)↓   FR(%)↓
DAC-CSR [13]    6.03      4.73        -         -
LAB [36]        3.92      0.39        -         -
Coord⋆ [34]     3.73      0.39        -         -
SDFL⋆ [24]      3.63      0.00        -         -
Heatmap⋆ [34]   3.45      0.20        -         -
Human [4]       -         -           5.60      -
TCDCN [42]      -         -           8.05      -
Wing [12]       -         -           5.44      3.75
DCFE [31]       -         -           5.27      7.29
AWing [35]      -         -           4.94      0.99
ADNet [15]      -         -           4.68      0.59
SLPT‡           3.36      0.59        4.85      1.18
SLPT†           3.32      0.00        4.79      1.18

Table 3. NME and FR_0.1 comparisons under inter-ocular normalization and inter-pupil normalization on the COFW within-dataset validation. The threshold for failure rate (FR) is set to 0.1. Key: [Best, Second Best, ⋆ = HRNetW18C, † = HRNetW18C-lite, ‡ = ResNet34]

Method          NME(%)↓   FR_0.1(%)↓
TCDCN [42]      7.66      16.17
CFSS [44]       6.28      9.07
ODN [43]        5.30      -
AVS+SAN [26]    4.43      2.82
LAB [36]        4.62      2.17
SDL⋆ [22]       4.22      0.39
SDFL⋆ [24]      4.18      0.00
SLPT‡           4.11      0.59
SLPT†           4.10      0.59

Table 4. Inter-ocular NME and FR_0.1 comparisons on the 300W-COFW68 cross-dataset evaluation. Key: [Best, Second Best, ⋆ = HRNetW18C, † = HRNetW18C-lite, ‡ = ResNet34]

4.3. Implementation Details

Each input image is cropped and resized to 256 × 256. We train the proposed framework with Adam [8], setting the initial learning rate to 1 × 10^{-3}. Unless otherwise specified, the size of the resized patch is set to 7 × 7, and the framework has 6 inherent relation layers and 3 coarse-to-fine stages. Besides, we augment the training set with random horizontal flipping (50%), grayscale (20%), occlusion (33%), scaling (±5%), rotation (±30°) and translation (±10 px). We implement our method with two different backbones: a light HRNetW18C [34] (the modularized block number in each stage is set to 1) and ResNet34 [16]. For the HRNetW18C-lite, the resolution of the feature map is 64 × 64; for the ResNet34, we extract representations from the output feature maps of stages C2 through C5 (see Appendix A.1).

4.4. Comparison with State-of-the-Art Methods

WFLW: as tabulated in Table 1 (more detailed results on the subsets of WFLW are given in Appendix A.2), the SLPT demonstrates impressive performance. With an increasing number of inherent relation layers, the performance of the SLPT can be further improved, outperforming ADNet (see Appendix A.5). Referring to DETR, we also implement a transformer-based method that employs the full feature map for face alignment, with 16 × 16 input tokens. With the same backbone (HRNetW18C-lite), we observe an improvement of 12.10% in NME, and the number of training epochs is 8× less than DETR (see Appendix A.3). Moreover, the SLPT also outperforms the coordinate regression and heatmap regression methods significantly. Some qualitative results are shown in Fig. 4. It is evident that our method localizes the landmarks accurately, in particular for face images with blur (2nd row in Fig. 4), profile view (1st row in Fig. 4) and heavy occlusion (3rd and 4th rows in Fig. 4).
                       1st stage              2nd stage              3rd stage              4th stage
Model                  NME    FR     AUC      NME    FR     AUC      NME    FR     AUC      NME    FR     AUC
Model† with 1 stage    4.79%  5.08%  0.583    -      -      -        -      -      -        -      -      -
Model† with 2 stages   4.52%  4.24%  0.563    4.27%  3.40%  0.585    -      -      -        -      -      -
Model† with 3 stages   4.38%  3.60%  0.574    4.16%  2.80%  0.594    4.14%  2.76%  0.595    -      -      -
Model† with 4 stages   4.47%  4.00%  0.567    4.26%  3.40%  0.586    4.24%  3.36%  0.588    4.24%  3.32%  0.587

Table 5. Performance comparison of the SLPT with different numbers of coarse-to-fine stages on WFLW. The normalization factor for NME is inter-ocular and the threshold for FR and AUC is set to 0.1. Key: [Best, † = HRNetW18C-lite]

Method     MSA   MCA   NME     FR      AUC
Model† 1   w/o   w/o   4.48%   4.32%   0.566
Model† 2   w/    w/o   4.20%   3.08%   0.590
Model† 3   w/o   w/    4.17%   2.84%   0.593
Model† 4   w/    w/    4.14%   2.76%   0.595

Table 6. NME(↓), FR_0.1(↓) and AUC_0.1(↑) with/without the MSA and MCA blocks. Key: [Best, † = HRNetW18C-lite]

Method                     NME     FR      AUC
w/o structure encoding†    4.16%   2.84%   0.593
w/ structure encoding†     4.14%   2.76%   0.595

Table 7. NME(↓), FR_0.1(↓) and AUC_0.1(↑) with/without the structure encoding. Key: [Best, † = HRNetW18C-lite]

300W: the comparison results are shown in Table 2. Compared to the coordinate and heatmap regression methods (HRNetW18C [34]), the SLPT still achieves an impressive improvement of 9.69% and 4.52% respectively in NME on the fullset. However, the improvement on 300W is not as significant as on WFLW, since learning an adaptive inherent relation requires a large number of annotated samples. With limited training samples, methods with prior knowledge, such as facial boundaries (AWing and ADNet) or an affined mean shape (SDL), always achieve better performance.

COFW: we conduct two experiments on COFW for comparison, within-dataset validation and cross-dataset validation. For the within-dataset validation, the model is trained with 1,345 images and validated with 507 images on COFW. The inter-ocular and inter-pupil NME of the SLPT and the state-of-the-art methods are reported in Table 3. In this experiment, the number of training samples is quite small, which leads to a significant degradation of coordinate regression methods such as SDFL and LAB. Nevertheless, the SLPT still maintains excellent performance and yields the second best result. It improves the metric by 3.77% and 11.00% in NME over the heatmap regression and coordinate regression methods respectively.

For the cross-dataset validation, the training set includes the complete 300W dataset (3,837 images) and the test set is COFW68 (507 images with 68 landmark annotations). Most samples of COFW68 are under heavy occlusion. The inter-ocular NME and FR of the SLPT and the state-of-the-art methods are reported in Table 4. Compared to the GCN-based methods (SDL and SDFL), the SLPT (HRNet) achieves an impressive result, as low as 4.10% in NME. The result illustrates that the adaptive inherent relation of the SLPT works better than the fixed adjacency matrix of a GCN for robust face alignment, especially under heavy occlusion.

4.5. Ablation Study

Evaluation on different coarse-to-fine stages: to explore the contribution of the coarse-to-fine framework, we train the SLPT with different numbers of coarse-to-fine stages on the WFLW dataset. The NME, AUC_0.1 and FR_0.1 of each intermediate stage and the final stage are shown in Table 5. Compared to the model with only one stage, the local patches in the multi-stage models evolve into a pyramidal form, which improves the performance of the intermediate stages and the final stage significantly. When the number of stages increases from 1 to 3, the NME of the first stage decreases dramatically from 4.79% to 4.38%. When the number of stages is more than 3, the performance converges and additional stages bring no further improvement.

Evaluation on the MSA and MCA blocks: to explore the influence of the query-query inherent relation (Eq. 1) and the representation-query inherent relation (Eq. 3) created by the MSA and MCA blocks, we implement four different models with/without MSA and MCA, numbered 1 to 4. For the models without the MCA block, we utilize the landmark representations as the query input. The performance of the four models is tabulated in Table 6. Without MSA and MCA, each landmark is regressed merely from the feature of its supporting patch (model 1). Nevertheless, it still outperforms other coordinate regression methods because of the coarse-to-fine framework. When self-attention or cross-attention is introduced into the model, the performance is boosted significantly, reaching 4.20% and 4.17% respectively in terms of NME. Moreover, self-attention and cross-attention can be combined to improve the performance of the model further.
Figure 5. The statistical attention interactions of the MCA and MSA blocks in the final stage on the WFLW test set: (a)-(f) MCA layers 1-6; (g)-(l) MSA layers 1-6. Each row indicates the attention weights of a landmark.

Method                      FLOPs(G)   Params(M)
HRNet⋆ [34]                 4.75       9.66
LAB [36]                    18.85      12.29
AVS+SAN [26]                33.87      35.02
AWing [35]                  26.8       24.15
DETR† (98 landmarks) [5]    4.26       11.00
DETR† (68 landmarks) [5]    4.06       11.00
DETR† (29 landmarks) [5]    3.80       10.99
SLPT† (98 landmarks)        6.12       13.19
SLPT† (68 landmarks)        5.17       13.18
SLPT† (29 landmarks)        3.99       13.16

Table 8. Computational complexity and parameters of the SLPT and SOTA methods. Key: [⋆ = HRNetW18C, † = HRNetW18C-lite]

Evaluation on the structure encoding: we implement two models with/without the structure encoding to explore the influence of the structural information. With the structural information, the performance of the SLPT is improved, as shown in Table 7.

Evaluation on computational complexity: the computational complexity and parameters of the SLPT and other SOTA methods are shown in Table 8. The computational complexity of the SLPT is only 1/8 to 1/5 of the FLOPs of the previous SOTA methods (AVS and AWing), demonstrating that learning the inherent relation is more efficient than other approaches. Although the SLPT runs the coarse-to-fine localization, patch embedding and linear interpolation procedures three times, we do not observe a significant increase in computational complexity, especially for 29 landmarks, because the sparse local patches lead to fewer tokens.

Besides, the influence of the patch size and the number of inherent relation layers is shown in Appendices A.4 and A.5.

4.6. Visualization

We calculate the mean attention weight of each MCA and MSA block on the WFLW test set, as shown in Fig. 5. We find that the MCA block tends to aggregate the representations of the supporting and neighboring patches to generate a local feature, while the MSA block tends to pay attention to landmarks over a long distance to create a global feature. That is why the MCA block can incorporate with the MSA block for better performance.

5. Conclusion

In this paper, we find that the inherent relation between landmarks is significant to the performance of face alignment, while it is ignored by most state-of-the-art methods. To address the problem, we propose a Sparse Local Patch Transformer for learning a query-query and a representation-query relation. Moreover, a coarse-to-fine framework that enables the local patches to evolve into a pyramidal form is proposed to further improve the performance of the SLPT. With the adaptive inherent relation learned by the SLPT, our method achieves robust face alignment, especially for faces with blur, heavy occlusion and profile view, and outperforms the state-of-the-art methods significantly with much less computational complexity. Ablation studies verify the effectiveness of the proposed method. In future work, the inherent relation learning will be studied further and extended to other tasks.

Acknowledgment

This work was sponsored by the program of the China Scholarship Council (No. 202006130004).
References

[1] Bram Bakker, Bartosz Zabłocki, Angela Baker, Vanessa Riethmeister, Bernd Marx, Girish Iyer, Anna Anund, and Christer Ahlström. A multi-stage, multi-feature machine learning approach to detect driver sleepiness in naturalistic road driving conditions. IEEE Transactions on Intelligent Transportation Systems, pages 1–10, 2021.
[2] Peter N. Belhumeur, David W. Jacobs, David J. Kriegman, and Neeraj Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, pages 545–552, 2011.
[3] Björn Browatzki and Christian Wallraven. 3FabRec: Fast few-shot face alignment by reconstruction. In CVPR, pages 6109–6119, 2020.
[4] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion. In ICCV, pages 1513–1520, 2013.
[5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[6] David Cristinacce and Tim Cootes. Feature detection and tracking with constrained local models. In BMVC, volume 3, pages 929–938, 2006.
[7] Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. DeCaFA: Deep convolutional cascade for face alignment in the wild. In ICCV, pages 6892–6900, 2019.
[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[9] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In CVPR, pages 379–388, 2018.
[10] Xuanyi Dong, Yi Yang, Shih-En Wei, Xinshuo Weng, Yaser Sheikh, and Shoou-I Yu. Supervision by registration and triangulation for landmark detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3681–3694, 2021.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[12] Zhenhua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiaojun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In CVPR, pages 2235–2245, 2018.
[13] Zhenhua Feng, Josef Kittler, William Christmas, Patrik Huber, and Xiaojun Wu. Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In CVPR, pages 3681–3690, 2017.
[14] Golnaz Ghiasi and Charless C. Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In CVPR, pages 1899–1906, 2014.
[15] Yangyu Huang, Hao Yang, Chong Li, Jongyoo Kim, and Fangyun Wei. ADNet: Leveraging error-bias towards normal direction in face alignment. In ICCV, pages 3060–3070, 2021.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[17] Marek Kowalski, Jacek Naruniec, and Tomasz Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. In CVPRW, pages 2034–2043, 2017.
[18] Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. LUVLi face alignment: Estimating landmarks' location, uncertainty, and visibility likelihood. In CVPR, pages 8233–8243, 2020.
[19] Xing Lan, Qinghao Hu, and Jian Cheng. Revisiting quantization error in face alignment. In ICCVW, pages 1521–1530, 2021.
[20] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S. Huang. Interactive facial feature localization. In ECCV, pages 679–692, 2012.
[21] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Advancing high fidelity identity swapping for forgery detection. In CVPR, pages 5073–5082, 2020.
[22] Weijian Li, Yuhang Lu, Kang Zheng, Haofu Liao, Chihung Lin, Jiebo Luo, Chi-Tung Cheng, Jing Xiao, Le Lu, Chang-Fu Kuo, and Shun Miao. Structured landmark detection via topology-adapting deep graph learning. In ECCV, pages 266–283, 2020.
[23] Yanjie Li, Shoukui Zhang, Zhicheng Wang, Sen Yang, Wankou Yang, Shu-Tao Xia, and Erjin Zhou. TokenPose: Learning keypoint tokens for human pose estimation. In ICCV, 2021.
[24] Chunze Lin, Beier Zhu, Quan Wang, Renjie Liao, Chen Qian, Jiwen Lu, and Jie Zhou. Structure-coherent deep feature learning for robust face alignment. IEEE Transactions on Image Processing, 30:5313–5326, 2021.
[25] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.
[26] Shengju Qian, Keqiang Sun, Wayne Wu, Chen Qian, and Jiaya Jia. Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation. In ICCV, pages 10152–10162, 2019.
[27] Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via regressing local binary features. In CVPR, pages 1685–1692, 2014.
[28] Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCVW, pages 397–403, 2013.
[29] Zhiqiang Tang, Xi Peng, Kang Li, and Dimitris N. Metaxas. Towards efficient U-Nets: A coupled and quantized approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):2038–2050, 2020.
[30] George Trigeorgis, Patrick Snape, Mihalis A. Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, pages 4177–4187, 2016.
[31] Roberto Valle, José M. Buenaposada, Antonio Valdés, and Luis Baumela. A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment. In ECCV, pages 609–624, 2018.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
[33] Jun Wan, Zhihui Lai, Jun Liu, Jie Zhou, and Can Gao. Robust face alignment by multi-order high-precision hourglass network. IEEE Transactions on Image Processing, 30:121–133, 2021.
[34] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2021.
[35] Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive wing loss for robust face alignment via heatmap regression. In ICCV, pages 6970–6980, 2019.
[36] Wenyan Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, pages 2129–2138, 2018.
[37] Wenyan Wu and Shuo Yang. Leveraging intra and inter-dataset variations for robust face alignment. In CVPRW, pages 2096–2105, 2017.
[38] Shengtao Xiao, Jiashi Feng, Junliang Xing, Hanjiang Lai, Shuicheng Yan, and Ashraf Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. In ECCV, pages 57–72, 2016.
[39] Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face alignment. In CVPR, pages 532–539, 2013.
[40] Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. FReeNet: Multi-identity face reenactment. In CVPR, pages 5325–5334, 2020.
[41] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[42] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In ECCV, pages 94–108, 2014.
[43] Meilu Zhu, Daming Shi, Mingjie Zheng, and Muhammad Sadiq. Robust facial landmark detection via occlusion-adaptive deep networks. In CVPR, pages 3481–3491, 2019.
[44] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In CVPR, pages 4998–5006, 2015.
[45] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, pages 2879–2886, 2012.
[46] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2020.
[47] Xu Zou, Sheng Zhong, Luxin Yan, Xiangyun Zhao, Jiahuan Zhou, and Ying Wu. Learning robust facial landmark detection via hierarchical structured ensemble. In ICCV, pages 141–150, 2019.