CAM++: A Fast and Efficient Network For Speaker Verification Using Context-Aware Masking
Hui Wang, Siqi Zheng, Yafeng Chen, Luyao Cheng, Qian Chen
inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an architecture that can achieve the performance of ECAPA-TDNN and the efficiency of vanilla TDNN. In this paper, we propose an efficient network based on context-aware masking, namely CAM++, which uses densely connected time delay neural network (D-TDNN) as backbone and adopts a novel multi-granularity pooling to capture contextual information at different levels. Extensive experiments on two public benchmarks, VoxCeleb and CN-Celeb, demonstrate that the proposed architecture outperforms other mainstream speaker verification systems with lower computational cost and faster inference speed.†

Index Terms: speaker verification, densely connected time delay neural network, context-aware masking, computational complexity

† The source code is available at https://github.com/alibaba-damo-academy/3D-Speaker
1. Introduction

Speaker verification (SV) is the task of automatically verifying whether an utterance is pronounced by a hypothesized speaker based on the voice characteristics [1]. Typically, a speaker verification system consists of two main components: an embedding extractor, which transforms an utterance of arbitrary length into a fixed-dimensional speaker embedding, and a back-end model that calculates the similarity score between embeddings [2, 3].

Over the past few years, speaker verification systems based on deep learning methods [2, 4, 5, 6, 7] have achieved remarkable improvements. One of the most popular systems is the x-vector, which adopts a time delay neural network (TDNN) as its backbone. TDNN applies one-dimensional convolution along the time axis to capture local temporal context information. Following the successful application of the x-vector, several modifications have been proposed to enhance the robustness of these networks. ECAPA-TDNN [4] unifies a one-dimensional Res2Block with squeeze-excitation [8] and expands the temporal context of each layer, achieving significant improvement. At the same time, the topology of the x-vector has been improved by incorporating elements of ResNet [9], which uses a two-dimensional convolutional neural network (CNN) with convolutions along both the time and frequency axes. Equipped with residual connections, ResNet-based systems [10, 11] have achieved outstanding results. However, these systems typically come at the cost of more parameters, higher computation complexity, and slower inference speed.

Recently, [5] proposed a TDNN-based architecture called the densely connected time delay neural network (D-TDNN), which adopts bottleneck layers and dense connectivity. It obtains better accuracy with fewer parameters than the vanilla TDNN. Later, [6] proposed a context-aware masking (CAM) module that makes the D-TDNN focus on the speaker of interest and "blur" unrelated noise, while requiring only a small additional computation cost. Despite significant improvements in accuracy, a large performance gap remains compared to other state-of-the-art speaker models [4].

In this paper, we propose CAM++, an efficient and accurate network for speaker embedding learning that utilizes D-TDNN as its backbone, as shown in Figure 1. We adopt multiple methodologies to enhance the CAM module and the D-TDNN architecture. Firstly, we design a lighter CAM module and insert it into each D-TDNN layer to place more focus on the speaker characteristics of interest. Multi-granularity pooling is an essential component of the CAM module, built to capture contextual information at both global and segment levels; a previous study [12] showed that multi-granularity pooling achieves performance comparable to a transformer structure with much higher efficiency (see the illustrative sketch at the end of this introduction). Secondly, we adopt a narrower network with fewer filters in each D-TDNN layer while significantly increasing the network depth compared to the vanilla D-TDNN [5]. This is motivated by [11], which observed that deeper layers bring more improvement than wider channels for speaker verification. Finally, we incorporate a two-dimensional convolution module as a front-end to make the D-TDNN network more invariant to frequency shifts in the input features; a hybrid architecture of TDNN and CNN has been shown to yield further improvements [13, 14]. We evaluate the proposed architecture on two public benchmarks, VoxCeleb [15] and CN-Celeb [16, 17]. The results show that our method obtains 0.73% and 6.78% EER on the VoxCeleb-O and CN-Celeb test sets, respectively. Furthermore, our architecture has lower computation complexity and faster inference speed than the popular ECAPA-TDNN and ResNet34 systems.
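As a concrete illustration of the multi-granularity pooling idea, the following minimal PyTorch sketch combines a global context vector with segment-level context to predict a sigmoid mask over the feature map. The module name, segment length, bottleneck width, and the additive fusion are illustrative assumptions, not the exact design of CAM++:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareMask(nn.Module):
    """Illustrative context-aware mask driven by multi-granularity pooling.
    Assumed design: global mean pooling plus fixed-length segment mean
    pooling, fused and mapped to a per-channel, per-frame sigmoid mask."""

    def __init__(self, channels, bottleneck=64, seg_len=100):
        super().__init__()
        self.seg_len = seg_len
        self.down = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.up = nn.Conv1d(bottleneck, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, frames)
        b, c, t = x.shape
        # Global context: mean over all frames, broadcast back to every frame.
        g = x.mean(dim=2, keepdim=True).expand(b, c, t)
        # Segment context: mean within fixed-length segments, then upsampled
        # back to the frame rate so it aligns with the input.
        n_seg = max(t // self.seg_len, 1)
        s = F.adaptive_avg_pool1d(x, n_seg)           # (b, c, n_seg)
        s = F.interpolate(s, size=t, mode='nearest')  # (b, c, t)
        # Fuse the two granularities and predict a mask that re-weights x.
        mask = torch.sigmoid(self.up(F.relu(self.down(g + s))))
        return x * mask
```

In this sketch the global branch summarizes the whole utterance while the segment branch retains coarse temporal locality; the resulting mask re-weights each channel-frame entry of the input, which is the "focus on the speaker of interest, blur unrelated noise" behavior described above.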
2. System description

2.1. Overview

The overall framework of the proposed CAM++ architecture is illustrated in Figure 1. The architecture mainly consists of two components: the front-end convolution module (FCM) and the D-TDNN backbone.
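A minimal sketch of what such a front-end convolution module could look like, assuming 80-dimensional filterbank inputs, two 2-D convolution blocks, and stride-2 downsampling along frequency (all hypothetical choices for illustration, not the released implementation):

```python
import torch
import torch.nn as nn


class FCM(nn.Module):
    """Illustrative front-end convolution module: 2-D convolutions over the
    (frequency, time) plane to gain robustness to frequency shifts, with the
    output flattened into 1-D channel-frame features for the TDNN backbone."""

    def __init__(self, n_mels=80, out_channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # The frequency axis is downsampled twice by stride 2.
        self.out_dim = out_channels * (n_mels // 4)

    def forward(self, x):  # x: (batch, n_mels, frames) filterbank features
        x = x.unsqueeze(1)             # -> (batch, 1, n_mels, frames)
        x = self.conv(x)               # -> (batch, C, n_mels // 4, frames)
        b, c, f, t = x.shape
        return x.reshape(b, c * f, t)  # -> (batch, C * f, frames) for D-TDNN
```

Flattening the channel and frequency axes at the end yields the one-dimensional channel-frame representation that the D-TDNN backbone consumes.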
Specifically, the basic unit of D-TDNN consists of a feed-forward neural network (FNN) and a TDNN layer. A direct connection is applied between the inputs of every two consecutive D-TDNN layers. The formulation of the l-th D-TDNN layer is:

X_l = [X_{l-1}, H_l(X_{l-1})],

where H_l(·) denotes the non-linear transformation of the l-th layer (the FNN followed by the TDNN layer) and [·, ·] denotes concatenation along the feature dimension.
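This unit can be sketched in PyTorch as follows; the growth rate, bottleneck width, and kernel size below are hypothetical values for illustration:

```python
import torch
import torch.nn as nn


class DTDNNLayer(nn.Module):
    """One D-TDNN unit: an FNN bottleneck followed by a TDNN (dilated 1-D
    convolution), with dense connectivity, i.e. the layer output is
    concatenated with its input."""

    def __init__(self, in_channels, growth_rate=32, bottleneck=128,
                 kernel_size=3, dilation=1):
        super().__init__()
        # FNN bottleneck: a position-wise linear map, expressed as a 1x1 conv.
        self.fnn = nn.Sequential(
            nn.BatchNorm1d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(in_channels, bottleneck, kernel_size=1, bias=False),
        )
        # TDNN layer: temporal convolution over the bottleneck features;
        # padding preserves the number of frames for odd kernel sizes.
        self.tdnn = nn.Sequential(
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, growth_rate, kernel_size, dilation=dilation,
                      padding=dilation * (kernel_size // 2), bias=False),
        )

    def forward(self, x):  # x: (batch, in_channels, frames)
        out = self.tdnn(self.fnn(x))
        # Dense connection: X_l = [X_{l-1}, H_l(X_{l-1})]
        return torch.cat([x, out], dim=1)
```

Because each layer appends only growth_rate channels to its input, the feature dimension grows linearly with depth, which is what allows a narrow but deep D-TDNN to stay compact.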
Table 2: Performance comparison of multiple key components of CAM++. GP represents masking with only global pooling and SP denotes segment pooling.

Table 3: The number of parameters, floating-point operations (FLOPs) and real-time factor (RTF) of different models. RTF was evaluated on CPU under single-thread condition.