CAM++: A Fast and Efficient Network For Speaker Verification Using Context-Aware Masking
Hui Wang, Siqi Zheng, Yafeng Chen, Luyao Cheng, Qian Chen
inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an architecture that can achieve the performance of ECAPA-TDNN and the efficiency of vanilla TDNN. In this paper, we propose an efficient network based on context-aware masking, namely CAM++, which uses densely connected time delay neural network (D-TDNN) as backbone and adopts a novel multi-granularity pooling to capture contextual information at different levels. Extensive experiments on two public benchmarks, VoxCeleb and CN-Celeb, demonstrate that the proposed architecture outperforms other mainstream speaker verification systems with lower computational cost and faster inference speed.†

Index Terms: speaker verification, densely connected time delay neural network, context-aware masking, computational complexity

† The source code is available at https://github.com/alibaba-damo-academy/3D-Speaker
1. Introduction

Speaker verification (SV) is the task of automatically verifying whether an utterance is pronounced by a hypothesized speaker based on the voice characteristics [1]. Typically, a speaker verification system consists of two main components: an embedding extractor, which transforms an utterance of arbitrary length into a fixed-dimensional speaker embedding, and a back-end model that calculates the similarity score between embeddings [2, 3].

Over the past few years, speaker verification systems based on deep learning methods [2, 4, 5, 6, 7] have achieved remarkable improvements. One of the most popular systems is the x-vector, which adopts a time delay neural network (TDNN) as its backbone. TDNN applies one-dimensional convolution along the time axis to capture local temporal context information. Following the successful application of the x-vector, several modifications have been proposed to enhance the robustness of these networks. ECAPA-TDNN [4] unifies a one-dimensional Res2Block with squeeze-excitation [8] and expands the temporal context of each layer, achieving significant improvement. At the same time, the topology of the x-vector has been improved by incorporating elements of ResNet [9], which uses a two-dimensional convolutional neural network (CNN) with convolutions along both the time and frequency axes. Equipped with residual connections, ResNet-based systems [10, 11] have achieved outstanding results. However, these systems typically come at the cost of more parameters, higher computation complexity, and slower inference speed.

Recently, [5] proposed a TDNN-based architecture called the densely connected time delay neural network (D-TDNN), which adopts bottleneck layers and dense connectivity. It obtains better accuracy with fewer parameters than the vanilla TDNN. Later, [6] proposed a context-aware masking (CAM) module that makes the D-TDNN focus on the speaker of interest and "blur" unrelated noise, while requiring only a small additional computation cost. Despite significant improvements in accuracy, a large performance gap remains compared to other state-of-the-art speaker models [4].

In this paper, we propose CAM++, an efficient and accurate network for speaker embedding learning that utilizes D-TDNN as its backbone, as shown in Figure 1. We adopt multiple methodologies to enhance the CAM module and the D-TDNN architecture. Firstly, we design a lighter CAM module and insert it into each D-TDNN layer to place more focus on the speaker characteristics of interest. Multi-granularity pooling is an essential component of the CAM module, built to capture contextual information at both global and segment levels; a previous study [12] showed that multi-granularity pooling achieves performance comparable to a transformer structure with much higher efficiency (see the illustrative sketch at the end of this introduction). Secondly, we adopt a narrower network with fewer filters in each D-TDNN layer while significantly increasing the network depth compared to the vanilla D-TDNN [5]. This is motivated by [11], which observed that deeper layers bring more improvement than wider channels for speaker verification. Finally, we incorporate a two-dimensional convolution module as a front-end to make the D-TDNN network more invariant to frequency shifts in the input features; a hybrid architecture of TDNN and CNN has been shown to yield further improvements [13, 14]. We evaluate the proposed architecture on two public benchmarks, VoxCeleb [15] and CN-Celeb [16, 17]. The results show that our method obtains 0.73% and 6.78% EER on the VoxCeleb-O and CN-Celeb test sets, respectively. Furthermore, our architecture has lower computation complexity and faster inference speed than the popular ECAPA-TDNN and ResNet34 systems.
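As a concrete illustration of the multi-granularity pooling idea, the following minimal PyTorch sketch combines a global context vector with segment-level context to predict a sigmoid mask over the feature map. The module name, segment length, bottleneck width, and the additive fusion are illustrative assumptions, not the exact design of CAM++:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareMask(nn.Module):
    """Illustrative context-aware mask driven by multi-granularity pooling.
    Assumed design: global mean pooling plus fixed-length segment mean
    pooling, fused and mapped to a per-channel, per-frame sigmoid mask."""

    def __init__(self, channels, bottleneck=64, seg_len=100):
        super().__init__()
        self.seg_len = seg_len
        self.down = nn.Conv1d(channels, bottleneck, kernel_size=1)
        self.up = nn.Conv1d(bottleneck, channels, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, frames)
        b, c, t = x.shape
        # Global context: mean over all frames, broadcast back to every frame.
        g = x.mean(dim=2, keepdim=True).expand(b, c, t)
        # Segment context: mean within fixed-length segments, then upsampled
        # back to the frame rate so it aligns with the input.
        n_seg = max(t // self.seg_len, 1)
        s = F.adaptive_avg_pool1d(x, n_seg)           # (b, c, n_seg)
        s = F.interpolate(s, size=t, mode='nearest')  # (b, c, t)
        # Fuse the two granularities and predict a mask that re-weights x.
        mask = torch.sigmoid(self.up(F.relu(self.down(g + s))))
        return x * mask
```

In this sketch the global branch summarizes the whole utterance while the segment branch retains coarse temporal locality; the resulting mask re-weights each channel-frame entry of the input, which is the "focus on the speaker of interest, blur unrelated noise" behavior described above.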
2. System description

2.1. Overview

The overall framework of the proposed CAM++ architecture is illustrated in Figure 1. The architecture mainly consists of two components: the front-end convolution module (FCM) and the D-TDNN backbone.
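A minimal sketch of what such a front-end convolution module could look like, assuming 80-dimensional filterbank inputs, two 2-D convolution blocks, and stride-2 downsampling along frequency (all hypothetical choices for illustration, not the released implementation):

```python
import torch
import torch.nn as nn


class FCM(nn.Module):
    """Illustrative front-end convolution module: 2-D convolutions over the
    (frequency, time) plane to gain robustness to frequency shifts, with the
    output flattened into 1-D channel-frame features for the TDNN backbone."""

    def __init__(self, n_mels=80, out_channels=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=(2, 1), padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # The frequency axis is downsampled twice by stride 2.
        self.out_dim = out_channels * (n_mels // 4)

    def forward(self, x):  # x: (batch, n_mels, frames) filterbank features
        x = x.unsqueeze(1)             # -> (batch, 1, n_mels, frames)
        x = self.conv(x)               # -> (batch, C, n_mels // 4, frames)
        b, c, f, t = x.shape
        return x.reshape(b, c * f, t)  # -> (batch, C * f, frames) for D-TDNN
```

Flattening the channel and frequency axes at the end yields the one-dimensional channel-frame representation that the D-TDNN backbone consumes.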
Specifically, the basic unit of D-TDNN consists of a feed-forward neural network (FNN) and a TDNN layer. A direct connection is applied between the inputs of every two consecutive D-TDNN layers. The formulation of the l-th D-TDNN layer is:

X_l = [X_{l-1}, H_l(X_{l-1})],

where H_l(·) denotes the non-linear transformation of the l-th layer (the FNN followed by the TDNN layer) and [·, ·] denotes concatenation along the feature dimension.
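This unit can be sketched in PyTorch as follows; the growth rate, bottleneck width, and kernel size below are hypothetical values for illustration:

```python
import torch
import torch.nn as nn


class DTDNNLayer(nn.Module):
    """One D-TDNN unit: an FNN bottleneck followed by a TDNN (dilated 1-D
    convolution), with dense connectivity, i.e. the layer output is
    concatenated with its input."""

    def __init__(self, in_channels, growth_rate=32, bottleneck=128,
                 kernel_size=3, dilation=1):
        super().__init__()
        # FNN bottleneck: a position-wise linear map, expressed as a 1x1 conv.
        self.fnn = nn.Sequential(
            nn.BatchNorm1d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(in_channels, bottleneck, kernel_size=1, bias=False),
        )
        # TDNN layer: temporal convolution over the bottleneck features;
        # padding preserves the number of frames for odd kernel sizes.
        self.tdnn = nn.Sequential(
            nn.BatchNorm1d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv1d(bottleneck, growth_rate, kernel_size, dilation=dilation,
                      padding=dilation * (kernel_size // 2), bias=False),
        )

    def forward(self, x):  # x: (batch, in_channels, frames)
        out = self.tdnn(self.fnn(x))
        # Dense connection: X_l = [X_{l-1}, H_l(X_{l-1})]
        return torch.cat([x, out], dim=1)
```

Because each layer appends only growth_rate channels to its input, the feature dimension grows linearly with depth, which is what allows a narrow but deep D-TDNN to stay compact.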
Table 2: Performance comparison of multiple key components of CAM++. GP represents masking with only global pooling and SP denotes segment pooling.

Table 3: The number of parameters, floating-point operations (FLOPs) and real-time factor (RTF) of different models. RTF was evaluated on CPU under single-thread condition.