MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving
Abstract
This paper introduces MaskAnyone, a novel toolkit designed to navigate some privacy and ethical concerns
of sharing audio-visual data in research. MaskAnyone offers a scalable, user-friendly solution for de-
identifying individuals in video and audio content through face-swapping and voice alteration, supporting
multi-person masking and real-time bulk processing. By integrating this tool within research practices, we
aim to enhance data reproducibility and utility in social science research. Our approach draws on Design
Science Research, proposing that MaskAnyone can facilitate safer data sharing and potentially reduce the
storage of fully identifiable data. We discuss the development and capabilities of MaskAnyone, explore its
integration into ethical research practices, and consider the broader implications of audio-visual data
masking, including issues of consent and the risk of misuse. The paper concludes with a preliminary
evaluation framework for assessing the effectiveness and ethical integration of masking tools in such
research settings.
Keywords: audio-visual data, open science, de-identification strategies, design science research, data
sharing
Introduction
Audio-visual data with human subjects is crucial for behavioral sciences and linguistics and provides insights into human behavior and communication (Abney et al. 2018; Cienki 2016; D’Errico et al. 2015; Gregori et al. 2023). However, including identifiable human data raises ethical and privacy concerns (Abay et al. 2019; Benson et al. 2020; Bishop 2009; Jarolimkova and Drobikova 2019; Jeng et al. 2016; Johannesson and Perjons 2021). The GDPR outlines frameworks for handling such data responsibly (Nautsch et al. 2019). In the spirit of open science, more platforms, artifacts, and tools are needed to balance privacy and data sharing, especially in the social sciences and humanities (Hunyadi et al. 2016; Qian et al. 2018).
Using audio-visual data in social and behavioral sciences requires a deep understanding of ethical, legal,
and methodological aspects. However, using such data is necessary to ensure research transparency and
reproducibility, with recent work arguing that retaining interview data should be the default (Resnik et al. 2024). Incorporating audio-visual data in research introduces a host of complexities, from data collection
and analysis to interpretation. This can impact the reliability and validity of findings, raising concerns about
potential biases, misinterpretations, and inadvertent capture of unrelated audiovisual data. Aligning and
harmonizing audio-visual inputs is particularly challenging, especially with recent trends in generating
audio-visual data. Considering these challenges, the proposed toolkit, MaskAnyone, becomes a necessity.
Researchers must take the lead in considering the unique challenges associated with audio-visual data to
safeguard their research's integrity, validity, and ethical conduct.
Building on existing work (Khasbage et al. 2022; Owoyele et al. 2022), we propose MaskAnyone, a toolkit for de-identifying individuals in audio-visual data. MaskAnyone offers multi-person masking, real-time bulk processing, and a user-friendly interface 3, essential for handling large datasets efficiently while reducing privacy risks. With accessibility in mind, MaskAnyone lets social scientists mask identifiable visual and auditory information using techniques such as face-swapping and voice alteration. This modular approach allows researchers who do not write code to customize the anonymity level based on the sensitivity of the data and the available computing resources. The toolkit's flexibility is essential, as it enables researchers to use it on
personal machines for smaller projects or scale it up to server-based environments for institutional research
involving larger datasets. This adaptability allows researchers to tailor the toolkit to their needs and
resources. Considering the challenges and related issues above, as well as the opportunities to leverage
developments in information systems and computer science, our paper is guided by the following research
questions and objectives:
1. How can researchers effectively navigate/balance subjects' privacy with the utility of audio-visual
data?
2. What techniques can be employed to ensure the ethical use of audio-visual data in compliance with
stringent regulatory frameworks?
3. What implications does masking audio/visual data have for generating and analyzing more
synthetic data, and what evaluation-related challenges remain in designing and iterating such
tools?
Drawing explicitly on the design science research framework (Hevner et al. 2004), we iteratively developed a scalable toolkit to de-identify audiovisual data. Co-developed with researchers and data stewards, the tool can be integrated into current research practices to promote ethical data sharing. MaskAnyone aims to navigate the privacy risks associated with audio-visual data sharing. Distinct from existing solutions (Khasbage et al. 2022; Owoyele et al. 2022), the toolkit also supports exporting body and face pose data as JSON and CSV files, along with multi-person masking and real-time bulk processing, while offering a modular, user-friendly interface. It incorporates various masking techniques for visual (e.g., face-swapping) and auditory elements of identifiable information and is designed to be scalable for handling large datasets. By providing customizable options, MaskAnyone allows researchers to balance privacy and data utility. Tailored to data sensitivity requirements and computing needs, MaskAnyone can run on personal machines or on a (secured) server with substantial computing power
to accommodate multiple researchers with large datasets. Our paper proceeds as follows: First, we delve
into recent literature on the masking of audiovisual data and discuss the design methodology used. Next,
we detail the techniques and algorithms we employed for de-identification in video and audio domains. We
present our preliminary evaluations, describing the system, sample datasets, and proposed evaluation
metrics. The discussion section offers a broader outlook on how we want to evaluate toolkits like
MaskAnyone, ending with broader implications, opportunities, and issues of masking practices.
3 https://anonymous.4open.science/r/Privacy4MultimodalAnalysis-57D4/results/masking_s2_masking.png
Prior work has thus explored various forms of masking, anonymization, and additional safeguards to enable audio-visual data sharing. As a field, we need to be able
to judge and validate what degree of masking is considered a reduction of identifiable information, de-identification, pseudonymization, or complete irreversible anonymization. This is not merely an abstract theoretical question, as failure to comply with regulatory privacy frameworks can have direct legal
repercussions. These considerations form the challenging context within which audio-visual masking tools
must operate. Nonetheless, when integrated into ethical procedures already upheld by research institutes,
toolkits like MaskAnyone promise to support the safe sharing of multimodal language data.
Hiding
Upon detecting a person in a video, the hiding strategy offers multiple options for concealment. One common approach is to obfuscate the identified area using techniques such as Gaussian blurring, pixelation, or Laplacian edge detection, which can be implemented using image processing libraries such as OpenCV (Bradski 2000; OpenCV 2023). While machine learning-based methods such as PiDiNet (Su et al. 2021) and UAED (Zhou 2023) may offer enhanced quality and privacy, they do not necessarily guarantee robust anonymity, particularly for videos featuring well-known individuals (Hasan et al. 2018; Lander et al. 2001). Another tactic involves overlaying the detected area with an opaque color, effectively hiding person-specific attributes except for potentially identifiable characteristics like height or clothing. Alternatively, one could employ inpainting techniques to estimate and replace the background behind the detected individuals, as demonstrated by projects such as Inpaint Anything (Yu et al. 2023), STTN (Zeng et al. 2020), and E2FGVI (Dang and Buschek 2021; Li 2022; Ripperda et al. 2020). While earlier methods like STTN had limitations such as resolution constraints, recent developments like E2FGVI offer more flexibility and improved performance. The field continues to evolve, mainly focusing on responsible data sharing (Morehouse 2023), with emerging approaches like DMT (Yu 2023) promising even more effective results.
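To make these concealment options concrete, the following sketch (not MaskAnyone's actual implementation) applies Gaussian blurring, pixelation, or an opaque overlay to a detected region using OpenCV; the kernel size, pixel-block size, and bounding-box convention are illustrative assumptions.

```python
# Illustrative hiding step, assuming a person bounding box (x, y, w, h) is known.
import cv2
import numpy as np

def hide_region(frame: np.ndarray, box: tuple, mode: str = "blur") -> np.ndarray:
    """Obscure the region given by box = (x, y, w, h) in a BGR frame."""
    x, y, w, h = box
    roi = frame[y:y + h, x:x + w]
    if mode == "blur":
        # Gaussian blurring; larger (odd) kernels remove more detail.
        roi = cv2.GaussianBlur(roi, (51, 51), 0)
    elif mode == "pixelate":
        # Downscale, then upscale with nearest-neighbour interpolation.
        small = cv2.resize(roi, (16, 16), interpolation=cv2.INTER_LINEAR)
        roi = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
    elif mode == "overlay":
        # Opaque overlay: hides appearance, keeps silhouette extent.
        roi = np.zeros_like(roi)
    frame[y:y + h, x:x + w] = roi
    return frame
```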
Masking
Hiding leads to a high loss of information, such as facial expressions or other expressive movements, that
may be important for linguistic and behavioral research. Masking aims to maintain a representation of this
information without retaining personally identifiable information. Landmark detection for humans in monocular video data refers to identifying and localizing specific points on the human body, such as joints or facial features; the task can be further divided into 2D and 3D landmark detection. Several models offer advanced features for landmark detection but have specific limitations, be it computational speed, accuracy, or full-body handling. For example, OpenPose is noted for its accuracy but is computationally slower (Mroz et al. 2021). AlphaPose, an open-source model, has gained attention for innovative approaches like symmetric integral keypoint regression (Fang et al. 2022). ViTPose offers a scalable
architecture for human pose estimation based on a Vision Transformer that can scale up to 1B parameters
(Xu et al. 2022). MediaPipe's Holistic model was developed to mitigate limitations in full-body detection
and offers an approximation of 3D positions. We, therefore, settled on using this approach for the current
version of MaskAnyone. The BlazePose model underlying MediaPipe uses a two-stage approach, in which a single-shot-detection (SSD) based detector first locates the bounding boxes of people (Grishchenko and Bazarevsky 2020), after which a tracker network estimates the landmarks within each detected region (Bazarevsky et al. 2020). Still, none of these models by itself resolves the tension between information preservation and privacy assurance. This is understandable, as these individual tools were not explicitly developed for our context and problem space: to help social and behavioral science researchers navigate audio-visual data-sharing practices and the risks inherent in such endeavors.
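As an illustration of the landmark extraction underlying this masking strategy, the sketch below uses MediaPipe's Holistic solution to collect per-frame pose landmarks and write them to JSON; the output schema is our own simplification, not MaskAnyone's export format.

```python
# Sketch of per-frame pose-landmark extraction with MediaPipe Holistic.
import cv2
import json
import mediapipe as mp

def extract_pose_landmarks(video_path: str, out_json: str) -> None:
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads BGR.
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks:
                frames.append([
                    {"x": lm.x, "y": lm.y, "z": lm.z, "visibility": lm.visibility}
                    for lm in results.pose_landmarks.landmark
                ])
            else:
                frames.append(None)  # no person detected in this frame
    cap.release()
    with open(out_json, "w") as f:
        json.dump(frames, f)
```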
Design Requirements
Our toolkit, MaskAnyone, is developed as an instantiation using the Design Science Research framework (Hevner et al. 2004). We enumerate the following design requirements based on user feedback from two live demo workshops with researchers at a Dutch university (n=16) and a German university (n=10), as well as with data stewards (n=5). We have also used insights from the existing literature on multimodal analysis to justify the relevance of these requirements, although we know they are non-exhaustive and may change with future scenarios and evolving stakeholder needs in multimodal behavior research.
Table 1. Design requirements (columns: Requirement, Description, Source/Literature).
4 https://anonymous.4open.science/r/Privacy4MultimodalAnalysis-57D4/results/masking_s2_masking.png
In response to the requirements for server-environment support, scalability, and extensibility (R5, R7, R8) that emerged from our survey and workshops with data stewards and early-career researchers, we designed MaskAnyone as a web-based application accessible via standard web browsers. The architecture employs a Manager-Worker pattern to distribute tasks efficiently. The backend serves as the central hub for user interactions and job management, while workers handle the computational heavy lifting of video masking. This setup allows for scalable and extensible operations without limiting the application's ability to run locally. The accompanying architecture figure illustrates this high-level design. The front-end uses React and TypeScript to create a single-page application (SPA) that interacts with the backend via an HTTP API. The backend is accessible through an Nginx reverse proxy and is built with Python for consistency with the worker component. Data persistence is managed through a PostgreSQL RDBMS, although media files are stored directly within the file system. Worker processes, also implemented in Python, register with the backend and receive masking jobs via an HTTP API. Docker and docker-compose are used for orchestration, making the application easy to set up with minimal prerequisites.
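To make the Manager-Worker pattern concrete, the sketch below shows a worker that registers with the backend and polls for masking jobs; the endpoint names, payloads, and the mask_video helper are hypothetical and do not mirror MaskAnyone's actual HTTP API.

```python
# Schematic worker loop for a Manager-Worker setup (hypothetical endpoints).
import time
import requests

BACKEND = "http://localhost:8000/api"  # assumed backend URL behind the Nginx proxy

def run_worker(worker_id: str) -> None:
    # Register the worker so the backend can assign masking jobs to it.
    requests.post(f"{BACKEND}/workers/register", json={"id": worker_id})
    while True:
        resp = requests.get(f"{BACKEND}/workers/{worker_id}/next-job")
        if resp.status_code == 204:  # no queued job; wait and poll again
            time.sleep(5)
            continue
        job = resp.json()
        # mask_video() is a placeholder for the worker's actual masking pipeline.
        result_path = mask_video(job["video_path"], job["options"])
        requests.post(f"{BACKEND}/jobs/{job['id']}/result",
                      json={"status": "done", "result_path": result_path})
```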
The configuration process comprises four steps, offering users granular control over the masking. The first
step provides options to control the hiding parameters for detected people and, optionally, the background.
The second step involves specifying additional masking techniques to preserve important visual
information. In the third step, decisions on audio masking can be made, such as keeping, removing, or voice-converting the original audio. Finally, the fourth step allows for exporting additional data, such as kinematic information, either for more advanced analytics or for external processing in combination with audio-visual annotation tools used in behavioral science and NLP pipelines operating on the speech in such masked videos. This workflow is designed to meet our articulated usability and flexibility requirements, offering a streamlined yet comprehensive masking solution.
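A job specification mirroring these four steps might look as follows; the key names and values are purely illustrative and are not MaskAnyone's actual configuration schema.

```python
# Hypothetical masking-job specification following the four configuration steps.
masking_job = {
    "hiding": {                       # step 1: conceal detected people / background
        "strategy": "blackout",       # e.g. blur, pixelate, blackout, inpaint
        "background": "keep",
    },
    "masking": {                      # step 2: preserve selected visual information
        "strategy": "face_mesh",      # e.g. skeleton, face_mesh, face_swap, avatar
    },
    "audio": {                        # step 3: audio handling
        "strategy": "switch",         # preserve, remove, or switch (voice conversion)
        "target_voice": "neutral_female",
    },
    "exports": ["pose_json", "pose_csv"],  # step 4: kinematic data for analysis
}
```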
MaskAnyone employs two primary methods for detecting people in videos: YOLOv8 and MediaPipe. YOLO
offers multiple pre-trained models with varying levels of accuracy and computational demand, including a
specialized face-detection model. MediaPipe provides pose detection and an internal person mask. Users
can adjust parameters like confidence thresholds for both methods to fine-tune utility and privacy
tradeoffs 5. After successful person detection, MaskAnyone offers several hiding techniques, visualized in
Table 2 below:
Table 2. Hiding strategies.
Strategy | Description | Effectiveness | Illustrative Results
Blackout | Completely blacks out the detected person, removing most identifiable information if the mask is accurate. | High, provided the mask is accurately applied. | (image omitted)
Note that video inpainting methods are resource-intensive and may require powerful hardware for timely
execution. The integrated STTN model is limited to fixed resolutions, but alternative models supporting
different resolutions are available.
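The sketch below illustrates how person detection and the blackout strategy can be combined: a pre-trained YOLOv8 model detects people and each detection is covered with an opaque rectangle. The weights file and confidence threshold are illustrative choices rather than the toolkit's defaults.

```python
# Illustrative person detection (YOLOv8) followed by blackout hiding.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pre-trained COCO model; class 0 is "person"

def blackout_people(frame, conf_threshold: float = 0.25):
    # Lower thresholds mask more candidate regions (better privacy) at the
    # cost of more false positives (lower utility).
    results = model(frame, conf=conf_threshold, classes=[0], verbose=False)
    for x1, y1, x2, y2 in results[0].boxes.xyxy.cpu().numpy().astype(int).tolist():
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), thickness=-1)
    return frame
```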
MaskAnyone provides various masking options to preserve key information while maintaining privacy.
These are summarized in Table 3 below and include:
Table 3. Masking strategies.
Strategy | Description | Effectiveness | Sample Illustration
Face-Mesh | Extracts and renders 478 facial landmarks, preserving facial features and expressions when faces are adequately sized in the frame. | High for facial detail preservation. | (image omitted)
Holistic | Combines pose, face, and hand landmark detection in a heuristic-based implementation. | Comprehensive for full-body interaction contexts. | A standalone program yet to be integrated into MaskAnyone.
Face Swap | Swaps the subject's face with a selected target face, maintaining expressions while hiding identity. | High for identity concealment while preserving expressions. | (image omitted)
Rendered Avatar | Utilizes Blender and the BlendARMocap plugin to transform motion coordinates into a 3D avatar. | High for anonymity, maintains motion fidelity. | (image omitted)

5 For UI screenshots and videos of the different masking techniques, see the .mp4 files at https://anonymous.4open.science/r/Privacy4MultimodalAnalysis-57D4/README.md
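To give an impression of the Face-Mesh strategy in Table 3, the sketch below extracts facial landmarks with MediaPipe FaceMesh and renders them on a blacked-out canvas, so expressions remain visible while appearance is hidden; it is a simplification of the toolkit's actual rendering pipeline.

```python
# Simplified Face-Mesh masking: landmarks rendered on a blank canvas.
import cv2
import numpy as np
import mediapipe as mp

def face_mesh_mask(frame_bgr: np.ndarray) -> np.ndarray:
    canvas = np.zeros_like(frame_bgr)  # discard the original pixels
    with mp.solutions.face_mesh.FaceMesh(refine_landmarks=True,
                                         max_num_faces=1) as face_mesh:
        results = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        # refine_landmarks=True yields the 478-landmark variant mentioned above.
        mp.solutions.drawing_utils.draw_landmarks(
            canvas,
            results.multi_face_landmarks[0],
            mp.solutions.face_mesh.FACEMESH_TESSELATION,
        )
    return canvas
```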
Voice masking is a crucial extension of MaskAnyone's video masking capabilities. We offer three primary
approaches to audio handling, as enumerated in Table 4 below:
Table 4. Audio handling strategies.
Strategy | Description | Effectiveness | Sample Demonstration
Preserve | Retains the original audio and is suitable only for video masking. | Limited privacy, maintains original audio. | N/A
Remove | Eliminates all audio for maximum privacy. | High for privacy, low for utility. | N/A
Switch | Transforms the original voice to a target voice. | Balances privacy with utility and engagement. | N/A
These features were implemented using MoviePy for voice extraction, RVC for voice conversion, and
FFmpeg for audio-video merging. Various pre-trained target voices are available, and the system is
extensible. We are also exploring other voice masking techniques, such as those from the Voice Privacy
Challenge. End-users are advised to manually evaluate the privacy level/requirements before sharing.
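The following sketch outlines this audio pipeline: the track is extracted with MoviePy, passed through a voice-conversion step, and merged back with FFmpeg. The convert_voice function is a placeholder standing in for the RVC-based conversion, and the file names are arbitrary.

```python
# Sketch of the switch-voice pipeline (convert_voice is a placeholder for RVC).
import subprocess
from moviepy.editor import VideoFileClip

def switch_voice(video_in: str, video_out: str) -> None:
    # 1. Extract the original audio track.
    VideoFileClip(video_in).audio.write_audiofile("original.wav")

    # 2. Convert the voice to a target speaker (placeholder call).
    converted_wav = convert_voice("original.wav", target_voice="target_voice.pth")

    # 3. Replace the audio stream without re-encoding the video.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_in, "-i", converted_wav,
        "-map", "0:v:0", "-map", "1:a:0", "-c:v", "copy", video_out,
    ], check=True)
```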
Automatic Evaluation
Video Evaluation
Utility: Our evaluation strategy is divided into privacy preservation and utility retention. For utility, we
focus on a specific use-case: emotion classification. We apply our face-swapping technique on selected
videos with precise headshots and then use an existing emotion classification model to categorize emotions
in these videos. We introduce an agreement score that measures the concordance between the emotional
states classified in the original and anonymized videos. This score serves as a proxy for utility retention,
with a high agreement indicating practical preservation of emotional cues. We sourced five images and
three videos for this preliminary evaluation, with subjects explicitly facing the camera. Face-swapping was
also applied to generate 15 test samples. The insights from our survey and user studies underscore the
importance of the usability of the toolkit. Moreover, the issues raised by the participants relate to the utility
of the data after it has been masked, depending on the research context and strategies selected to preserve
utility. At the same time, we learned that fully anonymizing videos cannot be guaranteed; however, the tradeoff can be balanced sensibly, since more researchers can reproduce existing research when they can access the data. This suggests avenues for further refinement of our proposed methodology.
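A minimal version of the agreement score described above can be computed as the fraction of clips (or frames) for which the emotion labels predicted on the original and the masked video coincide; Cohen's kappa is added here as an optional chance-corrected variant, not as part of the reported evaluation.

```python
# Agreement between emotion labels predicted on original vs. masked videos.
from sklearn.metrics import cohen_kappa_score

def agreement(original_labels, masked_labels):
    assert len(original_labels) == len(masked_labels)
    raw = sum(o == m for o, m in zip(original_labels, masked_labels)) / len(original_labels)
    kappa = cohen_kappa_score(original_labels, masked_labels)  # chance-corrected
    return raw, kappa

# Example with labels from an emotion classifier run on both video versions.
print(agreement(["happy", "neutral", "sad"], ["happy", "neutral", "neutral"]))
```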
Privacy: We categorize our techniques into hiding and masking methods for privacy preservation. Hiding
techniques primarily focus on object detection models, such as YOLOv8, to identify silhouettes, bounding
boxes, and facial features. Their efficacy is usually assessed using the mean average precision (mAP) metric. In our case, the face-specific model achieved an mAP of 38%, and person detection models yielded mAP scores ranging from 37.3% to 53.9%, depending on the version used. Masking techniques, on the other
hand, are evaluated through re-identification tasks. Due to the complexity of these tasks, a specialized
dataset and model are often required. We fine-tuned a Vision Transformer and triplet-loss-based Re-ID
model using a subset of the Celeb-A dataset. Preliminary results indicate a low precision score of 0.0017%,
suggesting practical privacy preservation. However, these results also emphasize the need for further
research to validate whether the low precision score indicates strong privacy or a model limitation.
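As one way to operationalize such a re-identification test, the sketch below computes rank-1 precision from embeddings: masked probe faces are matched against a gallery of known identities by cosine similarity, and precision is the fraction matched to the correct identity. The embeddings are assumed to come from the fine-tuned Re-ID model; the matching protocol is an illustrative choice.

```python
# Rank-1 re-identification precision from precomputed embeddings.
import numpy as np

def rank1_precision(probe_embs, probe_ids, gallery_embs, gallery_ids) -> float:
    probe = probe_embs / np.linalg.norm(probe_embs, axis=1, keepdims=True)
    gallery = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = probe @ gallery.T                         # cosine similarities
    nearest = np.asarray(gallery_ids)[sims.argmax(axis=1)]
    # Low precision suggests masked probes are rarely matched to the true identity.
    return float(np.mean(nearest == np.asarray(probe_ids)))
```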
Audio Evaluation
We have not performed a voice masking evaluation, but we outline a planned methodology for future
research focusing on utility and privacy. Utility assessment in voice masking can be complex, as it varies by
use case. However, general metrics like Word-Error Rate (WER) and Pitch Correlation could be considered.
WER can be computed using an automatic speech recognition system (ASR) like ESPnet or OpenAI's
Whisper. Pitch Correlation, significant for fields like psycholinguistics, can be measured using the Pearson
Correlation Coefficient between the input and output signals. The Librispeech dataset, widely used in ASR
evaluations, can be harnessed for this purpose. For privacy, we propose a re-identification or automatic speaker verification (ASV) approach, using the equal error rate (EER) as the metric.
EER effectively gauges the ability to re-identify individuals in audio data and is implemented in popular
toolkits such as Kaldi. While the Librispeech dataset could potentially be used for privacy evaluation, the
suitability of other datasets should also be explored.
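The planned metrics could be computed roughly as follows, assuming ASR transcripts, aligned F0 contours, and ASV trial scores are already available; jiwer, SciPy, and scikit-learn are used here as convenient stand-ins for the toolkits mentioned above.

```python
# Illustrative metric computations for the planned audio evaluation.
import numpy as np
from jiwer import wer
from scipy.stats import pearsonr
from sklearn.metrics import roc_curve

def word_error_rate(reference_text: str, asr_hypothesis: str) -> float:
    return wer(reference_text, asr_hypothesis)

def pitch_correlation(f0_original: np.ndarray, f0_masked: np.ndarray) -> float:
    # F0 contours (e.g. from a pitch tracker) must be aligned and of equal length.
    return pearsonr(f0_original, f0_masked)[0]

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    # labels: 1 = same speaker, 0 = different speaker; scores: ASV similarity.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    return float(fpr[np.nanargmin(np.abs(fpr - fnr))])
```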
Human Evaluation
Human evaluation is pivotal in privacy and utility preservation and is the most challenging part of such
design science projects. We conducted a user study involving Ph.D. students (n=15). The students were
divided into two groups to serve as both control and treatment groups. Initially, each group was shown
unmasked celebrity videos to establish a recognition baseline. They were then shown masked videos and
asked to identify the individuals. One unmasked video was included to assess the impact of initial biasing.
The results are also visualized on the GitHub page 6. Notably, face-swapping techniques significantly reduced re-identification scores. For instance, no participant could correctly identify Jackie Chan, possibly due to
the choice of the replacement face or cultural factors. Priming also had a notable impact, as demonstrated
by the difference in identification rates for the same video of Steph Curry when presented as either masked
or unmasked. These findings underscore the necessity for further research to validate and improve upon
our masking techniques. We also conducted two more user feedback studies with researchers (n=16) at another European university. We can report that data stewards and early-career researchers registered interest in subsequently participating in a more robust evaluation of the toolkit. We have also taken up some of their feedback to streamline the masking strategies and have added a login and database-management option to manage access rights to masked data among researchers in the same faculty or lab.
Discussion
Maintaining privacy and utility in video masking is challenging and context-dependent. Different masking
and hiding techniques may excel in privacy but limit utility in specific research settings. For this reason, we
have developed MaskAnyone, a platform that combines masking and hiding techniques tailored to the specific needs of researchers and other users. This masking tool also invites a discussion and a need to
validate what combination of features constitutes a certain level of privacy and utility. In this paper, we
have also proposed novel methods for evaluating those dimensions. We hope to contribute such evidence-based measures so that masking practices become better situated within ethical and regulatory frameworks like the GDPR. A vital issue
is that it is unclear what counts as a proper reduction of identifiable information or anonymization. Human
evaluation experiments might be insightful in this regard, but they come with limitations, too. Using
celebrities to assess identifiability leverages background knowledge but likely inflates re-identification
scores relative to real-world use cases. This is because researchers with whom data is shared often do not know the persons recorded in the original data, so the risk of identification is low. Legislation in the European Union is also evolving on this matter: even where identification might be possible in principle, a reduced risk of privacy violation is implied when data users are, in practice, unlikely to ascertain someone's identity (EUR-Lex 2023). Further discussion points for our proof-of-concept evaluation
methods include that providing a larger pool of identification choices may further reduce the likelihood of
correct identification, as evidenced by our last test video, where no options were given. Important
considerations include the cost and the typically small sample sizes of the resulting human evaluations.
Finally, specific to our experiment, our small sample size limits the generalizability of our findings.
6 https://anonymous.4open.science/r/Privacy4MultimodalAnalysis-57D4/README.md
Participants also pointed to possible applications beyond research, for instance for journalists and activists, that could change surveillance practices for the better. These uncovered use cases and possible extensions indicate room for the tool's evolution to meet diverse privacy and utility needs. Since MaskAnyone is open source, the wider community is invited to shape its potential further. This first version of MaskAnyone has
several limitations. Factors such as lighting, camera angles, and the presence of multiple subjects can
influence anonymization accuracy and sometimes lead to identity leaks. Additionally, fine details like
micro-expressions may also be compromised. Another aspect to consider is the toolkit's resource
intensiveness, which could limit its accessibility. High computational demands and a 50GB storage
requirement could pose barriers, although a lightweight installation is available to mitigate these issues.
Scalability could be an issue, especially when handling large datasets or multiple users, leading to
performance bottlenecks. Ethical and legal considerations add further complexities. Potential misuse of
technology and compliance with data protection laws are issues that we are looking to address in close
engagement with data stewards and ethics personnel. Several technical enhancements are planned to refine
masking techniques and explore more robust machine-learning solutions. Such enhancements will focus
on making the toolkit more robust, interactive, and user-friendly, and we are exploring incorporating
identity and access management tools like Keycloak. We have ongoing collaborations with researchers and
data stewards so that the toolkit's development follows real-world needs and to explore how masking
toolkits can become integrated with ethical data archiving and sharing procedures. We also plan to enhance
the data utility aspect of the toolkit by incorporating modules for multimodal behavior analyses and
classification (multimodal analytics), which may include gesture detection (Ripperda et al. 2020) and analyses (Dang and Buschek 2021; Zeng et al. 2023), and a summary of prosodic markers in the audio before voice masking (ref). One other timely line of application of our toolkit relates to analyzing deep-fake videos and AI-generated content, given developments such as Sora (Ho et al. 2022; OpenAI 2024; Yan et al. 2021).
Conclusion
Audio-visual data containing human subjects are central to behavioral sciences and linguistics research,
offering rich insights into human behavior and multimodal language use (Dale 2008; Gregori et al. 2023; Kim and Adler 2015; Linzen 2020). However, while central to these fields, audio-visual data also raises ethical and privacy risks. In this context, the GDPR and other legislation play a pivotal role, providing regulatory frameworks that researchers must navigate to balance privacy concerns with the reproducibility, re-usability, and broader research utility of the data collected. Drawing on design science, we envision how
problem-centered artifacts, as instantiated in toolkits like MaskAnyone, may become integrated with ethical
application procedures for audiovisual data archiving and research at universities and research institutes.
This could mean that no more fully identifiable audiovisual data is stored than is strictly necessary for archiving purposes. It could also mean that more audiovisual data can be safely shared with other researchers, as
masking minimizes privacy risks. Furthermore, we envision that research groups also integrate masking
toolkits into their communication with other researchers at conferences, for example. Though there is no
systematic study on this, it appears not uncommon for researchers either to refrain from showing the audiovisual data their research is based on due to privacy issues, or to use various video-editing tools (e.g., Adobe Premiere Pro, PowerPoint) to add a static box over a person's face, which is a suboptimal solution given the technological advancements in computer vision. In sum, we believe that
masking toolkits like MaskAnyone can significantly improve the ecology surrounding research archiving
and open science by providing a standardized set of masking and hiding strategies. Proper integration into
research practices would mean that ethical review boards at research institutes could help advise or review
what masking strategy is optimal for a particular research context. Developing and adopting
sophisticated data masking tools like MaskAnyone is vital for advancing ethical research practices in the
digital age. These tools support compliance with stringent privacy regulations and foster a culture of
responsible data sharing, thereby enhancing the integrity and utility of research in behavioral sciences and
linguistics. The potential for such tools to be reviewed and recommended by ethical review boards further
underscores their importance in aligning technological capabilities with ethical research standards. Our
toolkit serves a diverse range of users, including journalists, activists, and lawyers, addressing both the potential for misuse, such as creating deceptive deepfakes, and the need for privacy. We aim to balance these concerns by engaging with data stewards and researchers to refine the use of our masking tool, MaskAnyone, within ethical guidelines for socio-behavioral research. Standardizing such techniques in
academic data management can enhance privacy protections, allowing for safer data sharing and alleviating
concerns associated with presenting sensitive audio-visual data. This approach encourages the use of more
effective masking methods over traditional, less secure techniques.
Impact Statement
This paper presents MaskAnyone, a de-identification toolkit poised to significantly impact the fields of
behavioral sciences, linguistics, and any social science research involving sensitive audio-visual data. By
introducing advanced, user-friendly masking technologies, MaskAnyone is designed to address pressing
ethical and privacy concerns that often impede the sharing and utilization of audio-visual data in research.
We anticipate this toolkit will enable broader, safer data-sharing practices in social science research. Early
career researchers, data stewards, and institutional review boards are the primary audiences who will
benefit from this toolkit. They will find the artifact invaluable for augmenting ethical research, enhancing
reproducibility, and fostering open science while balancing de-identification concerns with data utility.
Additionally, the toolkit's design principles and underlying requirements serve as a model for developing
similar tools across various data-sensitive fields, e.g., healthcare and surveillance, towards promoting a
culture of ethical data use. Ultimately, this paper could guide policy-making and moral standards in
research institutions, advocating for a balanced approach to privacy and utility in audio-visual research
data management.
References
Abay, N. C., Zhou, Y., Kantarcioglu, M., Thuraisingham, B., and Sweeney, L. 2019. “Privacy Preserving
Synthetic Data Release Using Deep Learning,” in Machine Learning and Knowledge Discovery in
Databases, Lecture Notes in Computer Science, M. Berlingerio, F. Bonchi, T. Gärtner, N. Hurley,
and G. Ifrim (eds.), Cham: Springer International Publishing, pp. 510–526.
(https://doi.org/10.1007/978-3-030-10925-7_31).
Abney, D. H., Dale, R., Louwerse, M. M., and Kello, C. T. 2018. “The Bursts and Lulls of Multimodal
Interaction: Temporal Distributions of Behavior Reveal Differences Between Verbal and Non-
Verbal Communication,” Cognitive Science (42:4), Wiley-Blackwell Publishing, pp. 1297–1316.
(https://doi.org/10.1111/cogs.12612).
Balci, K. 2005. “Xface: Open Source Toolkit for Creating 3D Faces of an Embodied Conversational Agent,”
Lecture Notes in Computer Science (3638), Springer Verlag, pp. 263–266.
(https://doi.org/10.1007/11536482_25/COVER).
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., and Grundmann, M. 2020. BlazePose:
On-Device Real-Time Body Pose Tracking. (https://arxiv.org/abs/2006.10204v1).
Benson, C. H., Friz, A., Mullen, S., Block, L., and Gilmore‐Bykovskyi, A. 2020. “Ethical and Methodological
Considerations for Evaluating Participant Views on Alzheimer’s and Dementia Research,” Journal
of Empirical Research on Human Research Ethics. (https://doi.org/10.1177/1556264620974898).
Bishop, L. 2009. “Ethical Sharing and Reuse of Qualitative Data,” Australian Journal of Social Issues
(44:3), pp. 255–272. (https://doi.org/10.1002/j.1839-4655.2009.tb00145.x).
Bradski, G. 2000. “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools.
cgtinker. 2021. Cgtinker/BlendArMocap: Realtime Motion Tracking in Blender Using Mediapipe and
Rigify. (https://github.com/cgtinker/BlendArMocap).
Cienki, A. 2016. “Cognitive Linguistics, Gesture Studies, and Multimodal Communication,” Cognitive
Linguistics (27:4), Walter de Gruyter GmbH, pp. 603–618. (https://doi.org/10.1515/cog-2016-
0063).
Dang, H., and Buschek, D. 2021. “GestureMap: Supporting Visual Analytics and Quantitative Analysis of
Motion Elicitation Data by Learning 2D Embeddings,” in Proceedings of the 2021 CHI Conference
on Human Factors in Computing Systems, , May 6, pp. 1–12.
(https://doi.org/10.1145/3411764.3445765).
D’Errico, F., Poggi, I., Vinciarelli, A., and Vincze, L. 2015. “Conflict and Multimodal Communication: Social
Research and Machine Intelligence,” Confl. and Multimodal Communication: Soc. Research and
Machine Intelligence, Conflict and Multimodal Communication: Social Research and Machine
Intelligence, Springer International Publishing, p. 479. (https://doi.org/10.1007/978-3-319-
14081-0).
Fang, F., Wang, X., Yamagishi, J., Echizen, I., Todisco, M., Evans, N., and Bonastre, J.-F. 2019. Speaker
Anonymization Using X-Vector and Neural Waveform Models, International Speech
Communication Association, pp. 155–160. (https://doi.org/10.21437/SSW.2019-28).
Gafni, O., Wolf, L., and Taigman, Y. 2019. “Live Face De-Identification in Video,” Proceedings of the IEEE
International Conference on Computer Vision (2019-October), Institute of Electrical and
Electronics Engineers Inc., pp. 9377–9386. (https://doi.org/10.1109/ICCV.2019.00947).
Gregori, A., Amici, F., Brilmayer, I., Ćwiek, A., Fritzsche, L., Fuchs, S., Henlein, A., Herbort, O., Kügler, F.,
Lemanski, J., Liebal, K., Lücking, A., Mehler, A., Nguyen, K. T., Pouw, W., Prieto, P., Rohrer, P. L.,
Sánchez-Ramón, P. G., Schulte-Rüther, M., Schumacher, P. B., Schweinberger, S. R., Struckmeier,
V., Trettenbrein, P. C., and von Eiff, C. I. 2023. “A Roadmap for Technological Innovation
in Multimodal Communication Research,” in Lect. Notes Comput. Sci. (Vol. 14029 LNCS), Duffy
V.G. (ed.), Springer Science and Business Media Deutschland GmbH, pp. 402–438.
(https://doi.org/10.1007/978-3-031-35748-0_30).
Grishchenko, I., and Bazarevsky, V. 2020. MediaPipe Holistic — Simultaneous Face, Hand and Pose
Prediction, on Device – Google Research Blog.
(https://blog.research.google/2020/12/mediapipe-holistic-simultaneous-face.html?m=1).
Hasan, R., Hassan, E., Li, Y., Caine, K., Crandall, D. J., Hoyle, R., and Kapadia, A. 2018. “Viewer Experience
of Obscuring Scene Elements in Photos to Enhance Privacy,” in Conference on Human Factors in
Computing Systems - Proceedings (Vol. 2018-April). (https://doi.org/10.1145/3173574.3173621).
Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. “Design Science in Information Systems Research,”
MIS Quarterly (28:1), Management Information Systems Research Center, University of
Minnesota, pp. 75–105. (https://doi.org/10.2307/25148625).
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet,
D. J., and Salimans, T. 2022. Imagen Video: High Definition Video Generation with Diffusion
Models, arXiv. (https://doi.org/10.48550/arXiv.2210.02303).
Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. 2021. “HuBERT:
Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,”
IEEE/ACM Transactions on Audio Speech and Language Processing (29), Institute of Electrical
and Electronics Engineers Inc., pp. 3451–3460. (https://doi.org/10.1109/TASLP.2021.3122291).
InsightFace. 2023. InsightFace: An Open Source 2D&3D Deep Face Analysis Library.
(https://insightface.ai/).
Jarolimkova, A., and Drobikova, B. 2019. “Data Sharing in Social Sciences: Case Study on Charles
University,” in Information Literacy in Everyday Life, Communications in Computer and Information Science, Cham: Springer International Publishing.
Jeng, W., He, D., and Oh, J. S. 2016. “Toward a Conceptual Framework for Data Sharing Practices in Social
Sciences: A Profile Approach,” Proceedings of the Association for Information Science and
Technology. (https://doi.org/10.1002/pra2.2016.14505301037).
Johannesson, P., and Perjons, E. 2021. “Ethics and Design Science,” in An Introduction to Design Science,
P. Johannesson and E. Perjons (eds.), Cham: Springer International Publishing, pp. 185–197.
(https://doi.org/10.1007/978-3-030-78132-3_13).
Kasturi, R., and Ekambaram, R. 2014. “Person Reidentification and Recognition in Video,” Lecture Notes
in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics) (8827), Springer Verlag, pp. 280–293. (https://doi.org/10.1007/978-3-
319-12568-8_35/COVER).
Khasbage, Y., Carrión, D. A., Hinnell, J., Robertson, F., Singla, K., Uhrig, P., and Turner, M. 2022. “The Red
Hen Anonymizer and the Red Hen Protocol for De-Identifying Audiovisual Recordings,”
Linguistics Vanguard, Walter de Gruyter GmbH. (https://doi.org/10.1515/LINGVAN-2022-
0017/MACHINEREADABLECITATION/RIS).
Kim, J., Kong, J., and Son, J. 2021. “Conditional Variational Autoencoder with Adversarial Learning for
End-to-End Text-to-Speech,” Proceedings of Machine Learning Research (139), ML Research
Press, pp. 5530–5540.
Kim, J. W., Salamon, J., Li, P., and Bello, J. P. 2018. “Crepe: A Convolutional Representation for Pitch
Estimation,” ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
- Proceedings (2018-April), Institute of Electrical and Electronics Engineers Inc., pp. 161–165.
(https://doi.org/10.1109/ICASSP.2018.8461329).
Kong, J., Kim, J., and Bae, J. 2020. “HiFi-GAN: Generative Adversarial Networks for Efficient and High
Fidelity Speech Synthesis,” Advances in Neural Information Processing Systems (2020-
December), Neural information processing systems foundation.
(https://arxiv.org/abs/2010.05646v2).
Korshunov, P., and Marcel, S. 2018. DeepFakes: A New Threat to Face Recognition? Assessment and
Detection. (https://arxiv.org/abs/1812.08685v1).
Lander, K., Bruce, V., and Hill, H. 2001. “Evaluating the Effectiveness of Pixelation and Blurring on
Masking the Identity of Familiar Faces,” Applied Cognitive Psychology (15:1).
(https://doi.org/10.1002/1099-0720(200101/02)15:1<101::AID-ACP697>3.0.CO;2-7).
Liu, K., Perov, I., Gao, D., Chervoniy, N., Zhou, W., and Zhang, W. 2023. “DeepFaceLab: Integrated, Flexible and Extensible Face-Swapping Framework,” Pattern Recognition (141), Elsevier Ltd, p. 109628. (https://doi.org/10.1016/J.PATCOG.2023.109628).
Mroz, S., Baddour, N., McGuirk, C., Juneau, P., Tu, A., Cheung, K., and Lemaire, E. 2021. “Comparing the
Quality of Human Pose Estimation with BlazePose or OpenPose,” BioSMART 2021 - Proceedings:
4th International Conference on Bio-Engineering for Smart Technologies, Institute of Electrical
and Electronics Engineers Inc. (https://doi.org/10.1109/BIOSMART54244.2021.9677850).
Nautsch, A., Jasserand, C., Kindt, E., Todisco, M., Trancoso, I., and Evans, N. 2019. “The GDPR & Speech
Data: Reflections of Legal and Technology Communities, First Steps towards a Common
Understanding,” Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH (2019-September), International Speech
Communication Association, pp. 3695–3699. (https://doi.org/10.21437/INTERSPEECH.2019-
2647).
Nirkin, Y., Keller, Y., and Hassner, T. 2019. “FSGAN: Subject Agnostic Face Swapping and Reenactment,”
Proceedings of the IEEE International Conference on Computer Vision (2019-October), Institute
of Electrical and Electronics Engineers Inc., pp. 7183–7192.
(https://doi.org/10.1109/ICCV.2019.00728).
Owoyele, B., Trujillo, J., Melo, G. de, and Pouw, W. 2022. “Masked-Piper: Masking Personal Identities in
Visual Recordings While Preserving Multimodal Information,” SoftwareX (20), Elsevier B.V., p.
101236. (https://doi.org/10.1016/j.softx.2022.101236).
Patino, J., Tomashenko, N., Todisco, M., Nautsch, A., and Evans, N. 2020. “Speaker Anonymisation Using
the McAdams Coefficient,” Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH (3), International Speech Communication
Association, pp. 1958–1962. (https://doi.org/10.21437/Interspeech.2021-1070).
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A. A., Tzionas, Di., and Black, M. J. 2019.
“Expressive Body Capture: 3D Hands, Face, and Body from a Single Image,” Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2019-June),
IEEE Computer Society, pp. 10967–10977. (https://doi.org/10.1109/CVPR.2019.01123).
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. 2016. “You Only Look Once: Unified, Real-Time
Object Detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (Vol. 2016-December). (https://doi.org/10.1109/CVPR.2016.91).
Resnik, D., Antes, A., and Mozersky, J. 2024. “Should Researchers Destroy Audio or Video Recordings?,”
Ethics & Human Research (46), pp. 30–35. (https://doi.org/10.1002/eahr.500205).
Reynolds, D. 2015. “Gaussian Mixture Models,” Encyclopedia of Biometrics, Springer, Boston, MA, pp.
827–832. (https://doi.org/10.1007/978-1-4899-7488-4_196).
Ripperda, J., Drijvers, L., and Holler, J. 2020. “Speeding up the Detection of Non-Iconic and Iconic
Gestures (SPUDNIG): A Toolkit for the Automatic Detection of Hand Movements and Gestures in
Video Data,” Behavior Research Methods (52:4), Springer, pp. 1783–1794.
(https://doi.org/10.3758/s13428-020-01350-2).
“Roadmaps from the Three Thematic DCCs – Digital Competence Centres | NWO.” (n.d.).
(https://www.nwo.nl/en/researchprogrammes/implementation-plan-investments-digital-
research-infrastructure/roadmaps-three, accessed April 30, 2024).
Rokoko. 2023. Intuitive and Affordable Motion Capture Tools for Character Animation.
(https://www.rokoko.com/).
Su, Z., Liu, W., Yu, Z., Hu, D., Liao, Q., Tian, Q., Pietikäinen, M., and Liu, L. 2021. “Pixel Difference
Networks for Efficient Edge Detection,” in Proceedings of the IEEE International Conference on
Computer Vision. (https://doi.org/10.1109/ICCV48922.2021.00507).
Tomashenko, N., Srivastava, B. M. L., Wang, X., Vincent, E., Nautsch, A., Yamagishi, J., Evans, N., Patino,
J., Bonastre, J. F., Noé, P. G., and Todisco, M. 2020. “Introducing the VoicePrivacy Initiative,”
Proceedings of the Annual Conference of the International Speech Communication Association,
INTERSPEECH (2020-October), International Speech Communication Association, pp. 1693–
1697. (https://doi.org/10.21437/INTERSPEECH.2020-1333).
Tomashenko, N., Wang, X., Vincent, E., Patino, J., Srivastava, B. M. L., Noé, P. G., Nautsch, A., Evans, N.,
Yamagishi, J., O’Brien, B., Chanclu, A., Bonastre, J. F., Todisco, M., and Maouche, M. 2022. “The
VoicePrivacy 2020 Challenge: Results and Findings,” Computer Speech & Language (74),
Academic Press, p. 101362. (https://doi.org/10.1016/J.CSL.2022.101362).
Wang, R., Ding, Y., Li, L., and Fan, C. 2020. “One-Shot Voice Conversion Using Star-Gan,” ICASSP, IEEE
International Conference on Acoustics, Speech and Signal Processing - Proceedings (2020-May),
Institute of Electrical and Electronics Engineers Inc., pp. 7729–7733.
(https://doi.org/10.1109/ICASSP40776.2020.9053842).
Xu, Y., Zhang, J., Zhang, Q., and Tao, D. 2022. “ViTPose: Simple Vision Transformer Baselines for Human
Pose Estimation,” Advances in Neural Information Processing Systems (35), Neural information
processing systems foundation. (https://arxiv.org/abs/2204.12484v3).
Yan, W., Zhang, Y., Abbeel, P., and Srinivas, A. 2021. VideoGPT: Video Generation Using VQ-VAE and
Transformers, arXiv. (https://doi.org/10.48550/arXiv.2104.10157).
Yu, T., Feng, Runseng, Feng, Ruoyu, Liu, J., Jin, X., Zeng, W., and Chen, Z. 2023. Inpaint Anything:
Segment Anything Meets Image Inpainting. (https://arxiv.org/abs/2304.06790v1).
Zeng, H., Wang, X., Wang, Y., Wu, A., Pong, T. C., and Qu, H. 2023. “GestureLens: Visual Analysis of
Gestures in Presentation Videos,” IEEE Transactions on Visualization and Computer Graphics
(29:8), pp. 3685–3697. (https://doi.org/10.1109/TVCG.2022.3169175).
Zeng, Y., Fu, J., and Chao, H. 2020. “Learning Joint Spatial-Temporal Transformations for Video
Inpainting,” Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics) (12361 LNCS), Springer Science and Business
Media Deutschland GmbH, pp. 528–543. (https://doi.org/10.1007/978-3-030-58517-4_31).
Zhao, J., and Zhang, H. 2022. “Thin-Plate Spline Motion Model for Image Animation,” Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2022-June),
IEEE Computer Society, pp. 3647–3656. (https://doi.org/10.1109/CVPR52688.2022.00364).