
Journal of Machine Learning Research 24 (2023) 1-7 Submitted 9/22; Revised 4/23; Published 5/23

MultiZoo & MultiBench:
A Standardized Toolkit for Multimodal Deep Learning

Paul Pu Liang, Yiwei Lyu, Xiang Fan {pliang, ylyu1, xiangfan}@cs.cmu.edu
Arav Agarwal, Yun Cheng {arava, yuncheng}@cs.cmu.edu
Louis-Philippe Morency, Ruslan Salakhutdinov {morency, rsalakhu}@cs.cmu.edu
Machine Learning Department and Language Technologies Institute, Carnegie Mellon University

Editor: Antti Honkela


Abstract
Learning multimodal representations involves integrating information from multiple heterogeneous
sources of data. In order to accelerate progress towards understudied modalities and tasks while
ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized
implementations of more than 20 core multimodal algorithms, and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide
an automated end-to-end machine learning pipeline that simplifies and standardizes data loading,
experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive
methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness.
MultiBench paves the way towards a better understanding of the capabilities and limitations of
multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are
publicly available, will be regularly updated, and welcome inputs from the community.1
Code: https://github.com/pliang279/MultiBench
Documentation: https://multibench.readthedocs.io/en/latest/
Keywords: Multimodal learning, Representation learning, Benchmarks, Open Source Software

1. Introduction
The research field of multimodal machine learning (ML) brings unique challenges for both computational and theoretical research given the heterogeneity of various data sources (Baltrušaitis et al., 2018; Liang et al., 2022). At its core lies the learning of multimodal representations that capture correspondences between modalities for prediction, and it has emerged as a vibrant interdisciplinary field of immense importance and extraordinary potential in multimedia (Naphade et al., 2006; Liang et al., 2023), affective computing (Liang et al., 2019; Poria et al., 2017), robotics (Kirchner et al., 2019; Lee et al., 2019), finance (Hollerer et al., 2018), dialogue (Pittermann et al., 2010), human-computer interaction (Dumas et al., 2009; Obrenovic and Starcevic, 2004), and healthcare (Frantzidis et al., 2010; Xu et al., 2019). To accelerate research in building general-purpose multimodal models across diverse research areas, modalities, and tasks, we contribute MultiBench (Figure 1), a systematic and unified large-scale benchmark that brings us closer to the requirements of real-world multimodal applications. MultiBench contains a diverse set of 15 datasets spanning 10 modalities and testing for 20 prediction tasks across 6 distinct research areas, and is designed to comprehensively evaluate generalization across domains and modalities, complexity during training and inference, and robustness to noisy and missing modalities. Additionally, we release MultiZoo, a public toolkit consisting of standardized implementations of more than 20 core multimodal algorithms in a modular fashion to enable accessibility for new researchers, compositionality of approaches, and reproducibility of results. Together, these public resources ensure ease of use, accessibility, and reproducibility, and they will be continually expanded through courses, workshops, and competitions around the world.
1. MultiBench was previously published at NeurIPS 2021 (Liang et al., 2021), although the datasets and algorithms
were the central contributions of that publication, not the software. This paper focuses on the open-source software
along with a larger collection of datasets, algorithms, and evaluation metrics.

©2023 Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, Ruslan Salakhutdinov.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/v24/22-1021.html.

Figure 1: MultiBench contains a diverse set of 15 datasets spanning 10 modalities (language, image, video, audio, time-series, tabular, force sensors, proprioception, sets, and optical flow) and testing for more than 20 prediction tasks across 6 distinct research areas (affective computing, healthcare, robotics, finance, HCI, and multimedia), and enables standardized, reliable, and reproducible large-scale benchmarking of multimodal models for performance, complexity, and robustness.

Figure 2: Our MultiBench toolkit provides a machine learning pipeline (MultiBench datasets → MultiBench data loader → MultiZoo models → MultiBench evaluator → MultiBench leaderboard) across data processing, data loading, multimodal models, evaluation metrics, and a public leaderboard to encourage accessible, standardized, and reproducible research in multimodal representation learning.


2. MultiBench and MultiZoo


MultiBench provides a standardized machine learning pipeline spanning data loading, multimodal modeling, evaluation metrics, and a public leaderboard to encourage future research in multimodal representation learning (see Figure 2).
MultiBench datasets: Table 1 gives an overview of the 15 datasets provided in MultiBench, which cover 6 research areas (multimedia, affective computing, robotics, finance, human-computer interaction, and healthcare), 10 modalities, and 20 prediction tasks.
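As a minimal illustration of the data-loading step, the sketch below loads one MultiBench dataset with the get_dataloader helper used later in Algorithm 1 and inspects a single batch; the dataset key 'multimodal_imdb' is taken from that snippet, while the assumed batch layout (a list of per-modality tensors followed by a label tensor) may differ from the released loaders.

from datasets.get_data import get_dataloader
# load the Multimodal IMDB dataset (dataset key taken from Algorithm 1)
traindata, validdata, testdata = get_dataloader('multimodal_imdb')
# inspect one training batch; the layout below is an assumption:
# [modality_1, ..., modality_k, labels]
batch = next(iter(traindata))
*modalities, labels = batch
for i, m in enumerate(modalities):
    print(f"modality {i}: shape {tuple(m.shape)}")
print(f"labels: shape {tuple(labels.shape)}")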
MultiZoo, a zoo of multimodal algorithms: To complement MultiBench, we release a comprehensive toolkit, MultiZoo, as starter code for multimodal algorithms, which implements 20 methods spanning different methodological innovations in (1) data preprocessing, (2) fusion paradigms, (3) optimization objectives, and (4) training procedures (see Figure 3). Each of these algorithms is chosen because it provides a unique perspective on the technical challenges in multimodal learning (Baltrušaitis et al., 2018) (see Table 2 for details).
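Because each component is implemented as an interchangeable module, changing the fusion paradigm (or any other module) amounts to swapping a single object while the encoders and classifier stay fixed. The sketch below illustrates this composition pattern; the fusion class names Concat and LowRankTensorFusion are assumptions about the toolkit's naming (only MultInteractions appears in Algorithm 1) and should be read as placeholders.

# Sketch of MultiZoo's modular composition: the same unimodal encoders and
# classifier head can be paired with different fusion paradigms.
# Concat and LowRankTensorFusion are assumed/placeholder class names.
from unimodals.common_models import ResNet, Transformer, MLP
from fusions.common_fusions import Concat, LowRankTensorFusion

out_channels = 3
encoders = [ResNet(in_channels=1, out_channels=out_channels, layers=5),
            Transformer(in_channels=1, out_channels=out_channels, layers=3)]
classifier = MLP(out_channels * 32, 100, labels=23)
# late fusion by concatenation ...
fusion_a = Concat()
# ... or a low-rank tensor fusion, with everything else left unchanged
fusion_b = LowRankTensorFusion([out_channels * 8, out_channels * 32],
                               out_channels * 32, rank=16)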
Evaluation protocol: MultiBench contains evaluation scripts for the following holistic desiderata in multimodal learning. (1) Performance: we standardize evaluation using MSE and MAE for regression, as well as accuracy, micro & macro F1-score, and AUPRC for classification. (2) Complexity: we record the amount of information taken in, measured in bits (i.e., data size), the number of model parameters, and the time and memory resources required during the entire training process. Real-world models may also need to be small and compact enough to run on mobile devices (Radu et al., 2016), so we also report inference time and memory on CPU and GPU. The datasets and models included are designed to span a range of compute times from 1 minute to 6 hours, memory from 2GB to 12GB, models from 0.01 million to 280 million parameters, and datasets from 690 to 147,000 samples. (3) Robustness: the toolkit includes both modality-specific imperfections, taking into account each modality's unique noise topologies (i.e., flips and crops of images, natural misspellings in text, abbreviations in spoken audio), and multimodal imperfections across modalities (e.g., missing modalities, or a chunk of time missing in time-series data) (Liang et al., 2019; Pham et al., 2019).
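To make the robustness protocol concrete, the sketch below shows the flavor of such perturbations as standalone PyTorch/Python functions (additive image noise, simulated misspellings, and a dropped modality); these are illustrative stand-ins under stated assumptions, not the toolkit's own robustness utilities, whose exact API is not shown in this paper.

import random
import torch

def perturb_image(img, noise_std=0.1):
    # modality-specific imperfection for images: additive Gaussian noise
    return img + noise_std * torch.randn_like(img)

def perturb_text(tokens, swap_prob=0.1):
    # modality-specific imperfection for text: simulate natural misspellings
    # by swapping two adjacent characters in randomly chosen tokens
    out = []
    for tok in tokens:
        if len(tok) > 3 and random.random() < swap_prob:
            i = random.randrange(len(tok) - 1)
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return out

def drop_modality(modalities, index):
    # multimodal imperfection: replace one modality with zeros (missing modality)
    dropped = list(modalities)
    dropped[index] = torch.zeros_like(dropped[index])
    return dropped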


Table 1: MultiBench provides a comprehensive suite of 15 datasets covering a diverse range of 6 research areas, dataset sizes, 10 input modalities (ℓ: language, i: image, v: video, a: audio, t: time-series, ta: tabular, f: force sensor, p: proprioception sensor, s: set, o: optical flow), and 20 prediction tasks.
Research Area       | Size | Dataset                           | Modalities | # Samples | Prediction task
Affective Computing | S    | MUStARD (Castro et al., 2019)     | {ℓ, v, a}  | 690       | sarcasm
Affective Computing | M    | CMU-MOSI (Zadeh et al., 2016)     | {ℓ, v, a}  | 2,199     | sentiment
Affective Computing | L    | UR-FUNNY (Hasan et al., 2019)     | {ℓ, v, a}  | 16,514    | humor
Affective Computing | L    | CMU-MOSEI (Zadeh et al., 2018)    | {ℓ, v, a}  | 22,777    | sentiment, emotions
Healthcare          | L    | MIMIC (Johnson et al., 2016)      | {t, ta}    | 36,212    | mortality, ICD-9 codes
Robotics            | M    | MuJoCo Push (Lee et al., 2020)    | {i, f, p}  | 37,990    | object pose
Robotics            | L    | Vision&Touch (Lee et al., 2019)   | {i, f, p}  | 147,000   | contact, robot pose
Finance             | M    | STOCKS-F&B                        | {t × 18}   | 5,218     | stock price, volatility
Finance             | M    | STOCKS-HEALTH                     | {t × 63}   | 5,218     | stock price, volatility
Finance             | M    | STOCKS-TECH                       | {t × 100}  | 5,218     | stock price, volatility
HCI                 | S    | ENRICO (Leiva et al., 2020)       | {i, s}     | 1,460     | design interface
Multimedia          | M    | MM-IMDb (Arevalo et al., 2017)    | {ℓ, i}     | 25,959    | movie genre
Multimedia          | M    | AV-MNIST (Vielzeuf et al., 2018)  | {i, a}     | 70,000    | digit
Multimedia          | L    | Kinetics400 (Kay et al., 2017)    | {v, a, o}  | 306,245   | human action

Figure 3: MultiZoo provides a standardized implementation of multimodal methods (data preprocessing, unimodal models, fusion paradigms, optimization objectives, and training procedures) in a modular fashion to enable accessibility for new researchers, compositionality of approaches, and reproducibility of results.

Installation, testing, and integration: Our documentation provides installation instructions for Linux, macOS, and Windows. We also include a suite of unit tests (testing self-contained functions) and integration tests (testing multiple components across the unimodal, fusion, and training-loop modules together), with 100% coverage of self-contained functions and 88% coverage overall including integration tests. We also include instructions for continuous integration: our software is hosted on GitHub, which enables version control and integration via pull requests and merges. We enable GitHub Actions workflows, which automatically run the test builds whenever new changes are pushed. After making the desired changes and ensuring all tests pass, users can create a pull request, and the authors will merge these changes into the main branch.
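As an illustration of the kind of self-contained unit test in such a suite, the pytest-style sketch below checks that a fusion module preserves the batch dimension and produces the expected output size; it reuses the module and constructor arguments from Algorithm 1, but the call convention (a list of unimodal representations) is an assumption and the tests actually shipped in the repository may be organized differently.

# Hypothetical pytest-style unit test, modeled on the modules in Algorithm 1.
import torch
from fusions.common_fusions import MultInteractions

def test_fusion_output_shape():
    batch_size, out_channels = 4, 3
    fusion = MultInteractions([out_channels * 8, out_channels * 32],
                              out_channels * 32, 'matrix')
    x1 = torch.randn(batch_size, out_channels * 8)
    x2 = torch.randn(batch_size, out_channels * 32)
    # assumes the fusion module is called on a list of unimodal representations
    fused = fusion([x1, x2])
    assert fused.shape == (batch_size, out_channels * 32)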
Together: In Algorithm 1, we show a sample Python code snippet that loads a dataset, defines the unimodal encoders, multimodal fusion architecture, optimization objective, and training procedure, and then runs the evaluation protocol. Our toolkit is easy to use and trains models in less than 10 lines of code. By standardizing the implementation of each module and disentangling individual modules, optimizations, and training, MultiZoo ensures accessibility and reproducibility of its algorithms.


Table 2: MultiZoo provides a standardized implementation of the following multimodal methods spanning data processing, fusion paradigms, optimization objectives, and training procedures, which offer complementary perspectives towards tackling multimodal challenges in alignment, complementarity, and robustness.

Category  | Method                                                   | Alignment | Complementarity | Robustness
Data      | WordAlign (Chen et al., 2017)                            | ✓         | ✗               | ✗
Model     | EF, LF (Baltrušaitis et al., 2018)                       | ✗         | ✓               | ✗
Model     | TF (Zadeh et al., 2017), LRTF (Liu et al., 2018)         | ✗         | ✓               | ✗
Model     | MI-Matrix, MI-Vector, MI-Scalar (Jayakumar et al., 2020) | ✗         | ✓               | ✗
Model     | NL Gate (Wang et al., 2020)                              | ✗         | ✓               | ✗
Model     | MulT (Tsai et al., 2019a)                                | ✓         | ✓               | ✗
Model     | MFAS (Pérez-Rúa et al., 2019)                            | ✗         | ✓               | ✗
Objective | CCA (Andrew et al., 2013)                                | ✓         | ✗               | ✗
Objective | ReFNet (Sankaran et al., 2021)                           | ✓         | ✗               | ✗
Objective | MFM (Tsai et al., 2019b)                                 | ✗         | ✓               | ✗
Objective | MVAE (Wu and Goodman, 2018)                              | ✗         | ✓               | ✗
Objective | MCTN (Pham et al., 2019)                                 | ✗         | ✗               | ✓
Training  | GradBlend (Wang et al., 2020)                            | ✗         | ✓               | ✓
Training  | RMFE (Gat et al., 2020)                                  | ✗         | ✓               | ✓

Algorithm 1: PyTorch code integrating MultiBench datasets and MultiZoo models.
import torch
from datasets.get_data import get_dataloader
from unimodals.common_models import ResNet, Transformer, MLP
from fusions.common_fusions import MultInteractions
from training_structures.gradient_blend import train, test
# load Multimodal IMDB dataset
traindata, validdata, testdata = get_dataloader('multimodal_imdb')
out_channels = 3
# define ResNet and Transformer unimodal encoders
encoders = [ResNet(in_channels=1, out_channels=3, layers=5),
            Transformer(in_channels=1, out_channels=3, layers=3)]
# define a Multiplicative Interactions fusion layer
fusion = MultInteractions([out_channels*8, out_channels*32], out_channels*32, 'matrix')
# define a classification head
classifier = MLP(out_channels*32, 100, labels=23)
# train using the Gradient Blend algorithm
model = train(encoders, fusion, classifier, traindata, validdata,
              epochs=100, optimtype=torch.optim.SGD, lr=0.01, weight_decay=0.0001)
# test for performance, complexity, and robustness
performance, complexity, robustness = test(model, testdata)
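For the complexity desideratum in the evaluation protocol, parameter counts and inference cost can also be estimated outside the toolkit; the rough, standalone sketch below would work for any trained PyTorch model and is not the implementation behind the complexity value returned by test above (the assumed forward signature, a list of per-modality tensors, may differ from the toolkit's models).

import time
import torch

def count_parameters(model):
    # total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_inference(model, inputs, device='cpu', runs=20):
    # average forward-pass time in seconds on the given device;
    # assumes the model takes a list of per-modality tensors
    model = model.to(device).eval()
    inputs = [x.to(device) for x in inputs]
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(runs):
            model(inputs)
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs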

3. Results
MultiZoo and MultiBench enable quick experimentation with multimodal algorithms for performance while balancing complexity and robustness. They uncover several shortcomings of current models, including poor generalization to out-of-domain tasks, tradeoffs between performance and efficiency, and lack of robustness to real-world imperfections. Our resources also pave the way toward answering novel research questions in multimodal transfer learning, multi-task learning, co-learning, pre-training, and interpretability. We include these results and discussions in our full paper (Liang et al., 2021), as well as scripts to reproduce them in the MultiBench software.

4. Conclusion
In conclusion, we present MultiZoo and MultiBench, a large-scale open-source toolkit unifying previously disjoint efforts in multimodal research with a focus on ease of use, accessibility, and reproducibility, thereby enabling a deeper understanding of multimodal models. Through its unprecedented range of research areas, datasets, modalities, tasks, and evaluation metrics, our toolkit paves the way toward building more generalizable, lightweight, and robust multimodal models.


Acknowledgements
This material is based upon work partially supported by Meta, National Science Foundation awards
1722822 and 1750439, and National Institutes of Health awards R01MH125740, R01MH132225,
R01MH096951 and R21MH130767. PPL is partially supported by a Facebook PhD Fellowship and a
Carnegie Mellon University Center for Machine Learning and Health Fellowship. RS is supported
in part by ONR N000141812861, ONR N000142312368 and DARPA/AFRL FA87502321015.
Any opinions, findings, conclusions, or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the NSF, NIH, Meta, Carnegie Mellon
University’s Center for Machine Learning and Health, ONR, DARPA, or AFRL, and no official
endorsement should be inferred. We are extremely grateful to the editor and reviewers for their
valuable feedback. Finally, we acknowledge NVIDIA’s GPU support.

References
Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International
conference on machine learning, pages 1247–1255. PMLR, 2013.

John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. Gated multimodal units for information
fusion. In 5th International conference on learning representations 2017 workshop, 2017.

Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy.
IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.

Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria.
Towards multimodal sarcasm detection (an _obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pages 4619–4629, 2019.

Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. Multimodal
sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International
Conference on Multimodal Interaction, pages 163–171, 2017.

Bruno Dumas, Denis Lalanne, and Sharon Oviatt. Multimodal interfaces: A survey of principles, models and frameworks.
In Human machine interaction, pages 3–26. Springer, 2009.

C. A. Frantzidis, C. Bratsas, M. A. Klados, E. Konstantinidis, C. D. Lithari, A. B. Vivas, C. L. Papadelis, E. Kaldoudi,
C. Pappas, and P. D. Bamidis. On the classification of emotional biosignals evoked while viewing affective pictures: An
integrated data-mining-based approach for healthcare applications. IEEE Transactions on Information Technology in
Biomedicine, March 2010.

Itai Gat, Idan Schwartz, Alexander Schwing, and Tamir Hazan. Removing bias in multi-modal classifiers: Regularization
by maximizing functional entropies. Advances in Neural Information Processing Systems, 33, 2020.

Md Kamrul Hasan, Wasifur Rahman, AmirAli Bagher Zadeh, Jianyuan Zhong, Md Iftekhar Tanveer, Louis-Philippe
Morency, and Mohammed Ehsan Hoque. Ur-funny: A multimodal language dataset for understanding humor. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2046–2056, 2019.

Markus A. Hollerer, Dennis Jancsary, and Maria Grafstrom. A picture is worth a thousand words: Multimodal sensemaking
of the global financial crisis. Organization Studies, 2018.

Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye
Teh, Tim Harley, and Razvan Pascanu. Multiplicative interactions and where to find them. In International Conference
on Learning Representations, 2020. URL https://openreview.net/forum?id=rylnK6VtDH.


Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin
Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database.
Scientific data, 3(1):1–9, 2016.

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim
Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950,
2017.

Elsa A Kirchner, Stephen H Fairclough, and Frank Kirchner. Embedded multimodal interfaces in robotics: applications,
future trends, and societal implications. In The Handbook of Multimodal-Multisensor Interfaces: Language Processing,
Software, Commercialization, and Emerging Directions-Volume 3, pages 523–576. 2019.

Michelle A Lee, Yuke Zhu, Krishnan Srinivasan, Parth Shah, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette
Bohg. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks.
In 2019 International Conference on Robotics and Automation (ICRA), pages 8943–8950. IEEE, 2019.

Michelle A Lee, Brent Yi, Roberto Martín-Martín, Silvio Savarese, and Jeannette Bohg. Multimodal sensor fusion with
differentiable filters. IROS, 2020.

Luis A Leiva, Asutosh Hota, and Antti Oulasvirta. Enrico: A dataset for topic modeling of mobile ui designs. In 22nd
International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI’20 Extended
Abstracts), 2020.

Paul Pu Liang, Zhun Liu, Yao-Hung Hubert Tsai, Qibin Zhao, Ruslan Salakhutdinov, and Louis-Philippe Morency.
Learning representations from imperfect time series data via tensor rank regularization. In ACL, 2019.

Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Yufan Chen, Peter Wu, Michelle A Lee,
Yuke Zhu, et al. Multibench: Multiscale benchmarks for multimodal representation learning. In Thirty-fifth Conference
on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Foundations and recent trends in multimodal machine learning:
Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430, 2022.

Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency,
and Russ Salakhutdinov. High-modality multimodal transformer: Quantifying modality & interaction heterogeneity for
high-modality representation learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL
https://openreview.net/forum?id=ttzypy3kT7.

Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, AmirAli Bagher Zadeh, and Louis-Philippe
Morency. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2247–2256, 2018.

M. Naphade, J. R. Smith, J. Tesic, Shih-Fu Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept
ontology for multimedia. IEEE MultiMedia, 2006.

Zeljko Obrenovic and Dusan Starcevic. Modeling multimodal human-computer interaction. Computer, 37(9):65–72, 2004.

Juan-Manuel Pérez-Rúa, Valentin Vielzeuf, Stéphane Pateux, Moez Baccouche, and Frédéric Jurie. Mfas: Multimodal
fusion architecture search. In Proceedings of the IEEE Conference on computer vision and pattern recognition, pages
6966–6975, 2019.

Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. Found in translation: Learning
robust joint representations by cyclic translations between modalities. In AAAI, 2019.

Johannes Pittermann, Angela Pittermann, and Wolfgang Minker. Emotion recognition and adaptation in spoken dialogue
systems. International Journal of Speech Technology, 2010.

Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From unimodal analysis
to multimodal fusion. Information Fusion, 2017.


Valentin Radu, Nicholas D Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar. Towards
multimodal deep learning for activity recognition on mobile devices. In Proceedings of the 2016 ACM International
Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, pages 185–188, 2016.

Sethuraman Sankaran, David Yang, and Ser-Nam Lim. Multimodal fusion refiner networks. arXiv preprint
arXiv:2104.03435, 2021.

Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov.
Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pages 6558–6569, 2019a.

Yao-Hung Hubert Tsai, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency, and Ruslan Salakhutdinov. Learning
factorized multimodal representations. In ICLR, 2019b.

Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. Centralnet: a multilayer approach for multimodal
fusion, 2018.

Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12695–12705, 2020.

Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in
Neural Information Processing Systems, pages 5575–5585, 2018.

Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band, Piyush Mathur, Frank Papay, Ashish K Khanna, Jacek B
Cywinski, Kamal Maheshwari, et al. Multimodal machine learning for automated icd coding. In Machine Learning for
Healthcare Conference, pages 197–215. PMLR, 2019.

Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. Mosi: multimodal corpus of sentiment intensity and
subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259, 2016.

Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for
multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language
Processing, pages 1103–1114, 2017.

AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language
analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In ACL, 2018.
