Paper-Discussion:
AnyGPT: Unified Multimodal LLM with
Discrete Sequence Modeling
Paper source: https://arxiv.org/abs/2402.12226
Presenter: Haoxiang Shi
Why Multimodal?
Video Sampling
Forms of commercial applications of multimodal LLMs
• MM in – MM out for a chatbot
(Transcription: How about a serene lakeside?)
Background
LLMs excel in NLP tasks.
Straightforward approach: train a multimodal LLM end to end.
Challenges:
• High computational resource requirements.
• Need for large multimodal datasets to maintain LLM performance in language tasks.
Background
Some Explorations:
• Emu (Sun et al., 2023b)
• SEED-LLaMA (Ge et al., 2023b)
• SpeechGPT (Zhang et al., 2023a)
Limitation:
Focus on a single modality.
Background
Any-to-Any Modality:
• NExT-GPT (Wu et al., 2023)
• CoDi-2 (Tang et al., 2023a)
• Unified-IO2 (Lu et al., 2023)
These models use separately pre-trained encoders and decoders.
Drawback:
• Representational inconsistencies between inputs and outputs of the LLMs.
Background
AnyGPT: any-to-any multimodal language model
• Unified Processing: Utilizes discrete representations.
• Versatile Input: Accepts images and audio, converting them into sequences of discrete semantic tokens.
• Capabilities: Maintains perception, understanding, reasoning, and generation at the semantic level in an autoregressive manner.
Overview
MM Tokenizers + LLM + MM De-tokenizers
Advantage:
• Simple Fusion Method: Streamlined integration of modalities.
• Less Complex Structure: Easier to implement and maintain.
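To make the "tokenizers + LLM + de-tokenizers" idea concrete, here is a minimal Python sketch. All names (tokenize_image, detokenize_image, llm_generate, the special token values) are hypothetical stand-ins, not the paper's actual API; they only illustrate how every modality is reduced to one discrete token sequence that a single autoregressive LLM consumes and produces.

```python
# Hypothetical sketch of an AnyGPT-style any-to-any pipeline.
# Every modality is mapped to discrete tokens, processed by one LLM,
# and mapped back by a modality-specific de-tokenizer.

def tokenize_image(image_bytes: bytes) -> list[int]:
    """Stand-in for an image tokenizer (e.g. SEED): image -> discrete codes."""
    return [101, 102, 103]  # dummy codes for illustration

def detokenize_image(codes: list[int]) -> bytes:
    """Stand-in for the image de-tokenizer: discrete codes -> image."""
    return b"<decoded image>"

def llm_generate(prompt_tokens: list[int]) -> list[int]:
    """Stand-in for autoregressive next-token generation by the LLM."""
    return prompt_tokens + [201, 202]  # dummy continuation

# Special tokens marking a modality segment (illustrative values only).
IMG_BOS, IMG_EOS = 9001, 9002

def image_and_text_to_image(user_image: bytes, text_tokens: list[int]) -> bytes:
    # 1. Tokenize the input image into discrete semantic codes.
    image_codes = tokenize_image(user_image)
    # 2. Interleave modalities into a single token sequence.
    prompt = [IMG_BOS, *image_codes, IMG_EOS, *text_tokens]
    # 3. The LLM autoregressively continues the sequence.
    output = llm_generate(prompt)
    # 4. De-tokenize the generated segment back into an image.
    generated_codes = output[len(prompt):]
    return detokenize_image(generated_codes)
```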
Model Details - Image Tokenizer
SEED tokenizer:
• ViT: 16×16 patches
• Causal Q-Former: 32 causal embeddings
• Codebook: sequence of quantized codes
• MLP + SD decoder: restores the generation embedding to the original image
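The slide compresses several steps; the PyTorch sketch below shows only the quantization step that turns the Q-Former's continuous causal embeddings into discrete codebook indices via nearest-neighbour lookup. The shapes (32 embeddings, 768 dimensions, a codebook of 8192 entries) are illustrative assumptions, not the released SEED implementation.

```python
import torch

def quantize(embeddings: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbour vector quantization.

    embeddings: (num_tokens, dim)  continuous causal embeddings from the Q-Former
    codebook:   (codebook_size, dim)
    returns:    (num_tokens,) integer codes usable as LLM tokens
    """
    # Distance between every embedding and every codebook entry.
    distances = torch.cdist(embeddings, codebook)  # (num_tokens, codebook_size)
    return distances.argmin(dim=-1)

# Illustrative shapes: 32 causal embeddings of dim 768, codebook of 8192 entries.
q_former_out = torch.randn(32, 768)
codebook = torch.randn(8192, 768)
image_codes = quantize(q_former_out, codebook)
print(image_codes.shape)  # torch.Size([32]) -> 32 discrete image tokens per image
```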
Model Details - Speech Tokenizer
SpeechTokenizer:
• Single-channel audio sequences are transformed into a discretized matrix.
• Eight hierarchical quantizers:
  • First layer: captures semantic content.
  • Layers 2 to 8: encode paralinguistic details.
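A rough sketch of the hierarchical (residual) quantization idea: each layer quantizes what the previous layers failed to capture, and only the first-layer (semantic) tokens go to the LLM while the remaining layers are left to the voice decoder. The code below is an illustrative residual VQ loop with random codebooks and made-up sizes, not the SpeechTokenizer implementation.

```python
import torch

def residual_vector_quantize(x: torch.Tensor, codebooks: list[torch.Tensor]) -> torch.Tensor:
    """Quantize frames layer by layer; each layer encodes the remaining residual.

    x: (frames, dim) continuous speech features
    codebooks: one (codebook_size, dim) tensor per quantizer layer
    returns: (frames, num_layers) integer codes
    """
    residual = x
    codes = []
    for codebook in codebooks:
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # nearest entry per frame
        residual = residual - codebook[idx]                   # subtract what was captured
        codes.append(idx)
    return torch.stack(codes, dim=-1)

# Illustrative: 8 quantizer layers, as in SpeechTokenizer.
frames, dim, layers, codebook_size = 150, 256, 8, 1024
codebooks = [torch.randn(codebook_size, dim) for _ in range(layers)]
codes = residual_vector_quantize(torch.randn(frames, dim), codebooks)

semantic_tokens = codes[:, 0]   # layer 1: semantic content, fed to the LLM
paralinguistic = codes[:, 1:]   # layers 2-8: timbre/prosody, handled by the decoder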
Model Details - Music Tokenizer
Encodec:
• Quantization using RVQ with four quantizers to efficiently encode audio.
• Music encoding: 5 seconds of music are converted into 250 latent frames.
• Output: a 250 × 4 matrix of codes.
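The numbers on this slide imply a frame rate of 50 latent frames per second (250 frames over 5 seconds) and 250 × 4 = 1000 codes per clip; the snippet below just spells out that arithmetic. The frame rate is inferred from the slide, not quoted from an Encodec configuration.

```python
# Arithmetic implied by the slide: 5 s of music -> 250 latent frames, 4 RVQ quantizers.
clip_seconds = 5
latent_frames = 250
num_quantizers = 4

frame_rate = latent_frames / clip_seconds        # 50 frames per second (inferred)
codes_per_clip = latent_frames * num_quantizers  # 250 x 4 = 1000 discrete codes
print(frame_rate, codes_per_clip)                # 50.0 1000
```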
Model Details - Vocabulary Sampling
Model Details - Backbone
Expanding Vocabulary:
• The vocabulary is expanded with new modality-specific tokens.
Unified Multimodal Language Model:
• Equipped with modality-specific tokenizers; the language model is trained with the next-token prediction loss over the expanded vocabulary.
LLM Choice:
• LLaMA-2 7B
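Expanding the vocabulary amounts to adding new rows to the LLM's token-embedding matrix (and, analogously, to the output head) for the image, speech, and music codes. Below is a minimal PyTorch sketch; the modality vocabulary sizes are illustrative assumptions, and the released AnyGPT code may initialize or tie these weights differently.

```python
import torch
import torch.nn as nn

def expand_embedding(old: nn.Embedding, num_new_tokens: int) -> nn.Embedding:
    """Return a larger embedding whose first rows copy the pretrained text vocabulary."""
    new = nn.Embedding(old.num_embeddings + num_new_tokens, old.embedding_dim)
    with torch.no_grad():
        new.weight[: old.num_embeddings] = old.weight        # keep pretrained text rows
        new.weight[old.num_embeddings:].normal_(std=0.02)    # init new modality-token rows
    return new

# Illustrative sizes: 32k text tokens plus, say, 8192 image + 1024 speech + 4096 music codes;
# hidden size 4096 matches LLaMA-2 7B.
text_embedding = nn.Embedding(32_000, 4096)
multimodal_embedding = expand_embedding(text_embedding, 8192 + 1024 + 4096)
print(multimodal_embedding.num_embeddings)  # 45312
```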
Model Details - Multimodal Generation
Generating high-quality multimodal data:
• Requires a large number of bits, i.e. long token sequences.
• Long sequences demand more computational resources.
Solution: a two-stage approach
• Stage 1: autoregressive LLM modeling for semantic alignment.
• Stage 2: non-autoregressive models generate the high-fidelity multimodal content.
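A schematic of the two-stage decode: the LLM autoregressively emits a short sequence of semantic tokens, and a separate non-autoregressive model then renders them into high-fidelity pixels or audio. Both functions below are hypothetical placeholders that only show the division of labour, not the paper's models.

```python
def stage1_generate_semantic_tokens(prompt_tokens: list[int]) -> list[int]:
    """Stage 1 (autoregressive): the LLM emits a compact sequence of semantic tokens."""
    sequence = list(prompt_tokens)
    for _ in range(32):                                  # one token at a time
        next_token = (sequence[-1] * 31 + 7) % 8192      # dummy next-token rule
        sequence.append(next_token)
    return sequence[len(prompt_tokens):]

def stage2_render(semantic_tokens: list[int]) -> bytes:
    """Stage 2 (non-autoregressive): a decoder turns semantic tokens into
    high-fidelity content in parallel, so the LLM never models raw bits."""
    return bytes(t % 256 for t in semantic_tokens)       # placeholder "rendering"

content = stage2_render(stage1_generate_semantic_tokens([1, 2, 3]))
```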
Dataset Construction - Data Sources
Language-Centric Construction Method:
• Image-Text Datasets:
• LAION-2B
• LAION-COCO
• LAION-Aesthetics
• Speech-Text Datasets:
• Gigaspeech
• Common Voice
• Multilingual LibriSpeech
• Music-Text:
• Crawled over one million music videos and formatted as JSON files.
• Used GPT-4 for caption generation.
Dataset distribution
Dataset Construction - Multimodal Interleaved Instruction Data
Generation of Text-Based Conversations
Text-to-Multimodality Conversion
After filtering, 108K data points were selected.
Experiment
• Image tasks
• Speech tasks
• Music tasks
Experiment
Image tasks:
• Image understanding (benchmark: MS-COCO Captions 2014)
Experiment
Image tasks:
• Image generation (benchmark: MS-COCO Captions 2014)
CLIP Score: measures the similarity between the generated image and the caption of its corresponding real image.
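For reference, CLIP Score reduces to a scaled cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the caption. The sketch below uses random vectors in place of real CLIP features, and a 100x scale with clipping at zero, which is one common convention; the exact scaling used in the paper's evaluation is not specified on the slide.

```python
import numpy as np

def clip_score(image_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Scaled cosine similarity between CLIP image and text embeddings."""
    cos = np.dot(image_embedding, text_embedding) / (
        np.linalg.norm(image_embedding) * np.linalg.norm(text_embedding)
    )
    return float(100.0 * max(cos, 0.0))

# Placeholder embeddings; in practice these come from a pretrained CLIP model.
rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=512), rng.normal(size=512)
print(clip_score(img_emb, txt_emb))
```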
Experiment
Speech tasks:
• Automatic Speech Recognition (benchmark: LibriSpeech)
Experiment
Speech tasks:
• Text-to-Speech (Benchmark: VCTK dataset)
Experiment
Music tasks (MusicCaps benchmark):
Limitation and future work
Author Mentions:
• Enhancing LLMs:
  • Higher loss observed compared to unimodal training. Possible use of Mixture of Experts (MoE)?
• Better Tokenizer:
  • The tokenizer's quality sets a ceiling for the model's comprehension and generative potential.
• Longer Context:
  • The maximum length for music generation is 5 seconds.
Limitation and future work
• Performance: how does it scale with the size of the training dataset?
• Two-stage generation method: this approach may contradict their claim to unify the encoder and decoder.
• Computational complexity: can it be reduced?
Haoxiang Shi, D3 student, Sakai Lab, Waseda University
Thanks for your attention
References
1. Zhan, Jun, et al. "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling." arXiv preprint arXiv:2402.12226 (2024).
2. Sun, Quan, et al. "Generative Pretraining in Multimodality." arXiv preprint arXiv:2307.05222 (2023).
3. Ge, Yuying, et al. "Making LLaMA SEE and Draw with SEED Tokenizer." arXiv preprint arXiv:2310.01218 (2023).
4. Zhang, Dong, et al. "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities." arXiv preprint arXiv:2305.11000 (2023).
5. Wu, Shengqiong, et al. "NExT-GPT: Any-to-Any Multimodal LLM." arXiv preprint arXiv:2309.05519 (2023).
6. Tang, Zineng, et al. "CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
7. Lu, Jiasen, et al. "Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.