π0.5: a Vision-Language-Action Model with Open-World Generalization
A paper by
Let’s talk about
● Transfusion
○ An architecture mixing autoregression and diffusion with a single Transformer
○ https://arxiv.org/abs/2408.11039
● π0
○ A robot foundation model based on Transfusion
○ https://arxiv.org/abs/2410.24164
● FAST
○ An action representation method
○ https://arxiv.org/abs/2501.09747
● π0.5
○ An improved π0 with better embodied reasoning and planning
○ https://www.physicalintelligence.company/download/pi05.pdf
Transfusion
● A single Transformer does diffusion and autoregressive token prediction
● At inference, when the model outputs a BOI (beginning-of-image) token, it switches into image diffusion mode
● N noise tokens are injected and the diffusion process is run on them
● Afterwards, an EOI (end-of-image) token is emitted and autoregressive token prediction continues
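The mode switch described above can be sketched as a toy decoding loop. This is only an illustration under stated assumptions: `generate`, `toy_next_token` and `toy_denoise_step` are hypothetical names, the "model" is a scripted stub, and image patches are represented by single floats rather than latent vectors.

```python
import random

BOI, EOI, EOS = "<BOI>", "<EOI>", "<EOS>"

def generate(next_token, denoise_step, n_patches=4, n_steps=3, max_len=20, seed=0):
    """Toy decoding loop: autoregressive text, with a diffusion phase after BOI."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < max_len:
        token = next_token(sequence)      # autoregressive mode
        sequence.append(token)
        if token == EOS:
            break
        if token == BOI:
            # Diffusion mode: start from n_patches of pure noise...
            patches = [rng.gauss(0.0, 1.0) for _ in range(n_patches)]
            # ...and denoise them step by step, conditioned on the sequence so far.
            for step in range(n_steps, 0, -1):
                patches = denoise_step(sequence, patches, step)
            sequence.extend(patches)
            sequence.append(EOI)          # close the image, resume text mode
    return sequence

# Hypothetical stand-ins for the single Transformer (assumptions, not the real model).
def toy_next_token(seq):
    script = ["a", "cat", BOI, "the", "end", EOS]
    n_text = sum(1 for t in seq if isinstance(t, str) and t != EOI)
    return script[n_text]

def toy_denoise_step(seq, patches, step):
    return [0.5 * p for p in patches]     # pretend each step removes some noise

print(generate(toy_next_token, toy_denoise_step))
```

In the real architecture both branches are served by the same Transformer weights; the stub functions only make the control flow visible.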
π0: A Vision-Language-Action Flow Model for General Robot Control
● VLM extended with an action expert
○ Similar to mixture-of-experts with 2 experts and a special routing method
● The VLM processes the images and the language instruction
● The action expert uses flow matching to generate actions
● The two interact through attention
● Trained to predict robot actions
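To make the flow-matching bullet concrete, here is a minimal sketch of the two pieces involved: building a training target on a straight noise-to-action path, and sampling by Euler integration of a velocity field. The time convention (t = 0 is noise, t = 1 is the action), the function names and the closed-form `toy_velocity` are assumptions for illustration; in π0 the velocity would come from the action expert attending to the VLM, which is not modelled here.

```python
import random

def flow_matching_pair(action, rng):
    """One training example: a point on the noise->action path and its target velocity."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]  # d x_t / d t
    return x_t, t, target_velocity

def sample_action(velocity_fn, dim, n_steps=10, seed=0):
    """Integrate the velocity field from noise (t=0) to an action (t=1) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for the learned network: the exact velocity field
# when the data distribution is a single target action.
TARGET = [0.2, -0.1, 0.7]

def toy_velocity(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]

print(sample_action(toy_velocity, dim=3))
```

Training would regress the network's output at `(x_t, t)` onto `target_velocity`; the sampler is then run for a small fixed number of steps at inference.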
FAST
● Takes inspiration from image encoding techniques (JPEG)
● Compresses action sequences
● Uses the frequency domain (a discrete cosine transform, as in JPEG) to encode the action sequences
● Learns a BPE tokenizer on top
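The compression idea can be sketched in a few lines: transform one action dimension of a chunk with a DCT, then scale and round the coefficients so that small high-frequency terms collapse to zero. The function names and the `scale` value are illustrative assumptions, and the BPE stage that would then be trained on the resulting integers is not shown.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k, c in enumerate(coeffs) if k > 0)
        out.append(s)
    return out

def encode(actions, scale=10.0):
    """Quantize DCT coefficients to integers; near-zero terms become 0."""
    return [round(c * scale) for c in dct(actions)]

def decode(tokens, scale=10.0):
    """Undo the scaling and invert the DCT (lossy because of rounding)."""
    return idct([t / scale for t in tokens])

# A smooth 16-step trajectory for one joint (made-up data).
actions = [i / 15 for i in range(16)]
tokens = encode(actions)
print(tokens)
print(max(abs(a - r) for a, r in zip(actions, decode(tokens))))
```

Smooth trajectories produce many zero or repeated integers, which is what makes a byte-pair-encoding pass on top effective at shortening the token sequence.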
π0.5: A Vision-Language-Action Model with Open-World Generalization
● Same VLM/Action Expert architecture
● First trained to predict VQA-style information
○ Uses the FAST tokenizer to predict actions autoregressively
● Then the action expert is added and the model is post-trained to output continuous actions
Interesting bits
● π0.5 can do simple planning, which π0 could not
○ π0 could be combined with other methods
○ Using GPT-4 as a planner doesn’t perform very well
● π0.5 has a stronger focus on cross-environment generalization than π0
● Evaluations use a scoring system that gives credit for partial success
○ Many evaluations of robotic systems are binary, which can be difficult to interpret when the goals are complex
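The difference between binary and partial-credit evaluation can be illustrated with a small sketch. The rubric below (equal-weight sub-goals for a made-up "put the dishes in the sink" episode) is a hypothetical example, not the rubric used in the paper.

```python
def binary_score(steps_done):
    """1.0 only if every sub-goal was achieved, else 0.0."""
    return 1.0 if all(steps_done.values()) else 0.0

def progress_score(steps_done, weights=None):
    """Weighted fraction of sub-goals achieved (equal weights by default)."""
    weights = weights or {name: 1.0 for name in steps_done}
    total = sum(weights.values())
    earned = sum(weights[name] for name, done in steps_done.items() if done)
    return earned / total

# Hypothetical episode: the robot handled three of four sub-goals.
episode = {
    "pick up plate": True,
    "place plate in sink": True,
    "pick up cup": True,
    "place cup in sink": False,
}

print(binary_score(episode))    # 0.0
print(progress_score(episode))  # 0.75
```

The binary score treats this episode the same as one where the robot never moved, while the progress score shows how far the policy got on a long, multi-stage task.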
Let’s look at the robots
How to evaluate VLMs’ embodied reasoning capabilities?
● Embodied reasoning is becoming more and more popular
● We can use the ERQA benchmark
○ https://github.com/embodiedreasoning/ERQA
○ Comes from Gemini Robotics
● Current scores:
Qwen2.5-VL-3B-Instruct