[CVPR24 Vision Foundation Model Tutorial] LMMs for Grounding — Haotian Zhang (auto-generated transcript)

The document discusses advancements in large language models (LLMs) with a focus on multimodal capabilities, particularly in grounding visual features. It highlights issues such as object hallucinations and spatial understanding in current models and suggests that visual grounding could enhance performance and reliability. The presentation also covers various models and frameworks, emphasizing the integration of grounding techniques to improve user interactions and applications in fields like robotics and medical assistance.


Hi, I'm Haotian. I'm also a researcher on the Apple AI/ML team, the same team as the previous speakers. Today it's my great honor to present on large language models with fine-grained grounding capabilities.

The previous two talks have already introduced the recent developments in multimodal large language models and covered a lot of details about a series of state-of-the-art works, starting from Flamingo through recent open-source models like LLaVA and SPHINX, as well as closed-source models such as GPT-4 and Gemini. All of these models follow the same trend: they take an image as input, encode it into visual features, flatten those features into the large language model, and the language model then interprets the input image based on the input text instructions and produces a textual response. Quite a lot of models follow this recipe, and because they require the global image as input, I refer to them as global-image-perceiving multimodal LLMs.

Well, let's take a look at how these models work.

I simply fed this year's conference logo into MM1. I know plenty of people don't have access to it, but I have the pleasure, so let's see what the model says. It mentions a lot of things, like the Space Needle and the word CVPR, which shows the model has very strong pattern-recognition capabilities as well as very strong OCR. Beyond that, because there is a large language model as a very strong backbone, the generated story also reads far more reasonably than the single captions produced by traditional vision-language models.

So is this everything you want to see? I don't think so. If the image content becomes richer and you ask, for example, "please describe this image in detail," the model outputs a long passage of text, but it also makes some incorrect claims. For instance, it says there are some benches and a fence in the background, which I cannot see. And if you ask what the girl in the picture is doing, it says other people are taking pictures around her; that is probably right, but not many people would guess so. We call this object hallucination, and we think object hallucinations are inherited from the biases of the large language model. We usually divide them into category hallucinations, attribute hallucinations, and relation hallucinations.

Some other things are also not perfect. For example, the recent GPT-4 supports something called a visual prompt, where you can draw a circle on the image. Here it's a bit small, but there is a circle around some part of a mechanical bicycle. If you draw a circle around that region and ask the model what its function is, the model does generate something, but it doesn't make a lot of sense: if you know mechanical bicycles, the circled region is actually the shock absorber, but the model says it is some kind of gas bumper.

Some other issues: suppose you want to know the locations of specific regions mentioned in the model's output. Take a CAPTCHA example, the kind you always see in Google verifications, and ask "where are the traffic signs in this image?" ChatGPT actually produces reasonable-sounding text, but if you draw the predicted boxes onto the image, you will see they don't make a lot of sense: the bounding boxes have noticeable offsets, and I also don't think GPT really understands bounding boxes given as input.

So up to here, I don't think the models work quite well for such cases. If I were speaking at an NLP conference — I know NAACL is ongoing right now — I would say, okay, you have such nice large language models, and I really appreciate that. But I am speaking at CVPR, and a lot of people here work on the vision side. As a vision person, I don't quite buy this, because there are a lot of hallucinations: I cannot actually see a boy or a girl in the background, and if I really want to know the details of this image, I cannot find any clues in the output to verify them.

Then we started to think about how vision should play its role in multimodal large language models. To reduce the issues above, we think visual grounding may actually be able to help. For those who may not know the concept of visual grounding, I can start from traditional computer vision tasks. On the localization side we have object detection and instance segmentation; a lot of people also work on vision-language pre-training, and you are familiar with visual question answering and image captioning. We think visual grounding is a kind of bridge between these, and a lot of the literature has already shown that, in order to achieve general-purpose vision models, you have to unify the localization tasks with the vision-language understanding tasks.

Here are a bunch of previous works; I will introduce three of them, starting from the GLIP and GLIP-v2 papers, which I would call a unified framework for detection and grounding. If you look at the architecture, the GLIP models have two encoders: the lower one is a Swin Transformer that encodes the image features, and there is a BERT model that produces the text features. The features output by the two encoders go through some multimodal fusion blocks. After the fusion, the text side consists of word embeddings from the BERT model, while the vision side produces region embeddings from the region proposals, and these features are used for contrastive image-text alignment. In the end the features feed an alignment loss as well as a localization loss. I regard this as a pioneering model for learning object-level, language-aware, and semantically rich representations. It benefits from detection data and grounding data, and can also use pseudo-labeled web image-text data for scaling up; together, these data boost the grounding performance.
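To make that objective concrete, here is a minimal PyTorch sketch of the region-word alignment just described. It is my own simplification, not the released GLIP code: the shapes, the binary target matrix, and the plain L1 box loss are illustrative assumptions.

```python
# Minimal sketch of GLIP-style region-word alignment: region embeddings from
# the image branch are scored against word embeddings from the text branch,
# supervised with an alignment loss next to a localization (box) loss.
import torch
import torch.nn.functional as F

def glip_style_losses(region_feats, word_feats, target_alignment,
                      pred_boxes, target_boxes, temperature=0.07):
    # region_feats: (N, D) embeddings of N region proposals (after fusion)
    # word_feats:   (M, D) embeddings of the M caption words (after fusion)
    # target_alignment: (N, M) binary matrix, 1 where a region matches a word
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # Every proposal is scored against every word, not against a fixed label set.
    logits = region_feats @ word_feats.t() / temperature          # (N, M)
    alignment_loss = F.binary_cross_entropy_with_logits(
        logits, target_alignment.float())

    # Localization loss on the predicted boxes (plain L1 here for simplicity).
    localization_loss = F.l1_loss(pred_boxes, target_boxes)
    return alignment_loss + localization_loss

# Toy usage: 8 proposals, 12 caption words, 256-d features.
loss = glip_style_losses(torch.randn(8, 256), torch.randn(12, 256),
                         torch.randint(0, 2, (8, 12)),
                         torch.rand(8, 4), torch.rand(8, 4))
```

The point of the design is that classification becomes phrase matching, which is what lets detection data and grounding data be trained together.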

You can see there are a bunch of tasks that GLIP and GLIP-v2 can already accomplish. First is traditional object detection, like COCO, and segmentation, like COCO mask; they can also do phrase grounding on Flickr30k, zero-shot transfer to in-the-wild settings like object detection in the wild, some simple reasoning segmentation, and traditional vision-language understanding such as visual question answering and COCO captioning.

There is a follow-up work, Grounding DINO. I think a lot of people are familiar with it; so far many people use it as a tool for open-vocabulary, text-conditioned object detection. I think of Grounding DINO as an improved version of GLIP and GLIP-v2 that mostly focuses on boosting performance. Grounding really comes down to two aspects: the first is semantic understanding of the input query, and the second is that we need to find the bounding box, which requires very strong localization capabilities. Grounding DINO does not use the RetinaNet-style Dynamic Head that GLIP used; it uses a superior Transformer-based detector, DINO, to generate the bounding boxes. Second, they also found in their experiments that more and better image-text feature fusion leads to a better understanding of the semantic meaning. Incorporating these two ingredients leads to very superior grounding performance.

Later there is a work called SEEM, whose full name is Segment Everything Everywhere All at Once — a very grand picture, right? I consider it a single generalist approach for pixel-level understanding. It does not output region-level results like bounding boxes; it focuses on pixel-level, semantic understanding, i.e., segmentation masks. SEEM also introduces a bunch of visual prompts: besides outputting semantic masks or boxes, it can take a set of visual prompts as input, not only text. You can feed in points, boxes, scribbles, or whatever free-form shape you like. They also introduce something called a memory prompt: because they use a decoder architecture, this lets the model interact with the user over multiple rounds, since the previous inputs are kept as a kind of memory. And, as I just mentioned, it can also assign a semantic label to any predicted segmentation mask instead of only outputting boxes.

These works are all from before this year, up until 2023, and you can see the trend. I have listed a bunch of related works and only covered three of them, but there are only small differences between them. They use an external detector to improve localization performance; they need to understand the query input, so they need strong encoders, especially text encoders, generally BERT-style self-attention encoder models; and the third thing about this area is that everybody is focusing on unification. The unification I listed here is only about inputs and outputs — you can feed in different visual prompts and generate free-form outputs such as bounding boxes or segmentation masks — but there is also a unification trend in architecture: a lot of works are throwing away convolutional architectures and using Transformers as the mainstream.

I've covered several works, so let's take a short break and give ourselves some time to think about what we have heard so far. First, we have seen that typical vision-language models, or multimodal language models, suffer from two severe issues: object hallucinations and weak spatial understanding. Second, we think the traditional visual grounding concept may alleviate these issues. Third, I have listed several existing works, introduced some details about them, and pointed out some ongoing trends. But the entire field is changing really fast with the breakthrough of large language models; I see many people moving from traditional Transformer pipelines to LLM-based ones.

So the questions become: how will grounding benefit large language models, what magic can combining grounding with LLMs bring us, and — most importantly — how do we integrate grounding into large language models?

First of all, I want to explain why we think there is potential in combining spatial understanding with multimodal models. First, of course, it enables some new functions. Because the models can take regions as input, the user is able to specify which regions or which objects they mean and ask the model for help. Also, the textual output alone is plain text, so we often don't know which object or which thing in the image it is actually talking about; we also need the model to localize or ground particular objects in the response, to help the user find them or to guide them.

Second, grounding capabilities can also help build much better vision-language models. For instance, because the model outputs bounding boxes for the things it mentions, it reduces the possibility of hallucinations in the text output. It also becomes more trustworthy, because we have grounded evidence for what the model is talking about. And we get a more open-vocabulary concept space, because very large-scale pre-training — whether through contrastive losses or other types of pre-training — covers more concepts.

The third thing I want to mention is that it enables a lot of new applications. One thing I can think of is the phone and VR — the Vision Pro, and Apple Intelligence announced recently. A second is 3D embodiment: these grounding capabilities can help a robot guide itself and find or grab whatever it needs. And a third is medical assistance; I think the earlier talk already mentioned the LLaVA-Med work.

Since we are combining with multimodal large language models, our problem is slightly different from the traditional visual grounding setup. I divide spatial understanding into two different tasks.

The first is called referring. In a multimodal LLM we can only feed in text, or more precisely tokens, as input, so I formulate the input representation as image tokens, text instruction tokens, and additionally some region tokens; the region tokens can either be continuous features from a region encoder or simply discrete coordinates. The model is required to understand the referred regions and respond to the instructions. The input regions can be of different types, just as we saw in the SEEM paper: a point, some bounding boxes, or scribbles. Take this example, where you ask "what is in this region and what is it used for?" The earlier SEEM cannot handle this, because it does not have an LLM generating the output; but if we combine grounding with language models, this triggers the multimodal reasoning capabilities, and the model responds in a more natural style, the way we communicate with chat models. The second example asks which movie characters are in region one and region two; a traditional vision-language model cannot do that, because it has never seen these Harry Potter characters.

The second task concerns outputs, which we refer to as grounding. The output representation is much simpler, because it can just be a text response with the regions embedded in it. The model is required to localize the objects in the image whenever it mentions them in the response. For example, unlike a text-only model: if I ask how to make a sandwich, a text model will produce some generic recipe, and you may not have all of those ingredients available. But with a grounded multimodal LLM you can ask how to make a sandwich with the ingredients available in this image: the model will first recognize what is there, generate a recipe based on that, and also tell you exactly which items it is referring to.

There are a bunch of works on fine-grained, region-level multimodal models, which differ from the global-image-perceiving multimodal LLMs we mentioned earlier. There are several works, like GPT4RoI, Ferret, Kosmos-2, Shikra, and more. In addition to the image and text as input, these models also accept regions or pixels as input, and for the output they can generate regions as well as text simultaneously.

I want to introduce two pioneering works, which were among the first to implement this kind of concept. The first is Kosmos-2, which a lot of people are familiar with. Kosmos-2 focuses on grounding: you can see they have location tokens as input, and they also produce location tokens in the output. In other words, they add new vocabulary entries alongside the original text tokens inherited from the language model. Because the model needs to learn good representations for these new location tokens, they constructed a web-scale grounded image-text dataset, adding region annotations to the original image-text data, trained the model on it, and it turns out to work pretty well.

The second is Shikra, which I think is concurrent work with Kosmos-2. The way Shikra does grounding is slightly different: it also produces grounding as output, but instead of using location tokens it writes plain numerical coordinates, like 0.392, directly in the text, the same way the language model emits any other text. They also present a very interesting study comparing location tokens against pure numerical numbers: if you directly use numerical numbers you do not lose performance, and because there is no additional vocabulary to learn, you save a lot of training time, you do not need to construct a web-scale grounded image-text corpus first, and the resulting representation is arguably more elegant.
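As a concrete illustration of the two formats, here is a small, self-contained sketch. It is illustrative only: the bin count and the token spelling are assumptions, not the exact vocabularies used by Kosmos-2 or Shikra.

```python
# Two ways to serialize the same box in a grounded response:
# (a) Kosmos-2-style discretized location tokens added to the vocabulary, and
# (b) Shikra-style plain numeric coordinates written directly as text.
NUM_BINS = 32  # e.g. a 32x32 grid of location tokens over the image (assumed)

def to_location_tokens(box, num_bins=NUM_BINS):
    """box = (x1, y1, x2, y2), normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    tl = int(y1 * (num_bins - 1)) * num_bins + int(x1 * (num_bins - 1))
    br = int(y2 * (num_bins - 1)) * num_bins + int(x2 * (num_bins - 1))
    return f"<loc_{tl}><loc_{br}>"          # two new vocabulary tokens

def to_numeric_text(box):
    """Same box written as ordinary numbers in the text stream."""
    return "[" + ",".join(f"{v:.3f}" for v in box) + "]"

box = (0.392, 0.250, 0.610, 0.880)
print("a bottle", to_location_tokens(box))   # "a bottle <loc_236><loc_882>"
print("a bottle", to_numeric_text(box))      # "a bottle [0.392,0.250,0.610,0.880]"
```

The trade-off discussed above is visible here: option (a) requires new embeddings to be learned for every location token, while option (b) reuses the tokenizer as-is.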

Another work is called Ferret. I think of Ferret as a more unified model: it does not only take discrete location tokens or text tokens as input, it can also take different kinds of features, different kinds of prompts, into the model. We define a hybrid region representation as the region name plus discrete coordinate tokens plus continuous features. For a point or a box we can represent the region with discrete coordinates, but what if the user draws something like the tail of a cat? You need something beyond coordinates to represent, or better represent, that region. An even more severe case is the example on the right, where there is a gun on the left and a knife on the right: if you use pure bounding-box coordinates, the two boxes largely overlap and would be treated as the same region. So in addition to coordinates, we also introduce continuous features, which align much better with the thing the person is actually referring to.

For the discrete tokens, we represent a point by its x, y coordinates; a box, or the bounding box attached to a free-form shape, by x1, y1, x2, y2, i.e. the top-left and bottom-right points. We tokenize these just as Shikra does, with the language model's tokenizer, and feed them into the model. The second part is the continuous visual features, which is different from the previous two works: we introduce a visual sampler that extracts and summarizes the visual features of the referred region — from a point to a scribble to any free-form shape — flattens them into a single embedding, and feeds it in alongside the discrete coordinates. In an example of this data, if we ask "what is in this region?", the prompt contains the discrete coordinate tokens plus a special token that stands for the continuous region features, and the output is a grounded response that includes bounding boxes.

For the model architecture, we slightly modify the LLaVA-style architecture: the image encoder is CLIP ViT-L/14, and for the language model we used the open-source Vicuna v1.3 at that time. We also have a specific module we call the spatial-aware visual sampler, which takes in the hybrid representations and helps generate the grounded outputs. We use next-token prediction as the loss, because we do not introduce any extra vocabulary. We freeze the image encoder and update the rest of the modules.

Let me give a bit more detail on how the spatial-aware visual sampler is constructed. We use something very similar to PointNet. We first sample 512 points inside the region, taking their features from the CLIP feature map. Then we have two blocks that we go through one by one: in each block we first downsample the number of points, then find the k nearest neighbors of each retained center; because we think the center point should contribute more to the final result than the surrounding features, we fuse the neighbor features into the center and then do a pooling step. In the end we have downsampled from 512 to 32 points, and we linearly project these features into the LLM embedding space, where they are concatenated with the discrete coordinate tokens and fed into the language model.
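Here is a rough PyTorch sketch of that sampler, purely to illustrate the point sampling, k-nearest-neighbor fusion, and pooling steps just described. The random downsampling, layer sizes, and the final flatten-and-project step are my own simplifications, not the released Ferret code.

```python
import torch
import torch.nn as nn

class PointBlock(nn.Module):
    """Downsample points, gather k nearest neighbors, fuse and pool their features."""
    def __init__(self, dim, keep, k=8):
        super().__init__()
        self.keep, self.k = keep, k
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz, feats):
        # xyz: (N, 2) normalized point coordinates, feats: (N, D) features at those points
        idx = torch.randperm(xyz.size(0))[: self.keep]      # simplified downsampling of centers
        centers, center_feats = xyz[idx], feats[idx]
        knn = torch.cdist(centers, xyz).topk(self.k, largest=False).indices  # (keep, k)
        neighbor_feats = self.mlp(feats[knn])                # (keep, k, D)
        # Pool the neighborhood and let the center feature dominate the result.
        return centers, neighbor_feats.max(dim=1).values + center_feats

class SpatialAwareVisualSampler(nn.Module):
    def __init__(self, feat_dim=1024, llm_dim=4096, out_points=32):
        super().__init__()
        self.block1 = PointBlock(feat_dim, keep=128)
        self.block2 = PointBlock(feat_dim, keep=out_points)
        # Flatten the remaining point features into one continuous region embedding.
        self.proj = nn.Linear(out_points * feat_dim, llm_dim)

    def forward(self, region_xyz, region_feats):
        # region_xyz: (512, 2) points sampled inside the referred region,
        # region_feats: (512, D) CLIP feature-map values at those points.
        xyz, feats = self.block1(region_xyz, region_feats)
        xyz, feats = self.block2(xyz, feats)
        return self.proj(feats.flatten())                    # (llm_dim,) region embedding

# Toy usage: 512 sampled points with 1024-d CLIP features.
region_token = SpatialAwareVisualSampler()(torch.rand(512, 2), torch.randn(512, 1024))
```

The resulting continuous embedding is what stands in for the special region-feature token in the prompt, next to the discrete coordinates.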

To train this model, the dataset is slightly different from LLaVA-style datasets, which are essentially pure image-text pairs. We introduced the GRIT dataset, which we call a ground-and-refer instruction-tuning dataset. It has a more hierarchical, unified format, a bunch of instruction-following data, and some samples for robustness. The total is not that large, only about 1.1 million samples. They mostly focus on the object level, where you have regions as input and regions as output; you also have relationships, where you can give multiple bounding boxes and ask about the relationship between them; you have regions that are not objects but region parts; and we also used GPT-4 to curate data that injects reasoning about local region concepts. Finally, for robustness, if the queried object does not exist in the image, we ask the model to answer "no."

Here are some few-shot examples. Different from the LLaVA style, we do not only feed in the bounding boxes and the object names; we also feed in region descriptions and relationships. Everything is formatted into text, and then we prompt ChatGPT to produce a set of conversations, which we use as the ground truth to train our model.

Another thing I want to mention is spatial negative mining. We have two types of negative data. The first we call image-conditioned category localization: we ask the model to localize a class that is in the common vocabulary but not actually in the image. For example, if there is no cat in the image because there is only a dog, the model should output "no, there is no cat, but there is a dog in this image." The second is semantic-conditioned category localization, where we want the representations to be discriminative at a finer granularity. For example, if the dog is not a husky or a golden retriever, we ask the model whether there is a golden retriever in the image. Most grounding models would happily decide "I found something very similar to a dog here" and return a box, but that is not ideal, so we ask the model to reject an object class that is only semantically close to an existing object in the scene.
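A small illustrative script for the two kinds of spatial negatives described above — the field names, vocabularies, and phrasing are my own assumptions, not the actual data pipeline.

```python
# Image-conditioned negatives ask about a common category that is absent from
# the image; semantic-conditioned negatives ask about a fine-grained class that
# is close to, but not the same as, an object that is present.
import random

COMMON_VOCAB = ["cat", "dog", "person", "car", "bicycle"]
FINE_GRAINED = {"dog": ["golden retriever", "husky"], "car": ["taxi", "pickup truck"]}

def make_spatial_negatives(present_objects):
    """present_objects: list of category names annotated in the image."""
    samples = []

    # 1) Image-conditioned: a common class that does not appear in the image.
    absent = [c for c in COMMON_VOCAB if c not in present_objects]
    if absent:
        cls = random.choice(absent)
        samples.append({
            "question": f"Where is the {cls} in the image?",
            "answer": f"There is no {cls} in the image.",
        })

    # 2) Semantic-conditioned: a fine-grained class close to a present object.
    for obj in present_objects:
        for sub in FINE_GRAINED.get(obj, []):
            samples.append({
                "question": f"Is there a {sub} in the image? If so, localize it.",
                "answer": f"No, there is no {sub}; there is a {obj} in the image.",
            })
    return samples

print(make_spatial_negatives(["dog", "person"]))
```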

For the evaluation, we have conventional tests as well as open-ended evaluation. For the referring test, we first mark out a region and then ask the model to predict its class; we want to evaluate how accurately the model can do this. We have three input formats: box, point, and free-form shape. We use LVIS because it has such a large vocabulary; since the test is posed as a binary classification, the random-guess baseline for point, box, and free-form inputs is around 50. Compared with the other models I mentioned — GPT4RoI, Kosmos-2, and Shikra, which can take discrete inputs — Ferret outperformed them at the time.

We also have the conventional object grounding and phrase grounding benchmarks, where we focus on the RefCOCOg sets and the Flickr30k phrase grounding set. At that time, compared with earlier specialist grounding models as well as Shikra, Ferret outperforms them.

Another thing: previously we mentioned the conventional vision tasks of referring and grounding, but with the help of the language model the system can also do very interesting reasoning, so we want to focus on fine-grained reasoning capabilities. We introduce three tasks. The first we call referring description: describe a referred region based on its interactions with the surrounding objects. We also have referring reasoning: reason on top of one or more referred regions and check whether the conclusion is correct. And third, grounding in conversation: when the model outputs grounding in its conversation, we check whether the output boxes actually correspond to the objects it mentions.

You can see qualitative examples of the output. For instance, given this circled or boxed region, "what is the purpose of the object?" — and we also draw the bounding box around it. The ground truth, generated by feeding the bounding box and the region description into GPT-4, says the object is a bottle and reasons about what a bottle is typically used for. If you feed the question into LLaVA, which has no spatial understanding, it basically hallucinates. Kosmos-2 can accept the referring input, but it is not really accurate: it says the purpose of the object is to attract the bird to the table, which is not exactly right. Shikra's answer does not make sense at all: it says the thing is there to keep the birds away. Our Ferret says the object is a bottle, generally used to store or dispense liquids like water, juice, or other beverages, which is very close to the ground-truth output.

We also compared with GPT-4V at that time. On the referring and grounding tasks, Ferret actually outperforms GPT-4V, because GPT-4V at the time did not have much spatial understanding. For referring, we show the mechanical-bike example again: for the larger regions GPT-4V gives plausible answers, but for very small regions, like the shock absorber I showed earlier, it does not produce very reasonable outputs. The same goes for the CAPTCHA example: if you ask where the traffic lights are, Ferret gives a set of discrete outputs, and if you draw them onto the image you can see it finds most of them, so you could click them and pass this kind of verification.

Recently we introduced another work, Ferret-v2, an improved baseline for referring and grounding with large language models. It shows substantial improvements on fine-grained region-level tasks, and we can also see that the region-level benefits transfer to global image-level tasks. On the benchmarks, comparing v2 with v1, it shows very substantial improvements. For the qualitative examples on the left, on referring: if I refer to a very, very small region on the left side — I'm not sure you can see it clearly — the v1 model gives an answer that is roughly correct but slightly off, while Ferret-v2 is able to predict exactly what is in that tiny region. For grounding, if you ask something like "is there anything that can keep people cool in the summer?", one model detects the vehicles, presumably because a vehicle has air conditioning, but that is not what people want. Ferret-v2 generates more precise grounding outputs: when you ask where the air conditioners are that keep people cool during the summer, it finds all of the very small air-conditioning units on the buildings.

So how did we achieve that? First of all, from our earlier work on traditional vision-language pre-training we know two things are critical. The first is that you want a better detector; but because we output pure text, there are no detection embeddings and we cannot bolt on any detection heads — keeping the architecture unified is more elegant that way. The other lever is to scale up the image resolution, and I think this matters most in the multimodal LLM setting: resolution is even more critical for fine-grained tasks than for global-level tasks. With current large models there are typically two ways to do it. The first I call direct upsampling: you take a larger image as input and, because CLIP was pre-trained at a fixed resolution (e.g., 224x224), you interpolate the position embeddings to the new size and then, during training, backpropagate into the vision encoder and fine-tune the pre-trained image encoder. The other way is "any resolution": you keep the image backbone frozen but can still use any input resolution. As introduced in the earlier talks, you split the image into multiple patches, feed each patch into the CLIP encoder, obtain the global view as well as the sub-patches, concatenate them, and feed them into the language model simultaneously.
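A minimal sketch of that second, any-resolution option follows — the tile size and grid are assumptions, not the exact implementation from the talk: resize a global view to the encoder's native input size and cut the high-resolution image into a grid of native-size sub-patches, all of which can be encoded by the frozen backbone.

```python
import torch
import torch.nn.functional as F

def split_any_resolution(image, native=336, grid=2):
    """image: (3, H, W) high-resolution input; returns a global view and grid*grid tiles."""
    global_view = F.interpolate(image[None], size=(native, native),
                                mode="bilinear", align_corners=False)[0]
    # Resize to an exact grid of native-size tiles, then cut the tiles out.
    canvas = F.interpolate(image[None], size=(native * grid, native * grid),
                           mode="bilinear", align_corners=False)[0]
    tiles = canvas.unfold(1, native, native).unfold(2, native, native)
    sub_patches = tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, native, native)
    return global_view, sub_patches      # (3, 336, 336), (grid*grid, 3, 336, 336)

g, subs = split_any_resolution(torch.rand(3, 1008, 756))
print(g.shape, subs.shape)
```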

We did a solid analysis of these options. We chose the referring task, the grounding task, and some global tasks that still require fine-grained understanding, such as TextVQA, plus the open-ended Ferret-Bench, and tried to see which option is better. Through these studies we find that, generally, if you want to scale up resolution for LLM-based models, option two outperforms option one by a lot. This also makes sense: if you fine-tune the image backbone, you partially destroy the semantic pre-training of the CLIP model, which really hurts performance when you transfer to open-ended dialogue. And second, because in Ferret the grounding is emitted purely as text and everything is handled by the large language model, the any-resolution scheme combines with grounding effortlessly.

Another thing is multi-granularity vision encoding. Vision people like to study whether other backbones could help; previously we used the CLIP encoder for most of this work, so we tried swapping the CLIP encoder for a DINOv2 encoder while keeping the architecture exactly as in Ferret, and it did not help much. We thought about why DINOv2 does not beat CLIP on these tasks, and it turns out that with any-resolution scaling we can instead merge the two kinds of features. Because of the any-resolution scaling we divide the image into different patches, and because CLIP's pre-training only ever sees the global image paired with a whole caption, the CLIP encoder is most beneficial for encoding the global patch. For the split sub-patches, CLIP never saw these kinds of crops or partial images during pre-training, so it may not work that well there; DINOv2, on the other hand, uses different augmentations in its self-supervised pre-training, so it is better at triggering localization on these crops. So we use DINOv2 to encode the sub-patches, concatenate the global-patch features with the sub-patch features, and feed all of them into the large language model. This is a specific design of Ferret-v2.
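A rough sketch of that two-encoder arrangement — the module names, dimensions, and plain linear projectors are assumptions, and the encoders here are frozen stand-ins so the example stays self-contained; in practice they would be pretrained CLIP and DINOv2 ViTs.

```python
# The resized global view goes through CLIP, the sub-patches go through DINOv2,
# and each stream has its own projector into the LLM embedding space before the
# visual tokens are concatenated.
import torch
import torch.nn as nn

class DummyViT(nn.Module):
    def __init__(self, dim, tokens=576):
        super().__init__()
        self.dim, self.tokens = dim, tokens
    def forward(self, images):                      # (B, 3, H, W) -> (B, tokens, dim)
        return torch.randn(images.size(0), self.tokens, self.dim)

class MultiGranularityEncoder(nn.Module):
    def __init__(self, clip_dim=1024, dino_dim=1024, llm_dim=4096):
        super().__init__()
        self.clip, self.dino = DummyViT(clip_dim), DummyViT(dino_dim)
        self.proj_global = nn.Linear(clip_dim, llm_dim)  # projector 1 (CLIP -> LLM)
        self.proj_sub = nn.Linear(dino_dim, llm_dim)     # projector 2 (DINOv2 -> LLM)

    def forward(self, global_view, sub_patches):
        g = self.proj_global(self.clip(global_view[None])[0])     # (N_g, llm_dim)
        s = self.proj_sub(self.dino(sub_patches)).flatten(0, 1)   # (tiles*N_s, llm_dim)
        return torch.cat([g, s], dim=0)                            # visual tokens for the LLM

tokens = MultiGranularityEncoder()(torch.rand(3, 336, 336), torch.rand(4, 3, 336, 336))
```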

The third thing: because we have an additional DINOv2 encoder, there is a second projector between that encoder and the language model, and we need a way to train it. Previously we only had the first stage, the alignment stage, and the final stage, the instruction-following stage. For Ferret-v2 we introduce another stage in between, which we call high-resolution dense alignment, for two reasons: the transition from low-resolution pre-training to high-resolution pre-training needs some data, some phase, to adapt; and the second projector, which aligns the DINOv2 encoder with the language model, needs an initialization. So we build data from LVIS annotations in a captioning/referring flavor. For example, we construct dense referring data: the prompt is "please classify the objects in the following locations," followed by a long list of dense regions, and the answer contains the classification labels written out as text. We also have dense grounding: the question is "please localize all the visible objects in the image in raster scan order," and the answer lists all the objects in that order, each with its coordinates attached — for example, first a cat with its coordinates, then a dog with its coordinates, and so on.
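A small illustrative script for building those two kinds of dense-alignment samples from box annotations — the field names and exact phrasing are assumptions, not the actual data pipeline.

```python
# Boxes are (x1, y1, x2, y2) normalized to [0, 1]; objects are sorted by the
# top-left corner (row first) to approximate a raster-scan order.
def fmt(box):
    return "[" + ",".join(f"{v:.2f}" for v in box) + "]"

def dense_samples(annotations):
    # annotations: list of {"label": str, "box": (x1, y1, x2, y2)}
    objs = sorted(annotations, key=lambda a: (a["box"][1], a["box"][0]))

    referring = {
        "question": "Please classify the objects in the following locations: "
                    + "; ".join(fmt(a["box"]) for a in objs),
        "answer": "; ".join(a["label"] for a in objs),
    }
    grounding = {
        "question": "Please localize all visible objects in the image in raster scan order.",
        "answer": ", ".join(f'{a["label"]} {fmt(a["box"])}' for a in objs),
    }
    return referring, grounding

ref, grd = dense_samples([
    {"label": "dog", "box": (0.55, 0.40, 0.90, 0.95)},
    {"label": "cat", "box": (0.05, 0.10, 0.35, 0.60)},
])
print(ref["question"])
print(grd["answer"])
```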

There are several benefits, as I have already mentioned: this stage trains the second projector; it gives a smoother transition between low-resolution and high-resolution pre-training; and because the additional spatial-aware visual sampler needs continuous region features to train, this second stage also gives the spatial-aware visual sampler a better initialization.

Looking at the performance: this time we compare with even stronger grounding models, including Grounding DINO, which was previously the state of the art for phrase grounding, and we actually outperform it. On the conventional referring benchmark we only have LVIS, but LVIS images are only around 600 to 800 pixels in resolution, and we wanted to see whether Ferret-v2 can find really small objects at even higher resolutions, so we also took images from SA-1B, which were collected for segmentation; on that referring benchmark we also get very substantial improvements. Thirdly, we transfer this fine-grained ability to general tasks, and we can see that the fine-grained training also brings a lot of benefit on the conventional multimodal benchmarks as well.


All right. I have talked about a lot of region-level multimodal LLMs; of course there are also other fine-grained, pixel-level multimodal LLMs. Here I introduce VistaLLM, which does not only predict the bounding box but also predicts the mask. The mask is tricky, because if you represent it as a polygon you end up with a very, very long context. VistaLLM introduces a very interesting technique called adaptive sampling: they downsample the points of the polygon and recover the full shape afterwards, so they still use text as the output but represent the mask in a much shorter context.
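To give a feel for why this helps, here is a tiny illustration of turning a long polygon contour into a fixed, much shorter sequence of points that can be serialized as text. Note this is my own simplification using uniform arc-length resampling; the actual method described above is adaptive and weights informative points more.

```python
import numpy as np

def resample_polygon(points, k=16):
    """points: (N, 2) polygon vertices; returns k points spaced evenly along the contour."""
    pts = np.asarray(points, dtype=float)
    closed = np.vstack([pts, pts[:1]])                     # close the contour
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)  # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    targets = np.linspace(0.0, cum[-1], k, endpoint=False)
    out = []
    for t in targets:
        i = np.searchsorted(cum, t, side="right") - 1
        r = (t - cum[i]) / max(seg[i], 1e-8)               # interpolate inside segment i
        out.append(closed[i] * (1 - r) + closed[i + 1] * r)
    return np.asarray(out)

# A 200-vertex contour becomes 16 points, short enough to write as text tokens.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour = np.stack([0.5 + 0.3 * np.cos(theta), 0.5 + 0.2 * np.sin(theta)], axis=1)
short = resample_polygon(contour, k=16)
print(" ".join(f"({x:.2f},{y:.2f})" for x, y in short))
```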

Other works, which I think are concurrent, are LISA and GLaMM. These two also focus on pixel-level outputs, i.e. segmentation masks, but the difference is that they do not use a sequence of text to decode the mask. Instead they train a decoder on top of SAM: from the large language model's output you get an embedding, and this embedding can be effortlessly combined with the SAM decoder — you can think of it as a prompt coming from the large language model. Compared with a plain visual prompt, this prompt potentially carries more reasoning, because it comes out of the LLM, so these two models are also able to perform reasoning segmentation.
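Here is a schematic sketch of that embedding-as-prompt idea. All module names and shapes here are placeholders of my own; the real systems use a pretrained SAM decoder and a dedicated segmentation token in the LLM's vocabulary.

```python
import torch
import torch.nn as nn

class ToyMaskDecoder(nn.Module):
    """Stand-in for a SAM-style decoder: image features + one prompt embedding -> mask logits."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_kernel = nn.Linear(dim, dim)

    def forward(self, image_feats, prompt):       # image_feats: (dim, H, W), prompt: (dim,)
        kernel = self.to_kernel(prompt)            # turn the prompt into a 1x1 conv kernel
        return torch.einsum("dhw,d->hw", image_feats, kernel)   # (H, W) mask logits

dim_llm, dim_dec = 4096, 256
proj = nn.Linear(dim_llm, dim_dec)                 # maps the LLM hidden state to decoder space
decoder = ToyMaskDecoder(dim_dec)

# Pretend the LLM produced a hidden state for its segmentation token while answering
# a reasoning question such as "segment the food with the most vitamin C".
seg_hidden = torch.randn(dim_llm)
image_feats = torch.randn(dim_dec, 64, 64)         # features from the segmentation backbone
mask_logits = decoder(image_feats, proj(seg_hidden))
print(mask_logits.shape)                           # torch.Size([64, 64])
```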

Finally, I want to introduce some video-based, region-level multimodal models. The trend keeps evolving: many people focus on images and then transfer from images to multiple frames, i.e. the video setting. One work is PG-Video-LLaVA, a grounding video LLaVA. Different from traditional image-based multimodal LLMs, it takes multiple frames as input. To extract features from multiple frames, they use a video (or image) encoder and apply spatial and temporal pooling to reduce the number of frame tokens. After the embeddings are generated, they also have entity-matching modules, similar in spirit to grounding detectors like Grounding DINO: they extract entity tokens from the response, take the corresponding features, and try to match them together. Another thing they did is, because video usually comes with audio, they also include aligned audio inputs as well.
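A tiny sketch of the spatial-plus-temporal pooling step described above — the pool sizes and shapes are assumptions: per-frame patch tokens are average-pooled over time and over space so that a long clip fits into the LLM's context.

```python
import torch

def pool_frame_tokens(frame_tokens, spatial_stride=4, temporal_stride=2):
    """frame_tokens: (T, N, D) patch tokens for T frames with N patches each.
    Returns a reduced (T', N', D) sequence via average pooling over time and space."""
    T, N, D = frame_tokens.shape
    # Temporal pooling: average consecutive groups of frames.
    t_keep = T // temporal_stride
    x = frame_tokens[: t_keep * temporal_stride]
    x = x.reshape(t_keep, temporal_stride, N, D).mean(dim=1)        # (T', N, D)
    # Spatial pooling: average consecutive groups of patch tokens.
    n_keep = N // spatial_stride
    x = x[:, : n_keep * spatial_stride]
    x = x.reshape(t_keep, n_keep, spatial_stride, D).mean(dim=2)    # (T', N', D)
    return x

tokens = pool_frame_tokens(torch.randn(16, 576, 1024))
print(tokens.shape)   # torch.Size([8, 144, 1024])
```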

I have introduced a bunch of these models; they work really well and they are really fancy models, but how can they benefit people in the real world? Coming from Apple, we care about new applications for UI understanding: we think these capabilities can enable a more interactive and more intelligent UI experience for users. Different from the many models that work on natural images, this time we focus on screenshots, which can come from Android or from iPhone. We have annotations for them, and we divide the tasks into elementary tasks and advanced tasks.

We have a model called Ferret-UI, which is built on Ferret, and you can see the range of tasks it is able to do. If the user taps on some region, it can generate an output about it, and the output itself has grounding, so Ferret-UI has grounding capabilities as well. If you want to accomplish some task, you need to find the specific button or the specific action to achieve it, and the model is able to do that. For the advanced tasks, you can ask, for example, "if I want to purchase something, what should I do step by step?", and Ferret-UI can accomplish those tasks too.

I think this multimodal-LLM-style understanding is different from the text-only approach, where you simply feed in the meta information from the website or the screenshot. Sometimes that is not ideal, because in, say, shopping apps or podcast apps there is image content, and you need to understand the image content and the buttons together — and that content is not always available through the metadata. Here is another example of how Ferret-UI works on an iPhone screen.

Besides UI understanding, I think another promising trend is to leverage 2D grounding models for 3D, and a lot of works have already started. 3D-LLM, I would say, is a pioneering work: it can do 3D captioning, 3D grounding, and a bunch of other tasks, and it also covers embodied-AI tasks like navigation, vision-and-language navigation, and embodied QA. I think this is a future trend, because ultimately we want to study spatial intelligence, as has been proposed recently; spatial intelligence requires really good spatial understanding, and I think 3D and embodiment can benefit a lot from that.

Finally, here are the conclusions and takeaway messages summarizing what has been covered. Personally, I think referring and grounding capabilities are core tasks for recent multimodal models. We introduced a bunch of different kinds of fine-grained multimodal LLMs: those focused on spatial understanding in 2D, where you can have points, boxes, or masks as inputs or outputs, and also some video temporal grounding, where we can focus on multiple frames or videos. And third, enabling referring and grounding unlocks a lot of new applications, as Ferret-UI has demonstrated and as the 3D grounding models have also demonstrated.

I think there are still plenty of future challenges to work on. For example, if we want to find really, really small objects at very high resolutions, splitting the image into patches and feeding them into the LLM runs into the model's predefined context window: you cannot feed in arbitrarily fine-grained visual features. So high resolution definitely needs to be supported, but it needs to be supported in a cleverer way within the context limits of the large model. Also, as context windows grow and more image tokens get involved, training efficiency becomes very critical. Another thing I haven't listed here, but which I think is really important: so far we call these multimodal large language models because a very large pre-trained language model is the backbone, but as a vision person I think we should always ask how vision can be made truly compatible with the large language model; I don't think the LLM will simply take over everything. In the afternoon talk there will be a very detailed look at how vision helps multimodal LLMs, so stay tuned. And fourth, some tasks require fine-grained understanding, but sometimes they don't: if I just want to ask what year this person was born, there is not a lot to be grounded. So a smarter way of deciding when to trigger the grounding capability is also a very interesting topic to study.

I guess that's everything I want to cover today. Thank you.
