# CogVLM & CogAgent
📗 [Chinese README](./README_zh.md)
🌟 **Jump to detailed introduction: [Introduction to CogVLM](#introduction-to-cogvlm), [Introduction to CogAgent](#introduction-to-cogagent)**
📔 For more detailed usage information, please refer to: [CogVLM & CogAgent's technical documentation (in Chinese)](https://zhipu-ai.feishu.cn/wiki/LXQIwqo1OiIVTykMh9Lc3w1Fn7g)
<table>
<tr>
<td>
<h2> CogVLM </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
<p><b>CogVLM</b> is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, <b>supporting image understanding and multi-turn dialogue with a resolution of 490*490</b>.</p>
<p><b>CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.</p>
</td>
<td>
<h2> CogAgent </h2>
<p> 📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents</a></p>
<p><b>CogAgent</b> is an open-source visual language model built upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, <b>supporting image understanding at a resolution of 1120*1120</b>. <b>On top of the capabilities of CogVLM, it further possesses GUI Agent capabilities</b>.</p>
<p> <b>CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. <b>It significantly surpasses existing models on GUI operation datasets</b> including AITW and Mind2Web.</p>
</td>
</tr>
<tr>
<td colspan="2" align="center">
<p>🌐 Web Demo for CogVLM2: <a href="http://36.103.203.44:7861">this link</a></p>
</td>
</tr>
</table>
**Table of Contents**
- [CogVLM \& CogAgent](#cogvlm--cogagent)
- [Release](#release)
- [Get Started](#get-started)
- [Option 1: Inference Using Web Demo.](#option-1-inference-using-web-demo)
- [Option 2: Deploy CogVLM / CogAgent by yourself](#option-2-deploy-cogvlm--cogagent-by-yourself)
- [Situation 2.1 CLI (SAT version)](#situation-21-cli-sat-version)
- [Situation 2.2 CLI (Huggingface version)](#situation-22-cli-huggingface-version)
- [Situation 2.3 Web Demo](#situation-23-web-demo)
- [Option 3: Finetuning CogAgent / CogVLM](#option-3finetuning-cogagent--cogvlm)
- [Option 4: OpenAI Vision format](#option-4-openai-vision-format)
- [Hardware requirement](#hardware-requirement)
- [Model checkpoints](#model-checkpoints)
- [Introduction to CogVLM](#introduction-to-cogvlm)
- [Examples](#examples)
- [Introduction to CogAgent](#introduction-to-cogagent)
- [GUI Agent Examples](#gui-agent-examples)
- [Cookbook](#cookbook)
- [Task Prompts](#task-prompts)
- [Which --version to use](#which---version-to-use)
- [FAQ](#faq)
- [License](#license)
- [Citation \& Acknowledgements](#citation--acknowledgements)
## Release
- 🔥🔥🔥 **News**: ```2024/5/20```: We released the **next-generation model, [CogVLM2](https://github.com/THUDM/CogVLM2)**, which is based on Llama3-8B and is on par with (or better than) GPT-4V in most cases! Download it and give it a try!
- 🔥🔥 **News**: ```2024/4/5```: [CogAgent](https://arxiv.org/abs/2312.08914) was selected as a CVPR 2024 Highlight!
- 🔥 **News**: ```2023/12/26```: We have released the [CogVLM-SFT-311K](dataset.md) dataset,
which contains over 150,000 samples used for training **CogVLM v1.0 only**. You are welcome to follow and use it.
- **News**: ```2023/12/18```: **New Web UI Launched!** We have launched a new web UI based on Streamlit,
where users can painlessly talk to CogVLM and CogAgent for a better user experience.
- **News**: ```2023/12/15```: **CogAgent Officially Launched!** CogAgent is an image understanding model developed
based on CogVLM. It features **visual-based GUI Agent capabilities** and has further enhancements in image
understanding. It supports image input with a resolution of 1120*1120, and possesses multiple abilities including
multi-turn dialogue with images, GUI Agent, Grounding, and more.
- **News**: ```2023/12/8``` We have updated the checkpoint of cogvlm-grounding-generalist to
cogvlm-grounding-generalist-v1.1, which was trained with image augmentation and is therefore more robust.
See [details](#introduction-to-cogvlm).
- **News**: ```2023/12/7``` CogVLM now supports **4-bit quantization**! You can run inference with just **11GB** of GPU memory!
- **News**: ```2023/11/20``` We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the versions of
chat and VQA, and refreshed the SOTA on various datasets. See [details](#introduction-to-cogvlm)
- **News**: ```2023/11/20``` We released **[cogvlm-chat](https://huggingface.co/THUDM/cogvlm-chat-hf)**, **[cogvlm-grounding-generalist](https://huggingface.co/THUDM/cogvlm-grounding-generalist-hf)/[base](https://huggingface.co/THUDM/cogvlm-grounding-base-hf)**, **[cogvlm-base-490](https://huggingface.co/THUDM/cogvlm-base-490-hf)/[224](https://huggingface.co/THUDM/cogvlm-base-224-hf)** on 🤗 Hugging Face. You can now run inference with transformers in [a few lines of code](#situation-22-cli-huggingface-version)!
- ```2023/10/27``` The CogVLM bilingual version is available [online](https://chatglm.cn/)! Welcome to try it out!
- ```2023/10/5``` CogVLM-17B released.
## Get Started
### Option 1: Inference Using Web Demo.
* Click here to enter the [CogVLM2 Demo](http://36.103.203.44:7861/).
If you need to use the Agent and Grounding functions, please refer to [Cookbook - Task Prompts](#task-prompts).
### Option 2: Deploy CogVLM / CogAgent by yourself
We support two GUIs for model inference, **CLI** and **web demo**. If you want to use the model in your own Python code, it is
easy to modify the CLI scripts for your use case.
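If you would rather call the model directly from `transformers` instead of going through the demo scripts, the Hugging Face checkpoints linked in the Release section can be loaded as shown below. This is only a minimal sketch adapted from our reading of the `THUDM/cogvlm-chat-hf` model card: the image path and query are placeholders, and helpers such as `build_conversation_input_ids` come from the model's remote code, so please verify the exact usage against the model card (see Situation 2.2).
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

# CogVLM reuses the Vicuna tokenizer; the model weights ship custom modeling code,
# so trust_remote_code=True is required.
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to('cuda').eval()

# 'example.jpg' is a placeholder path; replace it with your own image.
image = Image.open('example.jpg').convert('RGB')
query = 'Describe this image.'

# build_conversation_input_ids is a helper defined in the model's remote code (chat mode).
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]  # strip the prompt tokens
    print(tokenizer.decode(outputs[0]))
```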
First, we need to install the dependencies.
```bash
# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
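As a quick optional sanity check (not part of the repository) that your environment meets the CUDA requirement, you can confirm that PyTorch was built with CUDA and can see a GPU:
```python
import torch

# Prints the installed torch version, the CUDA version it was compiled against,
# and whether a CUDA-capable GPU is visible to this process.
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
```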
**All code for inference is located under the ``basic_demo/`` directory. Please switch to this directory first before
proceeding with further operations.**
#### Situation 2.1 CLI (SAT version)
Run CLI demo via:
```bash
# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat
# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-grounding-generalist --version base --bf16 --stream_chat
```
The program will automatically download the SAT model and start an interactive session in the command line. You can generate replies by
entering instructions and pressing Enter.
Enter `clear` to clear the conversation history and `stop` to stop the program.
We also support model-parallel inference, which splits the model across multiple (2/4/8) GPUs. `--nproc-per-node=[n]` in the
following command controls the number of GPUs used.
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16
```
- If you want to manually download the weights, you can replace the path after ``--from_pretrained`` with the model
path.
- Our model supports SAT's **4-bit quantization** and **8-bit quantization**.
You can change ``--bf16`` to ``--fp16``, or ``--fp16 --quant 4``, or ``--fp16 --quant 8``.
For example
```bash
python cli_demo_sat.py --from_pretrained cogagent-chat --fp16 --quant 8 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-chat-v1.1 --fp16 --quant 4 --stream_chat
```