Deploying DeepSeek's Multimodal Model Janus-Pro Locally

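1. Overview

Janus-Pro is the latest open-source multimodal model from DeepSeek. It is a novel autoregressive framework that unifies multimodal understanding and generation. It decouples visual encoding into separate pathways while still processing everything with a single, unified transformer architecture, which addresses the limitations of earlier approaches. The decoupling eases the conflict between the visual encoder's roles in understanding and in generation, and it also makes the framework more flexible. Janus-Pro outperforms previous unified models and matches or exceeds the performance of task-specific models.

Code: https://github.com/deepseek-ai/Janus

Models: https://modelscope.cn/collections/Janus-Pro-0f5e48f6b96047

Demo page: https://modelscope.cn/studios/AI-ModelScope/Janus-Pro-7B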

2. Virtual environment

Environment

This article uses an Ubuntu system running under WSL2 for the demonstration. Reference: https://www.cnblogs.com/xiao987334176/p/18864140

Create the virtual environment:

conda create --name vll-Janus-Pro-7B python=3.12.7

Activate the virtual environment:

conda activate vll-Janus-Pro-7B

Check the CUDA version:

# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:20:09_PST_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0
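
The release number that nvcc reports decides which PyTorch wheel index is used during installation. As a minimal sketch, the helper below extracts the release from `nvcc -V` output and derives the matching tag. The function names `parse_cuda_release` and `wheel_tag` are hypothetical, and the `cuXYZ` naming is an assumption inferred from the cu128 index URL used in this article.

```python
import re


def parse_cuda_release(nvcc_output: str) -> str:
    """Return the CUDA release (for example "12.8") found in `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return f"{match.group(1)}.{match.group(2)}"


def wheel_tag(release: str) -> str:
    """Map a release such as "12.8" to a tag such as "cu128" (assumed naming)."""
    major, minor = release.split(".")
    return f"cu{major}{minor}"


# One line copied from the nvcc output shown above.
sample = "Cuda compilation tools, release 12.8, V12.8.61"
release = parse_cuda_release(sample)
print(release, wheel_tag(release))
```

Running `nvcc -V` through `subprocess.run` and passing its stdout to `parse_cuda_release` would turn this into a real check.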

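3. Installing Janus-Pro

Hardware requirements

GPU memory: at least 8 GB, 16 GB recommended

RAM: at least 16 GB, 32 GB recommended

CPU: at least 4 cores, 8 cores recommended

Disk: at least 50 GB, 500 GB recommended

My machine meets all of these:

GPU: RTX 5080 with 16 GB of memory

RAM: 64 GB DDR5

CPU: 24 cores, 32 threads

Disk: 2 TB SSD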
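
The CPU and disk minimums above can be checked with the standard library alone. This is a rough sketch, not an official checker: the function name `check_minimums` is hypothetical, and GPU memory and RAM are deliberately left out because querying them needs extra libraries.

```python
import os
import shutil

# Minimums taken from the requirements list in this article.
MIN_CPU_CORES = 4
MIN_DISK_GB = 50


def check_minimums(cpu_cores: int, free_disk_gb: float) -> list[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU: {cpu_cores} cores, need at least {MIN_CPU_CORES}")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {free_disk_gb:.0f} GB free, need at least {MIN_DISK_GB}")
    return problems


if __name__ == "__main__":
    cores = os.cpu_count() or 0
    free_gb = shutil.disk_usage(".").free / 1024**3
    print(check_minimums(cores, free_gb) or "CPU and disk look sufficient")
```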

Installation

Create the project directory:

mkdir vllm
cd vllm

Clone the code:

git clone https://github.com/deepseek-ai/Janus

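Install the dependencies. Note that PyTorch has to be installed manually here, from the wheel index that matches the CUDA version (cu128 for CUDA 12.8):

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Install the other dependencies: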

pip3 install transformers attrdict einops timm
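
Before moving on, it can help to confirm that the packages are visible to the active conda environment. The sketch below only checks whether each module can be found, not whether its version is compatible. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names of the packages installed in the two pip commands above.
REQUIRED = ["torch", "torchvision", "transformers", "attrdict", "einops", "timm"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_packages(REQUIRED))
```

An empty list means every package was found. Anything listed should be reinstalled with pip inside the activated environment.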

Downloading the model

The model can be downloaded with modelscope. Install modelscope, then run the download command:

pip install modelscope

modelscope download --model deepseek-ai/Janus-Pro-7B

The output looks like this:

# modelscope download --model deepseek-ai/Janus-Pro-7B
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/deepseek-ai/Janus-Pro-7B
Downloading [config.json]: 100%|███████████████████████████████████████████████████| 1.42k/1.42k [00:00<00:00, 5.29kB/s]
Downloading [configuration.json]: 100%|████████████████████████████████████████████████| 68.0/68.0 [00:00<00:00, 221B/s]
Downloading [README.md]: 100%|█████████████████████████████████████████████████████| 2.49k/2.49k [00:00<00:00, 7.20kB/s]
Downloading [processor_config.json]: 100%|███████████████████████████████████████████████| 210/210 [00:00<00:00, 590B/s]
Downloading [janus_pro_teaser1.png]: 100%|██████████████████████████████████████████| 95.7k/95.7k [00:00<00:00, 267kB/s]
Downloading [preprocessor_config.json]: 100%|████████████████████████████████████████████| 346/346 [00:00<00:00, 867B/s]
Downloading [janus_pro_teaser2.png]: 100%|███████████████████████████████████████████| 518k/518k [00:00<00:00, 1.18MB/s]
Downloading [special_tokens_map.json]: 100%|███████████████████████████████████████████| 344/344 [00:00<00:00, 1.50kB/s]
Downloading [tokenizer_config.json]: 100%|███████████████████████████████████████████████| 285/285 [00:00<00:00, 926B/s]
Downloading [pytorch_model.bin]:   0%|▏                                            | 16.0M/3.89G [00:00<03:55, 17.7MB/s]
Downloading [tokenizer.json]: 100%|████████████████████████████████████████████████| 4.50M/4.50M [00:00<00:00, 6.55MB/s]
Processing 11 items:  91%|█████████████████████████████████████████████████████▋     | 10.0/11.0 [00:19<00:00, 14.1it/s]
Downloading [pytorch_model.bin]: 100%|█████████████████████████████████████████████| 3.89G/3.89G [09:18<00:00, 7.48MB/s]
Processing 11 items: 100%|███████████████████████████████████████████████████████████| 11.0/11.0 [09:24<00:00, 51.3s/it]

As the log shows, the model is downloaded to /root/.cache/modelscope/hub/models/deepseek-ai/Janus-Pro-7B

Move the downloaded model into the vllm directory:

mv /root/.cache/modelscope/hub/models/deepseek-ai /home/xiao/vllm
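
After the move, a quick check that the configuration and tokenizer files arrived intact can save a confusing load error later. This is a sketch under assumptions: the file names are copied from the download log above, the weight files are not checked, and `missing_model_files` is a hypothetical helper name.

```python
from pathlib import Path

# Small files listed in the download log; weight files are intentionally omitted.
EXPECTED_FILES = [
    "config.json",
    "preprocessor_config.json",
    "processor_config.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Path used in this article; adjust it to your own layout.
print(missing_model_files("/home/xiao/vllm/deepseek-ai/Janus-Pro-7B"))
```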

4. Testing image understanding

The vllm directory now contains two folders:

# ll
total 20
drwxr-xr-x 4 root root 4096 May  8 18:59 ./
drwxr-x--- 5 xiao xiao 4096 May  8 14:50 ../
drwxr-xr-x 8 root root 4096 May  8 18:59 Janus/
drwxr-xr-x 4 root root 4096 May  8 16:01 deepseek-ai/

Inside the deepseek-ai directory there is a folder named Janus-Pro-7B.

This is the model we just downloaded; the Python code below loads it.

# ll
total 16
drwxr-xr-x 4 root root 4096 May  8 16:01 ./
drwxr-xr-x 4 root root 4096 May  8 18:59 ../
drwxr-xr-x 2 root root 4096 May  7 18:32 Janus-Pro-7B/

Go back up one level and, in the Janus directory, create image_understanding.py with the following code:

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "../deepseek-ai/Janus-Pro-7B"
image = 'aa.jpeg'
question = '请说明一下这张图片'

vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Note: adjust only model_path and image to match your setup; nothing else needs to change.
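
If these values change often, they could be read from the command line instead of being edited in the file. The following is only a sketch of that idea: the flag names `--model-path`, `--image`, and `--question` are hypothetical and not part of the Janus repository, and the defaults are the values used in this article.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three values that the script otherwise hard-codes."""
    parser = argparse.ArgumentParser(description="Janus-Pro image understanding")
    parser.add_argument("--model-path", default="../deepseek-ai/Janus-Pro-7B")
    parser.add_argument("--image", default="aa.jpeg")
    parser.add_argument("--question", default="请说明一下这张图片")
    return parser.parse_args(argv)


args = parse_args([])  # pass None instead of [] to read the real command line
print(args.model_path, args.image, args.question)
```

The script would then use `args.model_path`, `args.image`, and `args.question` in place of the three assignments at the top.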

Download a test image from: https://pics6.baidu.com/feed/09fa513d269759ee74c8d049640fcc1b6f22df9e.jpeg

Rename it to aa.jpeg and place it in the Janus directory.

The Janus directory now looks like this:

# ll
total 2976
drwxr-xr-x 8 root root    4096 May  8 18:59 ./
drwxr-xr-x 4 root root    4096 May  8 18:59 ../
drwxr-xr-x 8 root root    4096 May  7 18:11 .git/
-rw-r--r-- 1 root root     115 May  7 18:11 .gitattributes
-rw-r--r-- 1 root root    7301 May  7 18:11 .gitignore
-rw-r--r-- 1 root root    1065 May  7 18:11 LICENSE-CODE
-rw-r--r-- 1 root root   13718 May  7 18:11 LICENSE-MODEL
-rw-r--r-- 1 root root    3069 May  7 18:11 Makefile
-rwxr-xr-x 1 root root   26781 May  7 18:11 README.md*
-rw-r--r-- 1 root root   62816 May  8 14:59 aa.jpeg
drwxr-xr-x 2 root root    4096 May  7 18:11 demo/
drwxr-xr-x 2 root root    4096 May  8 17:19 generated_samples/
-rw-r--r-- 1 root root    4515 May  7 18:11 generation_inference.py
-rw-r--r-- 1 xiao xiao    4066 May  8 18:50 image_generation.py
-rw-r--r-- 1 root root    1594 May  8 18:58 image_understanding.py
drwxr-xr-x 2 root root    4096 May  7 18:11 images/
-rw-r--r-- 1 root root    2642 May  7 18:11 inference.py
-rw-r--r-- 1 root root    5188 May  7 18:11 interactivechat.py
drwxr-xr-x 6 root root    4096 May  7 19:01 janus/
drwxr-xr-x 2 root root    4096 May  7 18:11 janus.egg-info/
-rw-r--r-- 1 root root 2846268 May  7 18:11 janus_pro_tech_report.pdf
-rw-r--r-- 1 root root    1111 May  7 18:11 pyproject.toml
-rw-r--r-- 1 root root     278 May  7 18:11 requirements.txt

Run the script. The output looks like this:

# python image_understanding.py
Python version is above 3.10, patching the collections module.
/root/anaconda3/envs/vll-Janus-Pro-7B/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:604: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:10<00:00, 5.18s/it]
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

<|User|>: <image_placeholder>
请说明一下这张图片

<|Assistant|>: 这张图片展示了一位身穿传统服饰的女性，她正坐在户外，双手合十，闭着眼睛，似乎在进行冥想或祈祷。背景是绿色的树木和植物，阳光透过树叶洒在她的身上，营造出一种宁静、祥和的氛围。她的服装以淡雅的白色和粉色为主，带有精致的花纹，整体风格非常优雅。

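The model's answer describes a woman in traditional dress sitting outdoors with her palms pressed together and her eyes closed, against a background of green trees and sunlight. That description is fairly accurate.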

5. Testing image generation

In the Janus directory, create a new script named image_generation.py with the following code:

import os
import torch
import numpy as np
from PIL import Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "../deepseek-ai/Janus-Pro-7B"
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {"role": "<|User|>", "content": "超写实8K渲染，一位具有东方古典美的中国女性，瓜子脸，细长的眉毛如弯弯的月牙，双眼明亮而深邃，犹如夜空中闪烁的星星。高挺的鼻梁，樱桃小嘴微微上扬，透露出一丝诱人的微笑。她的头发如黑色的瀑布般垂直落在肩膀两侧，微风轻轻拂动发丝。肌肤白皙如雪，在阳光下泛着微微的光泽。她身着一袭白色的透薄如纱的连衣裙，裙摆在海风中轻轻飘动。"},
    {"role": "<|Assistant|>", "content": ""},
]

sft_format = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt=""
)
prompt = sft_format + vl_chat_processor.image_start_tag

@torch.inference_mode()
def generate(
        mmgpt: MultiModalityCausalLM,
        vl_chat_processor: VLChatProcessor,
        prompt: str,
        temperature: float = 1,
        parallel_size: int = 1,  # reduced from the demo default to save GPU memory
        cfg_weight: float = 5,
        image_token_num_per_image: int = 576,
        img_size: int = 384,
        patch_size: int = 16,
):
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids)

    tokens = torch.zeros((parallel_size * 2, len(input_ids)), dtype=torch.int).cuda()
    for i in range(parallel_size * 2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id

    inputs_embeds = mmgpt.language_model.get_input_embeddings()(tokens)

    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int).cuda()

    for i in range(image_token_num_per_image):
        outputs = mmgpt.language_model.model(inputs_embeds=inputs_embeds, use_cache=True,
                                             past_key_values=outputs.past_key_values if i != 0 else None)
        hidden_states = outputs.last_hidden_state

        logits = mmgpt.gen_head(hidden_states[:, -1, :])
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]

        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)
        probs = torch.softmax(logits / temperature, dim=-1)

        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)
        next_token = torch.cat([next_token.unsqueeze(dim=1),
                                next_token.unsqueeze(dim=1)], dim=1).view(-1)
        img_embeds = mmgpt.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(dim=1)
        # release cached GPU memory after every step
        del logits, logit_cond, logit_uncond, probs
        torch.cuda.empty_cache()

    dec = mmgpt.gen_vision_model.decode_code(generated_tokens.to(dtype=torch.int),
                                             shape=[parallel_size, 8, img_size // patch_size, img_size // patch_size])
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)

    dec = np.clip((dec + 1) / 2 * 255, 0, 255)

    visual_img = np.zeros((parallel_size, img_size, img_size, 3), dtype=np.uint8)
    visual_img[:, :, :] = dec

    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        save_path = os.path.join('generated_samples', f"img_{i}.jpg")
        img = Image.fromarray(visual_img[i])
        img.save(save_path)

generate(
    vl_gpt,
    vl_chat_processor,
    prompt,
)

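Note: adjust model_path, conversation, and parallel_size to match your setup.

The prompt can be written in Chinese; it does not have to be English.

This code is tuned relative to the default demo: parallel_size is reduced to 1 and cached GPU memory is released after every step. Without these changes, the run froze my NVIDIA RTX 5080.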
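
The core of the sampling loop in generate() is the line that mixes conditional and unconditional logits using cfg_weight. The sketch below reproduces just that arithmetic with plain Python lists and made-up toy logits, so it runs without a GPU. The helper names `cfg_logits` and `softmax` are hypothetical, and the numbers are illustrative only.

```python
import math


def cfg_logits(cond, uncond, cfg_weight):
    """Same formula as in generate(): uncond + cfg_weight * (cond - uncond)."""
    return [u + cfg_weight * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 1.0, 0.0]    # toy logits for the row that carries the real prompt
uncond = [1.0, 1.0, 1.0]  # toy logits for the row whose prompt was padded out
guided = cfg_logits(cond, uncond, cfg_weight=5)
print(guided)
print(softmax(guided))
```

With cfg_weight set to 1 the result equals the conditional logits. Larger values push the distribution further toward tokens that the prompt favors.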

Run the script. The output looks like this:

# python image_generation.py
Python version is above 3.10, patching the collections module.
/root/anaconda3/envs/vll-Janus-Pro-7B/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:604: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 1/1 [00:09<00:00, 4.58s/it]

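Keep an eye on GPU usage during the run; it gets very high.

On the RTX 5080, the 16 GB of GPU memory is almost completely full, and system RAM usage is around 30 GB.

There is no need to worry: this lasts for about 30 seconds, after which GPU utilization drops abruptly to around 3%.

After roughly 30 seconds, one image has been generated.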
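
GPU load can also be sampled from a second terminal with nvidia-smi's CSV query mode. The sketch below wraps that call and parses one line of its output. The helper names are hypothetical, and the sample line is made up so that the parsing part runs on a machine without a GPU.

```python
import subprocess

# nvidia-smi query that prints "utilization, used MiB, total MiB" per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as '97, 15800, 16303' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "used_mib": used, "total_mib": total}


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi once and return one dict per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Made-up sample line, not a measurement from this article.
print(parse_gpu_line("97, 15800, 16303"))
```

Calling `read_gpu_stats()` in a loop with a short sleep would show the spike and the later drop in utilization described above.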

In Windows File Explorer, open the Linux entry (the penguin icon)

and go to the directory \home\xiao\vllm\Janus\generated_samples

An image file appears there.
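
The same folder can be reached by typing a network-style path into the Explorer address bar. The sketch below builds such a path from the Linux path. It rests on two assumptions: the distribution is registered under the name "Ubuntu", and the `\\wsl.localhost\` prefix is available (older Windows builds use `\\wsl$\` instead). The helper name `wsl_unc_path` is hypothetical.

```python
from pathlib import PurePosixPath


def wsl_unc_path(linux_path: str, distro: str = "Ubuntu") -> str:
    """Build a Windows UNC path for an absolute path inside a WSL distribution."""
    parts = PurePosixPath(linux_path).parts[1:]  # drop the leading "/"
    return "\\\\wsl.localhost\\" + distro + "\\" + "\\".join(parts)


print(wsl_unc_path("/home/xiao/vllm/Janus/generated_samples"))
```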

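Open the image to see the result:

The result is acceptable, although it is still some way from true 8K quality.

Make the prompt as detailed as possible so that the generated image matches what you want.

If you are not sure how to write a prompt, you can ask DeepSeek to write one for you.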

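6. Q&A

Why deploy locally?

As you have seen, the generated image is not ideal, but the image recognition ability is quite good.

Ordinary users generally just need to pay for tokens on services such as 即梦, AI 绘蛙, or Pic Copilot.

For a technical person, though, life is all about tinkering.

After tinkering for a while and getting a result, you feel very satisfied.

Can it only generate 384x384 images?

I tried increasing the resolution to 800x600, but the image looked wrong and was visibly stretched.

The official demo only uses 384x384. Why? I don't know. Perhaps 7B parameters are too few to generate higher resolutions.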
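
One possible reading of the script, offered as a guess rather than a confirmed explanation: the defaults img_size=384, patch_size=16, and image_token_num_per_image=576 fit together as a 24x24 grid of image tokens, and the decode step is given exactly that square grid. The arithmetic below shows how the numbers relate and what grid an 800x600 image would need. The helper names are hypothetical.

```python
import math


def patch_grid(width: int, height: int, patch_size: int = 16) -> tuple[int, int]:
    """Patches per row and per column for an image of the given size."""
    return width // patch_size, height // patch_size


def square_side(token_count: int) -> int:
    """Side length of the square grid that holds token_count tokens."""
    side = math.isqrt(token_count)
    if side * side != token_count:
        raise ValueError(f"{token_count} tokens do not form a square grid")
    return side


cols, rows = patch_grid(384, 384)
print(cols * rows, square_side(576))  # token count for 384x384, grid side for 576 tokens

cols, rows = patch_grid(800, 600)
print(cols, rows, cols * rows)        # grid and token count an 800x600 image would need
```

Under this reading, changing only the output size while the model still produces 576 tokens on a square grid would resize a square image into a non-square frame, which would match the stretching described above.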

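Can it only generate one image at a time?

The official demo generates 16 images per run, but my RTX 5080 cannot handle that at all. Generating 3 images does work; change the parameter to:

parallel_size: int = 3

The run then takes much longer, about 3 minutes.

When it finishes, there is a chance that in one of the 3 images the person's face is distorted and looks odd. Perhaps the 7B model is not good at generating people, or 7B parameters are simply too few.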
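
Part of the cost of a larger parallel_size is visible in how generate() builds its input: every image gets two rows, one with the real prompt and one with the prompt padded out for guidance. The sketch below mirrors that layout with plain lists. The token IDs are made up, and `build_cfg_batch` is a hypothetical helper, not a function from the repository.

```python
def build_cfg_batch(input_ids, parallel_size, pad_id):
    """Return 2 * parallel_size rows: even rows conditional, odd rows padded."""
    rows = []
    for i in range(parallel_size * 2):
        row = list(input_ids)
        if i % 2 != 0:
            # Unconditional row: keep the first and last token, pad the rest.
            row[1:-1] = [pad_id] * (len(row) - 2)
        rows.append(row)
    return rows


# Made-up token IDs standing in for an encoded prompt.
batch = build_cfg_batch([100, 11, 12, 13, 200], parallel_size=3, pad_id=0)
for row in batch:
    print(row)
```

With parallel_size=3 the model therefore processes 6 rows at each of the 576 sampling steps, and the demo default of 16 means 32 rows.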

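7. Hardware buying advice

National subsidy

The national subsidy for buying a computer is up to 2,000 yuan.

I want to go over this in detail, because I ran into quite a few pitfalls while claiming it.

Duration

The national subsidy for computer purchases started in 2024 and continues in 2025. Under the current policy it runs until December 31, 2025, although regions where subsidy funds are tight may end it early.

How to claim

On e-commerce platforms such as Pinduoduo, JD.com, and Taobao, choose a computer that is marked as eligible for the national subsidy.

Note: not every computer is eligible; only designated models are. Check the purchase page for the exact models.

An important point: each ID card number can claim the subsidy only once.

What does that mean? If you claimed the subsidy on Pinduoduo, you cannot also use it on JD.com. To use it on JD.com, you first have to return the one claimed on Pinduoduo.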

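Regional restrictions

This matters a lot: the national subsidy is region-specific. Not every region offers it, though basically all first-tier cities do.

If you claimed the Zhejiang subsidy but your shipping address is in Shanghai, you cannot use the subsidy.

In other words, claim the subsidy for the region where your shipping address is.

What if you claimed it for the wrong region? That is easy to fix: return it.

Say you claimed the Zhejiang subsidy but your shipping address is in Shanghai. Enter any Zhejiang shipping address, then return the subsidy on the national subsidy claim page.

Finally, select the Shanghai region and claim the subsidy again.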

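How to use it

You must pay with 云闪付 (the UnionPay mobile payment app); WeChat Pay and Alipay are not supported.

After claiming the subsidy, buy the product and select 云闪付 as the payment method. I mostly use Alipay and WeChat Pay and had never used 云闪付, but it is simple: download the 云闪付 app and bind a bank card.

At checkout, you will see 2,000 yuan taken off the price.

Note: once the subsidy is used, the eligibility is gone, because each ID card can use it only once.

Returns

Suppose the computer you bought has a quality problem and you return it. The eligibility has already been used, so can you still get the subsidy? Yes: after the seller refunds you, wait until the next day and you can claim the subsidy again.

Delivery

When the national subsidy is used, the courier has to photograph the computer. The packaging is opened and the SN code on the back of the computer is photographed; that is all.

The computer does not need to be powered on for a demonstration.

Finally, if you can afford it, buy soon. The 2,000 yuan subsidy may be gone by the end of the year, or it may end even earlier.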

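GPU buying advice

As a technical person, watching so many bloggers deploy large AI models locally made me envious. Doesn't it make you itch to try it yourself?

Of course it does. I spent more than four months preparing to buy a gaming laptop, regularly reading articles on Toutiao to see what other people bought and which options were good value.

For the GPU, I think the RTX 5080 is enough; after all, it has 16 GB of memory. There is also the higher-end RTX 5090 with 32 GB, but the card alone sells for 25,000 to 32,000 yuan.

The RTX 5090 is too expensive for me. The RTX 5080 sells for 8,299 to 11,999 yuan, which is decent value.

The RTX 4060 has only 8 GB of memory, which is too little to be worth buying.

The RTX 4070 has 12 GB, which is worth considering; my roommate bought one.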

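Memory buying advice

I recommend going straight to 64 GB. Gaming laptop motherboards generally support at most 64 GB, so just max it out, because running large models really does consume a lot of memory.

I thought 64 GB was the limit, but yesterday I saw a gaming laptop whose motherboard supports up to 192 GB across 4 slots. Very impressive.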

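Disk buying advice

Buy SSDs, not mechanical hard drives; those are too slow.

For capacity, get at least 1 TB, preferably 2 TB. Many bloggers who play AAA games go for 2 TB, so I upgraded the configuration to 2 TB as well, since the motherboard has 2 drive slots.

A 1 TB SSD is not expensive these days; a little over 600 yuan is enough.

Of my 2 TB, one drive is C: and the other is D:. D: is already 50% full, mostly with virtual machines, which really do eat disk space.

One last thought: human nature is strange. Asked to pay for tokens, you turn stingy and always want to use things for free.

Asked to buy a GPU, you happily spend 10,000 yuan.

After that, you can put the GPU to work 24 hours a day, whether generating images or playing AAA games.

You also don't have to worry about outages and "server busy" messages like the ones DeepSeek had a while ago.

The GPU is yours, to use however you like. My world, my rules.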

Original author: xiao987334176. Reposted from: https://www.cnblogs.com/xiao987334176/p/18864903