Serve a DeepSeek-V3 model using multi-host GPU deployment

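Overview

Vertex AI Prediction supports multi-host GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve a DeepSeek-V3 model using multi-host graphics processing units (GPUs) on Vertex AI Prediction with vLLM. Setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.

Before you begin, make sure that you're familiar with the following:

Use the Pricing Calculator to generate a cost estimate based on your projected usage.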

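Containers

To support multi-host deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray enables the distributed processing required to run models across multiple GPU nodes. This container also supports serving streaming requests using the Chat Completions API.

If needed, you can create your own vLLM multi-node image. Note that this custom container image must be compatible with Vertex AI Prediction.

Before you begin

Before you begin your model deployment, complete the prerequisites listed in this section.

Set up a Google Cloud project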

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Enable the API

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

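Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. You will likely need to request an H100 GPU quota increase, because the default value is less than 16.

To view the H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console.

Go to Quotas & System Limits
Request a quota adjustment.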
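
The quota arithmetic above can be sketched as a small shell check. This is only an illustration: the variable names are made up for this example, and CURRENT_QUOTA is a placeholder that you would replace with the value shown on the quota page.

```shell
# Values taken from the text above; the variable names are illustrative.
NODE_COUNT=2        # number of a3-highgpu-8g VMs
GPUS_PER_NODE=8     # H100 GPUs per VM

# Assumption: replace with the H100 quota shown for your region on the quota page.
CURRENT_QUOTA=8

REQUIRED_GPUS=$((NODE_COUNT * GPUS_PER_NODE))
SHORTFALL=$((REQUIRED_GPUS - CURRENT_QUOTA))

echo "Required H100 GPUs: ${REQUIRED_GPUS}"
if [ "${SHORTFALL}" -gt 0 ]; then
  echo "Quota is short by ${SHORTFALL} GPUs; request an increase to at least ${REQUIRED_GPUS}."
else
  echo "Current quota covers this deployment."
fi
```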

Upload the model

To upload your model as a Model resource to Vertex AI Prediction, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250130_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=8080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=16;--pipeline-parallel-size=1;--gpu-memory-utilization=0.9;--trust-remote-code;--max-model-len=32768' \
    --container-deployment-timeout-seconds=4500 \
    --container-ports=8080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3

Make the following replacements:

LOCATION: the region where you are using Vertex AI
PROJECT_ID: the ID of your Google Cloud project
MODEL_DISPLAY_NAME: the display name that you want for your model
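
The --container-args value in the command above starts with ^;^, which is the gcloud alternate-delimiter syntax (described in gcloud topic escaping): it makes ; the list separator instead of the default comma. As a rough sketch, the same string can be assembled from one argument per line, which may be easier to review and edit. The variable name CONTAINER_ARGS is an assumption made for this example.

```shell
# One container argument per positional parameter (values copied from the command above).
set -- \
  /vllm-workspace/ray_launcher.sh \
  python -m vllm.entrypoints.api_server \
  --host=0.0.0.0 \
  --port=8080 \
  --model=deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size=16 \
  --pipeline-parallel-size=1 \
  --gpu-memory-utilization=0.9 \
  --trust-remote-code \
  --max-model-len=32768

# "$*" joins the positional parameters with the first character of IFS.
OLD_IFS=$IFS
IFS=';'
CONTAINER_ARGS="^;^$*"
IFS=$OLD_IFS

echo "${CONTAINER_ARGS}"
```

The resulting value could then be passed as --container-args="${CONTAINER_ARGS}" in place of the single-quoted literal.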

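Create a dedicated online prediction endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are in preview and don't support the Google Cloud CLI, so you need to use the REST API to create the endpoint.

To create the dedicated endpoint, run the following command: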

PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
  -d '{
    "displayName": "ENDPOINT_DISPLAY_NAME",
    "dedicatedEndpointEnabled": true
    }'

Make the following replacements:

ENDPOINT_DISPLAY_NAME: the display name for your endpoint
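
Because the request body in the command above is single-quoted, the shell does not expand variables inside it. If you prefer to keep the display name in a variable, one possible approach is to build the body with a here-document. The variable names and the sample display name below are assumptions for illustration only.

```shell
# Hypothetical display name; replace with your own value.
ENDPOINT_DISPLAY_NAME="deepseek-v3-endpoint"

# An unquoted here-document delimiter lets the shell expand ${ENDPOINT_DISPLAY_NAME}.
REQUEST_BODY=$(cat <<EOF
{
  "displayName": "${ENDPOINT_DISPLAY_NAME}",
  "dedicatedEndpointEnabled": true
}
EOF
)

echo "${REQUEST_BODY}"
```

The body could then be sent with -d "${REQUEST_BODY}" in the curl command shown earlier.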

Deploy the model

Get the endpoint ID for the online prediction endpoint by running the gcloud ai endpoints list command:

ENDPOINT_ID=$(gcloud ai endpoints list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
 --format="value(name)")

Get the model ID for your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'MODEL_DISPLAY_NAME' \
 --format="value(name)")
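
The ~ operator in the filters above is a regular-expression match, so a display name that is a substring of another name can return more than one result. In addition, the name field may be printed as a full resource path rather than a bare ID, depending on the resource and gcloud version. The following sketch shows one way to guard against both cases. The helper names and the sample value are hypothetical and are not part of gcloud.

```shell
# to_resource_id NAME: print the last path segment of NAME (a bare ID is returned unchanged).
to_resource_id() {
  printf '%s\n' "${1##*/}"
}

# has_single_match OUTPUT: succeed only if OUTPUT holds exactly one non-empty line.
has_single_match() {
  match_count=$(printf '%s\n' "$1" | grep -c .)
  [ "${match_count}" -eq 1 ]
}

# Hypothetical sample standing in for captured gcloud output.
SAMPLE_NAME="projects/123456789/locations/us-central1/endpoints/9876543210"

if has_single_match "${SAMPLE_NAME}"; then
  SAMPLE_ID=$(to_resource_id "${SAMPLE_NAME}")
  echo "Endpoint ID: ${SAMPLE_ID}"
else
  echo "Expected exactly one match; narrow the --filter expression." >&2
fi
```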

Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:
```
gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
 --project=PROJECT_ID \
 --region=LOCATION \
 --model=$MODEL_ID \
 --display-name="DEPLOYED_MODEL_NAME" \
 --machine-type=a3-highgpu-8g \
 --traffic-split=0=100 \
 --accelerator=type=nvidia-h100-80gb,count=8 \
 --multihost-gpu-node-count=2
```
Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).
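
The GPU count that vLLM expects (tensor-parallel size multiplied by pipeline-parallel size) should line up with the GPUs that the deployment provisions (node count multiplied by accelerators per node). The following sketch checks this for the values used in this guide. The variable names are illustrative, and the check is a sanity test written for this example, not something that gcloud performs for you.

```shell
# Values from the container arguments in the upload command.
TENSOR_PARALLEL_SIZE=16
PIPELINE_PARALLEL_SIZE=1

# Values from the deploy-model command.
MULTIHOST_GPU_NODE_COUNT=2
ACCELERATOR_COUNT=8

GPUS_EXPECTED_BY_VLLM=$((TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE))
GPUS_PROVISIONED=$((MULTIHOST_GPU_NODE_COUNT * ACCELERATOR_COUNT))

if [ "${GPUS_EXPECTED_BY_VLLM}" -eq "${GPUS_PROVISIONED}" ]; then
  CHECK_RESULT="match"
else
  CHECK_RESULT="mismatch"
fi

echo "vLLM expects ${GPUS_EXPECTED_BY_VLLM} GPUs; deployment provisions ${GPUS_PROVISIONED} (${CHECK_RESULT})"
```

If you adapt this guide to another model or machine type, updating both sets of values together helps avoid a deployment that starts but cannot place all workers.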

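Deploying large models like DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

The deploy-model command returns an operation ID that you can use to check when the operation is finished. You can poll for the status of the operation until the response includes "done": true. Use the following command to poll the status: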
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
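
Rather than rerunning the describe command by hand, you could wrap it in a simple polling loop. The following is a minimal sketch, not an official tool: the function names are hypothetical, OPERATION_ID and REGION are assumed to be set in your shell, and it assumes that --format="value(done)" prints the done field once the operation has finished.

```shell
# poll_until_done CHECK_FN INTERVAL_SECONDS MAX_ATTEMPTS
# Calls CHECK_FN until it prints "true" (case-insensitive) or the attempts run out.
poll_until_done() {
  check_fn=$1
  interval_seconds=$2
  max_attempts=$3
  attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    status=$("${check_fn}" | tr '[:upper:]' '[:lower:]')
    if [ "${status}" = "true" ]; then
      return 0
    fi
    sleep "${interval_seconds}"
    attempt=$((attempt + 1))
  done
  return 1
}

# Assumption: prints the operation's "done" field, which stays empty while the operation runs.
operation_done() {
  gcloud ai operations describe "${OPERATION_ID}" \
    --region="${REGION}" \
    --format="value(done)"
}

# Example (not run here): check once a minute for up to 90 minutes.
# poll_until_done operation_done 60 90 && echo "Deployment finished"
```

Passing the check as a function keeps the loop independent of gcloud, so the interval and attempt limit can be tuned without touching the command itself.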

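Get online predictions from the deployed model

This section describes how to send an online prediction request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

Get the project number by running the gcloud projects describe command: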

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

Send a raw predict request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
-d '{
   "prompt": "Write a short story about a robot.",
   "stream": false,
   "max_tokens": 50,
   "temperature": 0.7
   }'

Send a chat completion request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
-d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'

To enable streaming, change the value of "stream" from false to true.
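
Both requests above target the same dedicated host and endpoint path and differ only in the suffix. As a sketch, the URLs can be composed once and reused. The sample IDs below are hypothetical placeholders, and the intermediate variable names are assumptions for this example.

```shell
# Hypothetical sample values; in practice these come from the earlier steps.
ENDPOINT_ID="9876543210"
REGION="us-central1"
PROJECT_NUMBER="123456789"

# Dedicated endpoint host and resource path, following the pattern used in the requests above.
DEDICATED_HOST="${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog"
ENDPOINT_PATH="v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}"

RAW_PREDICT_URL="https://${DEDICATED_HOST}/${ENDPOINT_PATH}:rawPredict"
CHAT_COMPLETIONS_URL="https://${DEDICATED_HOST}/${ENDPOINT_PATH}/chat/completions"

echo "${RAW_PREDICT_URL}"
echo "${CHAT_COMPLETIONS_URL}"
```

Quoting the URL variable when you pass it to curl (for example, "${CHAT_COMPLETIONS_URL}") avoids surprises if a value is ever empty or contains unexpected characters.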

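Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:

To undeploy the model from the endpoint and delete the endpoint, run the following commands: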

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet
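
The commands above assume that the endpoint has a single deployed model, which is the case in this tutorial. If an endpoint holds several, the describe output may contain multiple IDs. The sketch below assumes that they are joined with semicolons (the usual list delimiter for the value format) and prints one undeploy command per ID as a dry run. The sample IDs are hypothetical.

```shell
# Hypothetical sample standing in for the output of the describe command above.
DEPLOYED_MODEL_IDS="1111111111;2222222222"

UNDEPLOY_COUNT=0
OLD_IFS=$IFS
IFS=';'
for deployed_model_id in ${DEPLOYED_MODEL_IDS}; do
  # Dry run: print the command instead of executing it.
  echo "gcloud ai endpoints undeploy-model \$ENDPOINT_ID --region=LOCATION --deployed-model-id=${deployed_model_id}"
  UNDEPLOY_COUNT=$((UNDEPLOY_COUNT + 1))
done
IFS=$OLD_IFS

echo "Commands printed: ${UNDEPLOY_COUNT}"
```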

To delete your model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet

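What's next

For comprehensive reference information about multi-host GPU deployment on Vertex AI Prediction with vLLM, see vLLM serving for text and multimodal language models.
Learn how to create your own vLLM multi-node image. Note that your custom container image must be compatible with Vertex AI Prediction.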