
deepseek-ai/DeepSeek-V4-Flash

DeepSeek V4 MoE model with hybrid CSA+HCA attention, manifold-constrained hyper-connections, and three-tier reasoning (Non-think / Think High / Think Max).

MoE · 284B total / 13B active · 1,048,576 context · vLLM 0.20.1+ · text
Guide

Overview

DeepSeek-V4-Flash is a 284B-total / 13B-active MoE model in the V4 preview family. It pairs a hybrid attention stack — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) — with Manifold-Constrained Hyper-Connections (mHC) to reach 27% of V3.2's per-token inference FLOPs and 10% of its KV cache at 1M context. It was pre-trained on 32T+ tokens; post-training is a two-stage pipeline (domain-specific expert cultivation, then unified consolidation via on-policy distillation).

The checkpoint uses mixed FP4+FP8 precision: MoE expert weights are stored in FP4, while the remaining parameters (attention / norm / router) stay in FP8.
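Since the precision split determines weight memory, a back-of-envelope size check helps with capacity planning. The ~274B expert / ~10B non-expert split below is an assumed illustration (the exact breakdown is not published here); only the FP4/FP8 byte widths come from the text above:

```python
def checkpoint_gb(expert_params_b: float, other_params_b: float) -> float:
    """Rough checkpoint size in GB for the FP4+FP8 mixed layout:
    MoE expert weights at 0.5 bytes/param (FP4), everything else
    (attention / norm / router) at 1 byte/param (FP8)."""
    FP4_BYTES = 0.5
    FP8_BYTES = 1.0
    total_bytes = expert_params_b * 1e9 * FP4_BYTES + other_params_b * 1e9 * FP8_BYTES
    return total_bytes / 1e9

# Assumed split summing to the 284B total; adjust once real numbers are known.
size_gb = checkpoint_gb(expert_params_b=274, other_params_b=10)
```

Under that assumed split, weights alone land around 147 GB before KV cache and activation memory.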

Reasoning modes

The chat template exposes three reasoning-effort modes:

  • Non-think — fast, intuitive responses.
  • Think High — explicit chain-of-thought for logical analysis and planning.
  • Think Max — maximum reasoning effort; requires --max-model-len >= 393216 (384K tokens) to avoid truncation.

Recommended sampling: temperature = 1.0, top_p = 1.0.
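As a sanity check on the Think Max requirement, 393,216 tokens is exactly 384 × 1024. A tiny hypothetical helper for validating a chosen --max-model-len before launch:

```python
# Minimum context window for Think Max mode (384K tokens, per the list above).
THINK_MAX_MIN_CTX = 384 * 1024  # 393216

def supports_think_max(max_model_len: int) -> bool:
    """True if the configured --max-model-len can hold a Think Max trace
    without truncation."""
    return max_model_len >= THINK_MAX_MIN_CTX
```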

OpenAI Client Example

For DeepSeek-V4, pass reasoning controls through chat_template_kwargs; the chat template exposes the custom Think Max mode via "reasoning_effort": "max".

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Flash"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)

Basic Serving

Non-disaggregated serving on any supported hardware: single-node DP + EP with --data-parallel-size 4. This fills a GB200 NVL4 tray exactly, and uses 4 of 8 GPUs per replica on H200/B200/B300 (leaving headroom for throughput-vs-latency tuning). For disaggregated prefill/decode on GB200, use the PD Cluster tab.

H200 Single-Node PD (Mooncake)

Single-host disaggregated serving: 4 prefill GPUs + 4 decode GPUs on one 8-GPU H200 node, using MooncakeConnector over RDMA for KV cache transfer.

Prefill (GPUs 0–3, port 8000):

docker run --gpus all \
  --privileged --ipc=host -p 8000:8000 \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /mnt/shared:/mnt/shared \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  vllm/vllm-openai:deepseekv4-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --port 8000 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --max-model-len auto \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --enforce-eager \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_buffer_device":"cuda","kv_connector_extra_config":{"enforce_handshake_compat":false,"mooncake_protocol":"rdma"}}'

Decode (GPUs 4–7, port 8001):

docker run --gpus all \
  --privileged --ipc=host -p 8001:8001 \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /mnt/shared:/mnt/shared \
  -e TILELANG_CLEANUP_TEMP_FILES=1 \
  -e VLLM_DISABLE_COMPILE_CACHE=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_RPC_TIMEOUT=600000 \
  -e VLLM_LOG_STATS_INTERVAL=1 \
  -e VLLM_MOONCAKE_BOOTSTRAP_PORT=9889 \
  -e CUDA_VISIBLE_DEVICES=4,5,6,7 \
  vllm/vllm-openai:deepseekv4-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --port 8001 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --max-model-len auto \
  --max-num-seqs 512 \
  --max-num-batched-tokens 512 \
  --compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY","max_cudagraph_capture_size":512,"compile_ranges_endpoints":[512]}' \
  --no-disable-hybrid-kv-cache-manager \
  --disable-uvicorn-access-log \
  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_buffer_device":"cuda","kv_connector_extra_config":{"enforce_handshake_compat":false,"mooncake_protocol":"rdma"}}'
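The --kv-transfer-config value above is a JSON string, which is easy to mangle with shell quoting. A small hypothetical helper that emits the exact payload shared by the prefill and decode commands:

```python
import json

def mooncake_kv_transfer_config(protocol: str = "rdma") -> str:
    """Serialize the MooncakeConnector config passed to --kv-transfer-config
    in both the prefill and decode commands above."""
    cfg = {
        "kv_connector": "MooncakeConnector",
        "kv_role": "kv_both",
        "kv_load_failure_policy": "fail",
        "kv_buffer_device": "cuda",
        "kv_connector_extra_config": {
            "enforce_handshake_compat": False,
            "mooncake_protocol": protocol,
        },
    }
    # Compact separators reproduce the quoting-friendly one-line form.
    return json.dumps(cfg, separators=(",", ":"))
```

Generating the string in Python and pasting it into the launch script avoids a silently truncated config from a misplaced shell quote.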

Router:

pip install vllm-router

vllm-router --policy round_robin \
  --vllm-pd-disaggregation \
  --prefill http://localhost:8000 \
  --decode http://localhost:8001 \
  --host 127.0.0.1 \
  --port 30000 \
  --intra-node-data-parallel-size 4 \
  --kv-connector mooncake