deepseek-ai/DeepSeek-V4-Flash
DeepSeek V4 MoE model with hybrid CSA+HCA attention, manifold-constrained hyper-connections, and three-tier reasoning (Non-think / Think High / Think Max).
Overview
DeepSeek-V4-Flash is a 284B-total / 13B-active MoE model in the V4 preview family. It pairs a hybrid attention stack — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) — with Manifold-Constrained Hyper-Connections (mHC) to reach 27% of V3.2's per-token inference FLOPs and 10% of V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens; post-training is a two-stage pipeline (domain-specific expert cultivation + unified consolidation via on-policy distillation).
The checkpoint uses mixed FP4+FP8 precision: MoE expert weights are stored in FP4, while the remaining parameters (attention / norm / router) stay in FP8.
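As a back-of-the-envelope illustration of what the mixed-precision checkpoint implies for disk size, the sketch below assumes a hypothetical 270B-expert / 14B-other parameter split (the card only states 284B total / 13B active) and ignores quantization scale/metadata overhead:

```python
# Rough checkpoint-size estimate for an FP4+FP8 mixed checkpoint.
# The 270B vs. 14B expert/non-expert split is an ASSUMPTION for
# illustration only; scale factors and metadata are ignored.
TOTAL_PARAMS = 284e9
EXPERT_PARAMS = 270e9                         # assumed: MoE experts, stored in FP4
OTHER_PARAMS = TOTAL_PARAMS - EXPERT_PARAMS   # attention / norm / router, FP8

FP4_BYTES = 0.5  # 4 bits per parameter
FP8_BYTES = 1.0  # 8 bits per parameter

size_gb = (EXPERT_PARAMS * FP4_BYTES + OTHER_PARAMS * FP8_BYTES) / 1e9
print(f"approximate checkpoint size: {size_gb:.0f} GB")  # -> approximate checkpoint size: 149 GB
```

Under these assumptions the weights land near 149 GB, versus roughly 284 GB for a uniform FP8 checkpoint.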
Reasoning modes
The chat template exposes three reasoning-effort modes:
- Non-think — fast, intuitive responses.
- Think High — explicit chain-of-thought for logical analysis and planning.
- Think Max — maximum reasoning effort; requires --max-model-len >= 393216 (384K tokens) to avoid truncation.
Recommended sampling: temperature = 1.0, top_p = 1.0.
OpenAI Client Example
For DeepSeek-V4, keep reasoning controls in chat_template_kwargs, as it exposes a
custom Think Max mode via "reasoning_effort": "max".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Flash"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)
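When the server is launched with a reasoning parser (as in the deployment commands below), vLLM's OpenAI-compatible responses conventionally carry the chain-of-thought in a reasoning_content field alongside content; the field name here follows that convention and should be verified against your server build. A minimal, self-contained helper:

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style message dict.

    Assumes the reasoning parser puts the chain-of-thought in
    `reasoning_content` and the final answer in `content`; the
    reasoning half is empty for Non-think replies.
    """
    return message.get("reasoning_content") or "", message.get("content") or ""

# Hand-written dict shaped like a Think High reply (not real server output):
msg = {
    "reasoning_content": "17*19 = 17*20 - 17 = 340 - 17 = 323.",
    "content": "323",
}
reasoning, answer = split_reasoning(msg)
print(answer)  # -> 323
```

With the live client above, the equivalent would be reading resp.choices[0].message and applying the same two fields.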
Recommended deployment
Non-disaggregated serving on any supported hardware: single-node DP + EP with --data-parallel-size 4. This fills a GB200 NVL4 tray exactly and uses 4 of 8 GPUs per replica on H200/B200/B300, leaving headroom for throughput-vs-latency tuning.
For disaggregated prefill/decode on GB200, use the PD Cluster tab.
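A sketch of the non-disaggregated single-replica launch, with image tag and flags mirrored from the PD commands below; treat it as a starting point and adjust for your vLLM build:

```shell
# Non-disaggregated serving: one replica, DP=4 + expert parallelism.
# Flags mirror the H200 PD examples in this card; verify against your build.
docker run --gpus all --ipc=host --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:deepseekv4-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --port 8000 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --kv-cache-dtype fp8
```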
H200 Single-Node PD (Mooncake)
Single-host disaggregated serving: 4 prefill GPUs + 4 decode GPUs on one 8-GPU H200 node, using MooncakeConnector over RDMA for KV cache transfer.
Prefill (GPUs 0–3, port 8000):
docker run --gpus all \
--privileged --ipc=host -p 8000:8000 \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /mnt/shared:/mnt/shared \
-e TILELANG_CLEANUP_TEMP_FILES=1 \
-e VLLM_DISABLE_COMPILE_CACHE=1 \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
-e VLLM_RPC_TIMEOUT=600000 \
-e VLLM_LOG_STATS_INTERVAL=1 \
-e VLLM_MOONCAKE_BOOTSTRAP_PORT=8998 \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm/vllm-openai:deepseekv4-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--port 8000 \
--data-parallel-size 4 \
--enable-expert-parallel \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--max-model-len auto \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--enforce-eager \
--no-disable-hybrid-kv-cache-manager \
--disable-uvicorn-access-log \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_buffer_device":"cuda","kv_connector_extra_config":{"enforce_handshake_compat":false,"mooncake_protocol":"rdma"}}'
Decode (GPUs 4–7, port 8001):
docker run --gpus all \
--privileged --ipc=host -p 8001:8001 \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /mnt/shared:/mnt/shared \
-e TILELANG_CLEANUP_TEMP_FILES=1 \
-e VLLM_DISABLE_COMPILE_CACHE=1 \
-e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
-e VLLM_RPC_TIMEOUT=600000 \
-e VLLM_LOG_STATS_INTERVAL=1 \
-e VLLM_MOONCAKE_BOOTSTRAP_PORT=9889 \
-e CUDA_VISIBLE_DEVICES=4,5,6,7 \
vllm/vllm-openai:deepseekv4-cu130 \
deepseek-ai/DeepSeek-V4-Flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--port 8001 \
--data-parallel-size 4 \
--enable-expert-parallel \
--tokenizer-mode deepseek_v4 \
--reasoning-parser deepseek_v4 \
--max-model-len auto \
--max-num-seqs 512 \
--max-num-batched-tokens 512 \
--compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY","max_cudagraph_capture_size":512,"compile_ranges_endpoints":[512]}' \
--no-disable-hybrid-kv-cache-manager \
--disable-uvicorn-access-log \
--kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_buffer_device":"cuda","kv_connector_extra_config":{"enforce_handshake_compat":false,"mooncake_protocol":"rdma"}}'
Router (port 30000):
pip install vllm-router
vllm-router --policy round_robin \
--vllm-pd-disaggregation \
--prefill http://localhost:8000 \
--decode http://localhost:8001 \
--host 127.0.0.1 \
--port 30000 \
--intra-node-data-parallel-size 4 \
--kv-connector mooncake
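Once the router is up, clients talk to port 30000 and the router splits each request across the prefill and decode instances. A quick smoke test (the endpoint path follows the standard OpenAI-compatible convention):

```shell
# Send one chat completion through the PD router.
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": "What is 17*19? Return only the final integer."}]
      }'
```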