deepseek-ai/DeepSeek-V4-Pro
DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning.
Overview
DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active Mixture-of-Experts model. It pairs a hybrid attention stack — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) — with Manifold-Constrained Hyper-Connections (mHC) to reach 27% of V3.2's per-token inference FLOPs and 10% of V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens with the Muon optimizer for faster convergence; post-training is a two-stage pipeline (domain-specific expert cultivation + unified consolidation via on-policy distillation).
Checkpoint is FP4+FP8 mixed: MoE expert weights are stored in FP4 while the remaining (attention / norm / router) params stay in FP8.
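As a back-of-envelope sanity check, the mixed-precision footprint can be estimated from the parameter counts. The expert/non-expert split below is illustrative (it is not published); it assumes roughly 1.28T of the 1.6T total parameters are FP4 expert weights, with the remainder in FP8:

```python
# Rough checkpoint-size estimate for the FP4+FP8 mixed format.
# The 1.28T / 0.32T split is an assumption for illustration only.
FP4_BYTES = 0.5   # 4 bits per weight
FP8_BYTES = 1.0   # 8 bits per weight

total_params = 1.6e12
expert_params = 1.28e12                     # assumed FP4 MoE expert share
other_params = total_params - expert_params  # attention / norm / router in FP8

size_bytes = expert_params * FP4_BYTES + other_params * FP8_BYTES
print(f"~{size_bytes / 1e9:.0f} GB")  # → ~960 GB
```

Under these assumed numbers the estimate lands near the ~960 GB figure cited in the deployment notes below.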
Reasoning modes
The chat template exposes three reasoning-effort modes:
- Non-think — fast, intuitive responses.
- Think High — explicit chain-of-thought for complex problem-solving and planning.
- Think Max — maximum reasoning effort; requires `--max-model-len >= 393216` (384K tokens) to avoid truncation.
Recommended sampling: temperature = 1.0, top_p = 1.0.
OpenAI Client Example
For DeepSeek-V4, keep reasoning controls in `chat_template_kwargs`, since the model exposes a custom Think Max mode via `"reasoning_effort": "max"`.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Pro"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think: fast, intuitive response (default; no chat_template_kwargs needed)
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Think High: explicit chain-of-thought
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max: maximum reasoning effort (server must be launched with >= 384K context)
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)
print(resp.choices[0].message.content)
```
Recommended deployments
- B300 (8× GPU): single-node DP + EP with `--data-parallel-size 8`.
- H200 (8× GPU): DP + EP with `--data-parallel-size 8`. Context is capped at 800K tokens (`--max-model-len 800000`) to leave KV headroom with dense params replicated across ranks; this applies to both single-node and multi-node H200.
- GB200 NVL4 (4× GPU per tray): the ~960 GB mixed-precision checkpoint does not fit on one tray; run multi-node DP + EP across 2 trays (8 GPUs total) with `--data-parallel-size 8`. Pick the "Multi-Node" tab and set nodes to 2.
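The single-node B300 recipe above can be sketched as a launch command. This assumes a vLLM-style server (the flags `--data-parallel-size`, `--enable-expert-parallel`, and `--max-model-len` as in recent vLLM builds); adjust to your serving stack:

```shell
# Sketch only: single-node B300, 8 GPUs, DP + EP.
# 393216 (384K) is the minimum context for Think Max to avoid truncation.
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 393216
```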