tencent/Hy3-preview
Tencent Hunyuan Hy3-preview — scaled-up MoE language model (295B total / 21B active) with a 3.8B MTP layer for speculative decoding, 256K context, and hy_v3 tool/reasoning parsers
Hy3-preview Usage Guide
Hy3-preview is Tencent Hunyuan's latest open-source Mixture-of-Experts language model: 295B total parameters with 21B activated per token, plus a 3.8B Multi-Token Prediction (MTP) layer for speculative decoding. The architecture uses 80 transformer layers, 192 routed experts (top-8 routing) plus 1 shared expert, grouped-query attention (GQA) with 64 query heads over 8 KV heads, and a 256K context window.
A pretrained base checkpoint is published at tencent/Hy3-preview-Base; this recipe covers the instruct model.
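To cross-check the architecture numbers above against what the repository actually ships, you can print the published config with transformers. This is only a sanity check; field names vary by architecture, so the sketch prints the whole config rather than guessing at individual attributes:

from transformers import AutoConfig

# Prints layer count, expert count, head counts, max position embeddings, etc.
# trust_remote_code is only needed if this architecture is not yet included
# in your installed transformers release.
cfg = AutoConfig.from_pretrained("tencent/Hy3-preview", trust_remote_code=True)
print(cfg)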
Setup
Choose one of the following setup methods.
Using Docker
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:hy3-preview tencent/Hy3-preview \
--tensor-parallel-size 8 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
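Once the container is up, a minimal readiness probe (sketch) lists the served models through the standard OpenAI-compatible /v1/models route; it should report the hy3-preview name set above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # prints "hy3-preview" once weights are loaded and the server is ready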
Installing from source
uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv pip install --editable . --torch-backend=auto
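A quick import check confirms the editable install is on the path:

# Verify the editable vLLM install is importable and report its version.
import vllm
print(vllm.__version__)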
Model Deployment
To serve Hy3-preview on 8 GPUs, use H20-3e (141 GB), H200, or other GPUs with at least that much memory per device. Smaller-memory 8-GPU configurations (8×H100 80 GB, 8×A100 80 GB) cannot fit the BF16 weights plus a usable KV cache; use multi-node tensor parallelism for those.
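The back-of-envelope arithmetic behind that note, as a quick Python check (BF16 = 2 bytes per parameter; 295B total parameters from the model description):

# 295B params at 2 bytes each (BF16), split 8 ways by tensor parallelism.
total_params = 295e9
weight_bytes_per_gpu = total_params * 2 / 8
print(f"{weight_bytes_per_gpu / 2**30:.0f} GiB of weights per GPU")  # ~69 GiB
# ~69 GiB of an 80 GB (74.5 GiB) card leaves almost nothing for KV cache and
# activations; on a 141 GB card roughly half the memory remains for the cache.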
Serving on 8×H200 or 8×H20-3e (141 GB)
Without Multi-Token Prediction (MTP):
vllm serve tencent/Hy3-preview \
--tensor-parallel-size 8 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
With MTP (recommended for lower latency):
vllm serve tencent/Hy3-preview \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
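For offline (non-server) use, the same settings can be passed to vLLM's LLM class. A minimal sketch, assuming your vLLM build accepts the speculative_config dict and registers the mtp method for this architecture, mirroring the server flags above:

from vllm import LLM, SamplingParams

# 8-way tensor parallelism plus one MTP draft token, as in the serve command.
llm = LLM(
    model="tencent/Hy3-preview",
    tensor_parallel_size=8,
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)
params = SamplingParams(temperature=0.9, top_p=1.0, max_tokens=256)
print(llm.generate(["Hello."], params)[0].outputs[0].text)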
Sampling and Reasoning Modes
Tencent's recommended sampling parameters: temperature=0.9, top_p=1.0.
Reasoning is controlled via chat_template_kwargs.reasoning_effort:
| Value | Behavior |
|---|---|
| no_think (default) | Direct response, no chain-of-thought |
| low | Light reasoning |
| high | Deep chain-of-thought for math/coding/complex reasoning |
When tools are registered, set interleaved_thinking: true to allow the model to
think between tool calls.
OpenAI Client Example
uv pip install -U openai
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."},
]
# Direct response (default).
resp = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=0.9,
top_p=1.0,
max_tokens=4096,
)
print(resp.choices[0].message.content)
# Deep reasoning: set reasoning_effort (and interleaved_thinking if using tools).
resp_think = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=0.9,
top_p=1.0,
max_tokens=4096,
extra_body={
"chat_template_kwargs": {
"reasoning_effort": "high",
"interleaved_thinking": True,
},
},
)
output_msg = resp_think.choices[0].message
print(output_msg.reasoning_content) # chain-of-thought
print(output_msg.content) # final answer
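Tool calls parsed by the hy_v3 parser come back as structured tool_calls on the message. A sketch with an illustrative get_weather tool, reusing the client above (the tool name and schema here are examples, not part of the model release):

# Illustrative tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp_tools = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
    tools=tools,
    temperature=0.9,
    top_p=1.0,
    extra_body={
        "chat_template_kwargs": {
            "reasoning_effort": "high",
            "interleaved_thinking": True,  # let the model think between tool calls
        },
    },
)
print(resp_tools.choices[0].message.tool_calls)  # parsed by --tool-call-parser hy_v3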
cURL Usage
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hy3-preview",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."}
],
"temperature": 0.9,
"top_p": 1.0,
"max_tokens": 4096
}'
Benchmarking
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching to
the server command.
The following uses 8×H20-3e (141 GB) as an example.
vllm bench serve \
--model tencent/Hy3-preview \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 1024 \
--max-concurrency 32 \
--num-prompts 160 \
--served-model-name hy3-preview
Representative output:
============ Serving Benchmark Result ============
Successful requests: 160
Failed requests: 0
Maximum request concurrency: 32
Benchmark duration (s): 280.58
Total input tokens: 1310720
Total generated tokens: 163840
Request throughput (req/s): 0.57
Output token throughput (tok/s): 583.93
Peak output token throughput (tok/s): 1024.00
Peak concurrent requests: 36.00
Total token throughput (tok/s): 5255.36
---------------Time to First Token----------------
Mean TTFT (ms): 4542.41
Median TTFT (ms): 2762.17
P99 TTFT (ms): 21062.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 50.34
Median TPOT (ms): 51.77
P99 TPOT (ms): 54.07
---------------Inter-token Latency----------------
Mean ITL (ms): 50.34
Median ITL (ms): 34.32
P99 ITL (ms): 689.10
==================================================
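The headline throughput figures are internally consistent and easy to re-derive from the totals in the report:

# Re-derive the throughput figures from the run above.
duration_s = 280.58
generated = 163_840                # 160 prompts × 1024 output tokens
total = 1_310_720 + generated      # input tokens + generated tokens
print(generated / duration_s)      # ≈ 583.9 output tok/s
print(total / duration_s)          # ≈ 5255.4 total tok/s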