tencent/Hy3-preview
Tencent Hunyuan Hy3-preview — scaled-up MoE language model (295B total / 21B active) with a 3.8B MTP layer for speculative decoding, 256K context, and hy_v3 tool/reasoning parsers
Hy3-preview Usage Guide
Hy3-preview is Tencent Hunyuan's latest open-source Mixture-of-Experts language model: 295B total parameters with 21B activated per token, plus a 3.8B Multi-Token Prediction (MTP) layer for speculative decoding. The architecture uses 80 transformer layers, 192 routed experts (top-8 routing) plus 1 shared expert, grouped-query attention (GQA) with 64 query heads over 8 KV heads, and a 256K context window.
A pretrained base checkpoint is published at tencent/Hy3-preview-Base; this recipe covers the instruct model.
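To cross-check the architecture numbers above against what the repository actually ships, you can print the published config with transformers. This is only a sanity check; field names vary by architecture, so the sketch prints the whole config rather than guessing at individual attributes:

from transformers import AutoConfig

# Prints layer count, expert count, head counts, max position embeddings, etc.
# trust_remote_code is only needed if this architecture is not yet included
# in your installed transformers release.
cfg = AutoConfig.from_pretrained("tencent/Hy3-preview", trust_remote_code=True)
print(cfg)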
Setup
Choose one of the following setup methods.
Using Docker
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:hy3-preview tencent/Hy3-preview \
--tensor-parallel-size 8 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
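Once the container is up, a minimal readiness probe (sketch) lists the served models through the standard OpenAI-compatible /v1/models route; it should report the hy3-preview name set above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # prints "hy3-preview" once weights are loaded and the server is ready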
Installing from source
uv venv --python 3.12 --seed --managed-python
source .venv/bin/activate
git clone https://github.com/vllm-project/vllm.git
cd vllm
uv pip install --editable . --torch-backend=auto
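A quick import check confirms the editable install is on the path:

# Verify the editable vLLM install is importable and report its version.
import vllm
print(vllm.__version__)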
Model Deployment
To serve Hy3-preview on 8 GPUs, use H20-3e (141 GB), H200, or other GPUs with at least that much memory per device. Smaller-memory 8-GPU configurations (8×H100 80 GB, 8×A100 80 GB) cannot fit the BF16 weights plus a usable KV cache; use multi-node tensor parallelism for those.
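The back-of-envelope arithmetic behind that note, as a quick Python check (BF16 = 2 bytes per parameter; 295B total parameters from the model description):

# 295B params at 2 bytes each (BF16), split 8 ways by tensor parallelism.
total_params = 295e9
weight_bytes_per_gpu = total_params * 2 / 8
print(f"{weight_bytes_per_gpu / 2**30:.0f} GiB of weights per GPU")  # ~69 GiB
# ~69 GiB of an 80 GB (74.5 GiB) card leaves almost nothing for KV cache and
# activations; on a 141 GB card roughly half the memory remains for the cache.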
Serving on 8×H200 or 8×H20-3e (141 GB)
Without Multi-Token Prediction (MTP):
vllm serve tencent/Hy3-preview \
--tensor-parallel-size 8 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
With MTP (recommended for lower latency):
vllm serve tencent/Hy3-preview \
--tensor-parallel-size 8 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--served-model-name hy3-preview
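For offline (non-server) use, the same settings can be passed to vLLM's LLM class. A minimal sketch, assuming your vLLM build accepts the speculative_config dict and registers the mtp method for this architecture, mirroring the server flags above:

from vllm import LLM, SamplingParams

# 8-way tensor parallelism plus one MTP draft token, as in the serve command.
llm = LLM(
    model="tencent/Hy3-preview",
    tensor_parallel_size=8,
    speculative_config={"method": "mtp", "num_speculative_tokens": 1},
)
params = SamplingParams(temperature=0.9, top_p=1.0, max_tokens=256)
print(llm.generate(["Hello."], params)[0].outputs[0].text)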
Sampling and Reasoning Modes
Tencent's recommended sampling parameters: temperature=0.9, top_p=1.0.
Reasoning is controlled via chat_template_kwargs.reasoning_effort:
| Value | Behavior |
|---|---|
| no_think (default) | Direct response, no chain-of-thought |
| low | Light reasoning |
| high | Deep chain-of-thought for math/coding/complex reasoning |
When tools are registered, set interleaved_thinking: true to allow the model to
think between tool calls.
OpenAI Client Example
uv pip install -U openai
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."},
]
# Direct response (default).
resp = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=0.9,
top_p=1.0,
max_tokens=4096,
)
print(resp.choices[0].message.content)
# Deep reasoning: set reasoning_effort (and interleaved_thinking if using tools).
resp_think = client.chat.completions.create(
model="hy3-preview",
messages=messages,
temperature=0.9,
top_p=1.0,
max_tokens=4096,
extra_body={
"chat_template_kwargs": {
"reasoning_effort": "high",
"interleaved_thinking": True,
},
},
)
output_msg = resp_think.choices[0].message
print(output_msg.reasoning_content) # chain-of-thought
print(output_msg.content) # final answer
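Tool calls parsed by the hy_v3 parser come back as structured tool_calls on the message. A sketch with an illustrative get_weather tool, reusing the client above (the tool name and schema here are examples, not part of the model release):

# Illustrative tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp_tools = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "What is the weather in Shenzhen?"}],
    tools=tools,
    temperature=0.9,
    top_p=1.0,
    extra_body={
        "chat_template_kwargs": {
            "reasoning_effort": "high",
            "interleaved_thinking": True,  # let the model think between tool calls
        },
    },
)
print(resp_tools.choices[0].message.tool_calls)  # parsed by --tool-call-parser hy_v3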
cURL Usage
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "hy3-preview",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello."}
],
"temperature": 0.9,
"top_p": 1.0,
"max_tokens": 4096
}'
Benchmarking
For benchmarking, disable prefix caching by adding --no-enable-prefix-caching to
the server command.
The following uses 8×H20-3e (141 GB) as an example.
vllm bench serve \
--model tencent/Hy3-preview \
--dataset-name random \
--random-input-len 8192 \
--random-output-len 1024 \
--max-concurrency 32 \
--num-prompts 160 \
--served-model-name hy3-preview
Representative output:
============ Serving Benchmark Result ============
Successful requests: 160
Failed requests: 0
Maximum request concurrency: 32
Benchmark duration (s): 280.58
Total input tokens: 1310720
Total generated tokens: 163840
Request throughput (req/s): 0.57
Output token throughput (tok/s): 583.93
Peak output token throughput (tok/s): 1024.00
Peak concurrent requests: 36.00
Total token throughput (tok/s): 5255.36
---------------Time to First Token----------------
Mean TTFT (ms): 4542.41
Median TTFT (ms): 2762.17
P99 TTFT (ms): 21062.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 50.34
Median TPOT (ms): 51.77
P99 TPOT (ms): 54.07
---------------Inter-token Latency----------------
Mean ITL (ms): 50.34
Median ITL (ms): 34.32
P99 ITL (ms): 689.10
==================================================
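The headline throughput figures are internally consistent and easy to re-derive from the totals in the report:

# Re-derive the throughput figures from the run above.
duration_s = 280.58
generated = 163_840                # 160 prompts × 1024 output tokens
total = 1_310_720 + generated      # input tokens + generated tokens
print(generated / duration_s)      # ≈ 583.9 output tok/s
print(total / duration_s)          # ≈ 5255.4 total tok/s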