# InternLM2 Model Export for ONNX Runtime GenAI
This example demonstrates how to export InternLM2 models to ONNX format using ONNX Runtime GenAI.
## Supported Models
All InternLM2 model sizes are supported:
- ✅ InternLM2-1.8B - Tested and verified
- ✅ InternLM2-7B - Tested and verified
- ✅ InternLM2-20B - Fully compatible
- ✅ InternLM2-Chat variants - All sizes supported
The implementation is architecture-based and automatically adapts to any InternLM2 model size.
## Model Architecture
InternLM2 uses a Llama-based architecture with the following key features:
- Attention: Grouped Query Attention (GQA) with grouped/interleaved QKV layout
- Normalization: RMSNorm (eps: 1e-05)
- Activation: SiLU
- Positional Encoding: RoPE with theta=1,000,000
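These values can be read straight from the Hugging Face config. A minimal sketch for inspecting them (the attribute names assume the standard InternLM2 `config.json`, which requires `trust_remote_code` to load):

```python
from transformers import AutoConfig

# Load the InternLM2 config (custom model code, so trust_remote_code is required)
cfg = AutoConfig.from_pretrained("internlm/internlm2-1_8b", trust_remote_code=True)

print(cfg.num_attention_heads, cfg.num_key_value_heads)   # expected: 16 8
print(cfg.rms_norm_eps, cfg.rope_theta, cfg.hidden_act)   # expected: 1e-05 1000000.0 silu
```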
### Architecture Specifications
| Parameter | 1.8B | 7B | 20B |
|---|---|---|---|
| Hidden Size | 2048 | 4096 | 6144 |
| Num Layers | 24 | 32 | 48 |
| Q Heads | 16 | 32 | 48 |
| KV Heads | 8 | 8 | 8 |
| Head Dim | 128 | 128 | 128 |
| Intermediate Size | 8192 | 14336 | 16384 |
| GQA Ratio | 2:1 | 4:1 | 6:1 |
| Context Length | 32,768 | 32,768 | 32,768 |
| Vocab Size | 92,544 | 92,544 | 92,544 |
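The head dimension and GQA ratio follow directly from the other values in the table; a quick consistency check:

```python
# Derive head_dim and GQA ratio from the table values above
specs = {
    "1.8B": {"hidden_size": 2048, "q_heads": 16, "kv_heads": 8},
    "7B":   {"hidden_size": 4096, "q_heads": 32, "kv_heads": 8},
    "20B":  {"hidden_size": 6144, "q_heads": 48, "kv_heads": 8},
}
for name, s in specs.items():
    head_dim = s["hidden_size"] // s["q_heads"]   # 128 for every size
    gqa_ratio = s["q_heads"] // s["kv_heads"]     # 2, 4, 6
    print(f"InternLM2-{name}: head_dim={head_dim}, GQA ratio={gqa_ratio}:1")
```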
## Export Examples

### InternLM2-1.8B
**FP32 (Best quality baseline):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-fp32 \
  --precision fp32 \
  --execution_provider cpu
```
**INT4 RTN (Fast quantization):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-int4 \
  --precision int4 \
  --execution_provider cpu
```
**INT4 AWQ (Best quality, recommended):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-1_8b \
  --output ./internlm2-1.8b-cpu-int4-awq \
  --precision int4 \
  --execution_provider cpu \
  --extra_options int4_accuracy_level=4
```
### InternLM2-7B
**INT4 AWQ CPU (Recommended for most users):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cpu-int4-awq \
  --precision int4 \
  --execution_provider cpu \
  --extra_options int4_accuracy_level=4
```
**INT4 AWQ CUDA (For GPU inference):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cuda-int4-awq \
  --precision int4 \
  --execution_provider cuda \
  --extra_options int4_accuracy_level=4
```
**FP16 CUDA (Highest quality on GPU):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-7b \
  --output ./internlm2-7b-cuda-fp16 \
  --precision fp16 \
  --execution_provider cuda
```
### InternLM2-20B
**INT4 AWQ CUDA (Recommended):**

```bash
python -m onnxruntime_genai.models.builder \
  --input internlm/internlm2-20b \
  --output ./internlm2-20b-cuda-int4-awq \
  --precision int4 \
  --execution_provider cuda \
  --extra_options int4_accuracy_level=4
```
## Model Size & Performance
| Model | Original Size | INT4 Quantized | FP16 | Recommended RAM |
|---|---|---|---|---|
| InternLM2-1.8B | ~3.6 GB | ~1.0 GB | ~3.6 GB | 4 GB |
| InternLM2-7B | ~14 GB | ~3.8 GB | ~14 GB | 8 GB |
| InternLM2-20B | ~40 GB | ~10.5 GB | ~40 GB | 24 GB |
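The INT4 figures are roughly what 4-bit weights imply. A back-of-the-envelope estimate (the ~5% overhead factor for quantization scales and higher-precision layers is an assumption, not the exact on-disk size):

```python
# Rough estimate only: 4-bit weights = 0.5 bytes/parameter, plus an assumed ~5%
# overhead for quantization scales, zero points, and layers kept at higher precision.
for name, params in [("1.8B", 1.8e9), ("7B", 7e9), ("20B", 20e9)]:
    int4_gb = params * 0.5 * 1.05 / 1e9
    print(f"InternLM2-{name}: ~{int4_gb:.1f} GB INT4 (approx.)")
```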
**CPU Inference (Approximate):**
| Model | Min RAM | Recommended RAM | Typical Speed |
|---|---|---|---|
| 1.8B INT4 | 4 GB | 8 GB | 8-12 tok/s |
| 7B INT4 | 8 GB | 16 GB | 2-4 tok/s |
| 20B INT4 | 16 GB | 32 GB | 0.5-1 tok/s |
**GPU Inference (CUDA):**
| Model | Min VRAM | Recommended VRAM | Typical Speed |
|---|---|---|---|
| 1.8B INT4 | 2 GB | 4 GB | 50-80 tok/s |
| 7B INT4 | 6 GB | 8 GB | 30-50 tok/s |
| 7B FP16 | 14 GB | 16 GB | 40-60 tok/s |
| 20B INT4 | 12 GB | 16 GB | 20-30 tok/s |
| 20B FP16 | 40 GB | 48 GB | 25-35 tok/s |
## Inference Example
```python
import onnxruntime_genai as og

# Works with any InternLM2 size
model = og.Model("./internlm2-7b-cpu-int4-awq")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set generation parameters
prompt = "What is the meaning of life?"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
)

# Generate text token by token
generator = og.Generator(model, params)
generator.append_tokens(tokens)

print(prompt, end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
```
## Why Multi-Size Support Works

### Architecture-Based Implementation
The implementation is size-agnostic because it:
1. **Dynamically reads config parameters** from each model: `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `num_hidden_layers`, `intermediate_size`
2. **Uses config-driven weight splitting:**

   ```python
   # Reads from model config
   num_q_heads = config.num_attention_heads      # 16 for 1.8B, 32 for 7B, 48 for 20B
   num_kv_heads = config.num_key_value_heads     # Always 8 for InternLM2
   head_dim = config.hidden_size // num_q_heads  # Always 128

   # Calculates group size dynamically
   num_kv_groups = num_q_heads // num_kv_heads   # 2 for 1.8B, 4 for 7B, 6 for 20B
   group_size = num_kv_groups + 2
   ```

3. **Handles grouped QKV layout** for any GQA ratio (see the sketch after this list):
   - Layout: `[Group0: Q0,Q1,...,K0,V0 | Group1: Q2,Q3,...,K1,V1 | ...]`
   - Each KV group contains multiple Q heads followed by K and V
   - Correctly extracts weights regardless of the Q/KV head ratio
4. **No hardcoded sizes** anywhere in the code
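Below is a minimal NumPy sketch of that split, assuming the grouped layout described above and a packed `wqkv` weight of shape `[num_kv_heads * group_size * head_dim, hidden_size]`; the function and variable names are illustrative, not the builder's actual code:

```python
import numpy as np

def split_grouped_qkv(wqkv, num_q_heads, num_kv_heads, head_dim):
    """Split a grouped/interleaved wqkv weight into Q, K, V projection matrices."""
    hidden_size = wqkv.shape[-1]
    num_q_per_kv = num_q_heads // num_kv_heads   # 2 / 4 / 6 for 1.8B / 7B / 20B
    group_size = num_q_per_kv + 2                # Q heads per group, plus one K and one V

    # Assumed layout: one group per KV head -> [Q0..Qn, K, V] repeated num_kv_heads times
    w = wqkv.reshape(num_kv_heads, group_size, head_dim, hidden_size)
    q = w[:, :num_q_per_kv].reshape(num_q_heads * head_dim, hidden_size)
    k = w[:, -2].reshape(num_kv_heads * head_dim, hidden_size)
    v = w[:, -1].reshape(num_kv_heads * head_dim, hidden_size)
    return q, k, v

# Shape check with 7B-sized dimensions and random data
hidden, q_heads, kv_heads, hd = 4096, 32, 8, 128
wqkv = np.random.randn(kv_heads * (q_heads // kv_heads + 2) * hd, hidden).astype(np.float32)
q, k, v = split_grouped_qkv(wqkv, q_heads, kv_heads, hd)
print(q.shape, k.shape, v.shape)  # (4096, 4096) (1024, 4096) (1024, 4096)
```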
## Key Implementation Notes
**Grouped QKV Layout:**
- InternLM2 uses a grouped/interleaved QKV weight layout for efficient Grouped Query Attention
- The implementation in `src/python/py/models/builders/internlm.py` correctly handles this layout during weight extraction
**Model Configuration:**
- The exported model uses `model_type: "llama"` for ONNX Runtime GenAI compatibility
- The tokenizer uses `tokenizer_class: "LlamaTokenizer"` (SentencePiece-based)
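To confirm what the builder wrote, inspect the `genai_config.json` in the export folder; a minimal check, assuming the default output layout:

```python
import json
from pathlib import Path

# Read the generated config from the export folder
cfg = json.loads((Path("./internlm2-7b-cpu-int4-awq") / "genai_config.json").read_text())
print(cfg["model"]["type"])  # expected: "llama"
```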
## Recommendations by Use Case

### Development & Testing
- InternLM2-1.8B INT4 AWQ (1 GB)
- Fast iteration, quick testing
- Good for prototyping
### Production Applications
- InternLM2-7B INT4 AWQ (3.8 GB)
- Best balance of quality and performance
- Suitable for most real-world applications
### High-Quality Applications
- InternLM2-7B FP16 CUDA (14 GB) or
- InternLM2-20B INT4 CUDA (10.5 GB)
- Maximum quality for critical applications
## Troubleshooting

### "Out of Memory" errors
- Use INT4 quantization instead of FP16/FP32
- Enable GPU inference for larger models
- Use batch_size=1 for inference
### Slow inference on CPU
- This is expected for 7B+ models
- Consider GPU inference
- Use INT4 quantization (2-3x faster than FP16)
### Model not loading
- Ensure you have enough RAM/VRAM
- Check that you're using `--execution_provider cuda` for GPU models
- Verify your ONNX Runtime GenAI installation
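A quick way to rule out an installation problem (sketch; the version attribute is assumed and may not exist in every build):

```python
# If this import fails, the onnxruntime-genai (or onnxruntime-genai-cuda) wheel
# is not installed correctly in the active environment.
import onnxruntime_genai as og
print(getattr(og, "__version__", "installed (version attribute unavailable)"))
```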
## References
- Model Hub (1.8B): https://huggingface.co/internlm/internlm2-1_8b
- Model Hub (7B): https://huggingface.co/internlm/internlm2-7b
- Model Hub (20B): https://huggingface.co/internlm/internlm2-20b
- Paper: https://arxiv.org/abs/2403.17297
- GitHub: https://github.com/InternLM/InternLM