
InternLM2 Model Export for ONNX Runtime GenAI

This example demonstrates how to export InternLM2 models to ONNX format using ONNX Runtime GenAI.

Supported Models

All InternLM2 model sizes are supported:

  • ✅ InternLM2-1.8B - Tested and verified
  • ✅ InternLM2-7B - Tested and verified
  • ✅ InternLM2-20B - Fully compatible
  • ✅ InternLM2-Chat variants - All sizes supported

The implementation is architecture-based and automatically adapts to any InternLM2 model size.

Model Architecture

InternLM2 uses a Llama-based architecture with the following key features:

  • Attention: Grouped Query Attention (GQA) with grouped/interleaved QKV layout
  • Normalization: RMSNorm (eps: 1e-05)
  • Activation: SiLU
  • Positional Encoding: RoPE with theta=1,000,000
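
These values can be read straight from the Hugging Face config. A minimal sketch (assuming the transformers package is installed and the internlm/internlm2-7b repo id; InternLM2 configs require trust_remote_code):

from transformers import AutoConfig

# Inspect the architecture parameters listed above
config = AutoConfig.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
print(config.rms_norm_eps)    # 1e-05
print(config.rope_theta)      # 1000000
print(config.hidden_act)      # "silu"
print(config.num_attention_heads, config.num_key_value_heads)  # 32, 8 for the 7B model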

Architecture Specifications

Parameter           1.8B     7B       20B
Hidden Size         2048     4096     6144
Num Layers          24       32       48
Q Heads             16       32       48
KV Heads            8        8        8
Head Dim            128      128      128
Intermediate Size   8192     14336    16384
GQA Ratio           2:1      4:1      6:1
Context Length      32,768   32,768   32,768
Vocab Size          92,544   92,544   92,544
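
The derived quantities in the table follow directly from the base parameters. A small sketch of that arithmetic, using the values listed above:

# hidden_size and head counts per model size, as listed in the table above
specs = {
    "1.8B": {"hidden_size": 2048, "q_heads": 16, "kv_heads": 8},
    "7B":   {"hidden_size": 4096, "q_heads": 32, "kv_heads": 8},
    "20B":  {"hidden_size": 6144, "q_heads": 48, "kv_heads": 8},
}

for name, s in specs.items():
    head_dim = s["hidden_size"] // s["q_heads"]   # 128 for every size
    gqa_ratio = s["q_heads"] // s["kv_heads"]     # 2, 4, 6
    print(f"{name}: head_dim={head_dim}, GQA ratio={gqa_ratio}:1")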

Export Examples

InternLM2-1.8B

FP32 (Best quality baseline):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-1_8b \
    --output ./internlm2-1.8b-cpu-fp32 \
    --precision fp32 \
    --execution_provider cpu

INT4 RTN (Fast quantization):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-1_8b \
    --output ./internlm2-1.8b-cpu-int4 \
    --precision int4 \
    --execution_provider cpu

INT4 AWQ (Best quality, recommended):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-1_8b \
    --output ./internlm2-1.8b-cpu-int4-awq \
    --precision int4 \
    --execution_provider cpu \
    --extra_options int4_accuracy_level=4

InternLM2-7B

INT4 AWQ CPU (Recommended for most users):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-7b \
    --output ./internlm2-7b-cpu-int4-awq \
    --precision int4 \
    --execution_provider cpu \
    --extra_options int4_accuracy_level=4

INT4 AWQ CUDA (For GPU inference):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-7b \
    --output ./internlm2-7b-cuda-int4-awq \
    --precision int4 \
    --execution_provider cuda \
    --extra_options int4_accuracy_level=4

FP16 CUDA (Highest quality on GPU):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-7b \
    --output ./internlm2-7b-cuda-fp16 \
    --precision fp16 \
    --execution_provider cuda

InternLM2-20B

INT4 AWQ CUDA (Recommended):

python -m onnxruntime_genai.models.builder \
    --input internlm/internlm2-20b \
    --output ./internlm2-20b-cuda-int4-awq \
    --precision int4 \
    --execution_provider cuda \
    --extra_options int4_accuracy_level=4

Model Size & Performance

Model            Original (FP16/BF16)   INT4 Export   FP16 Export   Recommended RAM
InternLM2-1.8B   ~3.6 GB                ~1.0 GB       ~3.6 GB       4 GB
InternLM2-7B     ~14 GB                 ~3.8 GB       ~14 GB        8 GB
InternLM2-20B    ~40 GB                 ~10.5 GB      ~40 GB        24 GB
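
The INT4 figures are roughly what a half-byte-per-weight estimate predicts; quantization scales, embeddings, and the output head add some overhead. A back-of-the-envelope sketch, not an exact accounting:

# Rough size estimate: parameters * bits / 8, ignoring scales and metadata
def estimate_size_gb(num_params_billion, bits):
    return num_params_billion * 1e9 * bits / 8 / 1e9

for name, params in [("1.8B", 1.8), ("7B", 7.0), ("20B", 20.0)]:
    print(f"{name}: INT4 ~{estimate_size_gb(params, 4):.1f} GB, "
          f"FP16 ~{estimate_size_gb(params, 16):.1f} GB")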

CPU Inference (Approximate):

Model       Min RAM   Recommended RAM   Typical Speed
1.8B INT4   4 GB      8 GB              8-12 tok/s
7B INT4     8 GB      16 GB             2-4 tok/s
20B INT4    16 GB     32 GB             0.5-1 tok/s

GPU Inference (CUDA):

Model       Min VRAM   Recommended VRAM   Typical Speed
1.8B INT4   2 GB       4 GB               50-80 tok/s
7B INT4     6 GB       8 GB               30-50 tok/s
7B FP16     14 GB      16 GB              40-60 tok/s
20B INT4    12 GB      16 GB              20-30 tok/s
20B FP16    40 GB      48 GB              25-35 tok/s

Inference Example

import onnxruntime_genai as og

# Works with any InternLM2 size!
model = og.Model("./internlm2-7b-cpu-int4-awq")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set generation parameters
prompt = "What is the meaning of life?"
tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    top_k=40
)

# Generate text
generator = og.Generator(model, params)
generator.append_tokens(tokens)

print(prompt, end="", flush=True)
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
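
If you export one of the InternLM2-Chat variants, the prompt should follow the model's chat template rather than raw text. InternLM2-Chat uses a ChatML-style format with <|im_start|>/<|im_end|> markers; the helper below is a sketch only, so verify the exact template against the chat model's tokenizer configuration before relying on it.

# Hypothetical helper for InternLM2-Chat prompts; confirm the template against
# the chat model's tokenizer config before relying on it.
def build_chat_prompt(user_message, system_message="You are a helpful assistant."):
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

tokens = tokenizer.encode(build_chat_prompt("What is the meaning of life?"))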

Why Multi-Size Support Works

Architecture-Based Implementation

The implementation is size-agnostic because it:

  1. Dynamically reads config parameters from each model:

    • num_attention_heads
    • num_key_value_heads
    • hidden_size
    • num_hidden_layers
    • intermediate_size
  2. Uses config-driven weight splitting:

    # Reads from model config
    num_q_heads = config.num_attention_heads  # 16 for 1.8B, 32 for 7B, 48 for 20B
    num_kv_heads = config.num_key_value_heads  # Always 8 for InternLM2
    head_dim = config.hidden_size // num_q_heads  # Always 128
    
    # Calculates group size dynamically
    num_kv_groups = num_q_heads // num_kv_heads  # 2 for 1.8B, 4 for 7B, 6 for 20B
    group_size = num_kv_groups + 2  # Q heads per KV group, plus one K and one V head
    
  3. Handles grouped QKV layout for any GQA ratio (see the sketch after this list):

    • Layout: [Group0: Q0,Q1,...,K0,V0 | Group1: Q2,Q3,...,K1,V1 | ...]
    • Each KV group contains multiple Q heads followed by K and V
    • Correctly extracts weights regardless of the Q/KV head ratio
  4. No hardcoded sizes anywhere in the code
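
The grouped layout in item 3 can be unpacked with a single reshape. The sketch below is illustrative (the names wqkv and split_wqkv are not from the builder code) and assumes a [out_features, hidden_size] weight as stored by PyTorch Linear layers:

import numpy as np

# Split a packed wqkv weight into separate Q, K, V matrices, following the
# grouped layout described above: each KV group holds its Q heads, then K, then V.
def split_wqkv(wqkv, num_q_heads, num_kv_heads, head_dim, hidden_size):
    q_per_group = num_q_heads // num_kv_heads            # Q heads in each KV group
    grouped = wqkv.reshape(num_kv_heads, q_per_group + 2, head_dim, hidden_size)

    q = grouped[:, :q_per_group].reshape(num_q_heads * head_dim, hidden_size)
    k = grouped[:, -2].reshape(num_kv_heads * head_dim, hidden_size)
    v = grouped[:, -1].reshape(num_kv_heads * head_dim, hidden_size)
    return q, k, v

# Example with the 7B shapes from the table above (4:1 GQA ratio)
wqkv = np.zeros(((32 + 2 * 8) * 128, 4096), dtype=np.float32)
q, k, v = split_wqkv(wqkv, num_q_heads=32, num_kv_heads=8, head_dim=128, hidden_size=4096)
print(q.shape, k.shape, v.shape)  # (4096, 4096) (1024, 4096) (1024, 4096)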

Key Implementation Notes

Grouped QKV Layout:

  • InternLM2 uses a grouped/interleaved QKV weight layout for efficient Grouped Query Attention
  • The implementation in src/python/py/models/builders/internlm.py correctly handles this layout during weight extraction

Model Configuration:

  • The exported model uses model_type: "llama" for ONNX Runtime GenAI compatibility
  • Tokenizer uses tokenizer_class: "LlamaTokenizer" (SentencePiece-based)
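
To confirm this after export, you can read the generated config. A sketch, assuming the export path from the 7B example; the exact key layout of genai_config.json may vary between ONNX Runtime GenAI versions:

import json

# Check that the exported model advertises the llama model type
with open("./internlm2-7b-cpu-int4-awq/genai_config.json") as f:
    genai_config = json.load(f)

print(genai_config["model"]["type"])  # expected: "llama"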

Recommendations by Use Case

Development & Testing

  • InternLM2-1.8B INT4 AWQ (1 GB)
  • Fast iteration, quick testing
  • Good for prototyping

Production Applications

  • InternLM2-7B INT4 AWQ (3.8 GB)
  • Best balance of quality and performance
  • Suitable for most real-world applications

High-Quality Applications

  • InternLM2-7B FP16 CUDA (14 GB) or
  • InternLM2-20B INT4 CUDA (10.5 GB)
  • Maximum quality for critical applications

Troubleshooting

"Out of Memory" errors

  • Use INT4 quantization instead of FP16/FP32
  • Enable GPU inference for larger models
  • Use batch_size=1 for inference

Slow inference on CPU

  • This is expected for 7B+ models
  • Consider GPU inference
  • Use INT4 quantization (typically 2-3x faster than FP16 on CPU)

Model not loading

  • Ensure you have enough RAM/VRAM
  • Check that you're using --execution_provider cuda for GPU models
  • Verify ONNX Runtime GenAI installation
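
A quick sanity check before debugging further; this is a sketch, with the model path taken from the 7B example above:

import os
import onnxruntime_genai as og

# Confirm the package imports and the export folder contains the expected files
model_dir = "./internlm2-7b-cpu-int4-awq"   # adjust to your export path
print(sorted(os.listdir(model_dir)))        # expect genai_config.json, *.onnx(.data), tokenizer files
model = og.Model(model_dir)                 # fails here if required files are missing or memory runs out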
