---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
- polypythias
license: apache-2.0
datasets:
- EleutherAI/pile
- EleutherAI/pile-preshuffled-seeds
library_name: transformers
arxiv: 2503.09543
---

# PolyPythias

This model is part of the **PolyPythias** suite, an extension of the [Pythia](https://github.com/EleutherAI/pythia) project providing 45 additional training runs across 5 model sizes with 9 different random seeds each. These models enable systematic study of training stability and reproducibility in language models.

## Paper

**[PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs](https://arxiv.org/abs/2503.09543)**

Oskar van der Wal, Pietro Lesci, Max Muller-Eberstein, Naomi Saphra, Hailey Schoelkopf, Willem Zuidema, and Stella Biderman. *ICLR 2025*.

## Model Details

| Size | Parameters | Layers | Model Dim | Heads | Original Model |
|------|------------|--------|-----------|-------|----------------|
| 14M  | 14M  | 6  | 128  | 4  | [pythia-14m](https://huggingface.co/EleutherAI/pythia-14m) |
| 31M  | 31M  | 6  | 256  | 8  | [pythia-31m](https://huggingface.co/EleutherAI/pythia-31m) |
| 70M  | 70M  | 6  | 512  | 8  | [pythia-70m](https://huggingface.co/EleutherAI/pythia-70m) |
| 160M | 160M | 12 | 768  | 12 | [pythia-160m](https://huggingface.co/EleutherAI/pythia-160m) |
| 410M | 410M | 24 | 1024 | 16 | [pythia-410m](https://huggingface.co/EleutherAI/pythia-410m) |

All models were trained on 300B tokens from [The Pile](https://pile.eleuther.ai/).

## Naming Convention

- **`pythia-{size}m`** - Original Pythia model (seed 1234)
- **`pythia-{size}m-seed{1-9}`** - PolyPythias variants with different random seeds
- **`pythia-160m-data-seed{1-3}`** - 160M models with only data ordering varied (weight init fixed)
- **`pythia-160m-weight-seed{1-3}`** - 160M models with only weight initialization varied (data order fixed)

The decoupled seed variants (data-seed and weight-seed) allow researchers to separately study the effects of data ordering vs. weight initialization.

## Quick Start

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the final checkpoint
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-70m-seed3")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-seed3")

# Generate text
inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

## Available Checkpoints

Each model provides **154 intermediate checkpoints** saved as Git branches:

| Checkpoint | Training Tokens | Description |
|------------|-----------------|-------------|
| `step0` | 0 | Initialization (before training) |
| `step1`, `step2`, `step4`, ..., `step512` | 2M - 1B | 10 log-spaced early checkpoints |
| `step1000`, `step2000`, ..., `step143000` | 2B - 300B | 143 evenly-spaced checkpoints |

To load a specific checkpoint:

```python
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-seed3",
    revision="step50000",  # any checkpoint step
)
```
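The step branches also make it straightforward to sweep over checkpoints (and seeds) programmatically. Below is a minimal sketch, assuming `huggingface_hub`, `torch`, and `transformers` are installed: it lists the available `step*` branches with `list_repo_refs` and computes next-token loss on a fixed prompt at a few checkpoints, as one example of a per-checkpoint measurement.

```python
import re

import torch
from huggingface_hub import list_repo_refs
from transformers import AutoTokenizer, GPTNeoXForCausalLM

repo_id = "EleutherAI/pythia-70m-seed3"

# Checkpoints are stored as Git branches named "step<N>"; list and sort them.
refs = list_repo_refs(repo_id)
steps = sorted(
    int(m.group(1))
    for ref in refs.branches
    if (m := re.fullmatch(r"step(\d+)", ref.name))
)
print(f"{len(steps)} checkpoints, from step{steps[0]} to step{steps[-1]}")

# Example measurement: next-token loss on a fixed prompt at a few checkpoints.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
inputs = tokenizer("The quick brown fox", return_tensors="pt")
for step in (0, 1000, 143000):
    model = GPTNeoXForCausalLM.from_pretrained(repo_id, revision=f"step{step}")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"step{step}: loss={loss.item():.3f}")
```

Swapping `repo_id` across the seed variants (e.g. `pythia-70m-seed1` through `pythia-70m-seed9`) gives the corresponding across-seed comparison.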
## Training Data

All models were trained on The Pile using pre-shuffled data orderings. The shuffled index files for each seed are available at **[EleutherAI/pile-preshuffled-seeds](https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds)**.

This dataset contains `.idx` files for seeds 0-9, used with `MMapIndexedDataset` to load the memory-mapped Pile data in the correct order for each seed.

### Reproducing Training Data Order

To reproduce the exact data ordering used for a specific seed:

1. Download the Pile dataset and tokenize it using the Pythia tokenizer.

2. Download the corresponding seed folder from `pile-preshuffled-seeds`:

   ```python
   # Using huggingface_hub
   from huggingface_hub import snapshot_download

   snapshot_download(
       repo_id="EleutherAI/pile-preshuffled-seeds",
       repo_type="dataset",
       allow_patterns="seed3/*",  # download only seed3
       local_dir="./pile-seeds",
   )
   ```

3. Use the `.idx` files with GPT-NeoX's `MMapIndexedDataset`:

   ```python
   # MMapIndexedDataset is provided by the GPT-NeoX repository (megatron/data/indexed_dataset.py)
   from megatron.data.indexed_dataset import MMapIndexedDataset

   # path_prefix is the shared prefix of the tokenized Pile's .bin/.idx file pair
   dataset = MMapIndexedDataset(path_prefix, skip_warmup=True)
   ```

For complete training reproduction instructions, see the [Pythia GitHub repository](https://github.com/EleutherAI/pythia).

## All PolyPythias Models

The complete collection is available at [EleutherAI/polypythias](https://huggingface.co/collections/EleutherAI/polypythias).

### 14M Parameter Models

- [pythia-14m-seed1](https://huggingface.co/EleutherAI/pythia-14m-seed1) through [pythia-14m-seed9](https://huggingface.co/EleutherAI/pythia-14m-seed9)

### 31M Parameter Models

- [pythia-31m-seed1](https://huggingface.co/EleutherAI/pythia-31m-seed1) through [pythia-31m-seed9](https://huggingface.co/EleutherAI/pythia-31m-seed9)

### 70M Parameter Models

- [pythia-70m-seed1](https://huggingface.co/EleutherAI/pythia-70m-seed1) through [pythia-70m-seed9](https://huggingface.co/EleutherAI/pythia-70m-seed9)

### 160M Parameter Models

- [pythia-160m-seed1](https://huggingface.co/EleutherAI/pythia-160m-seed1) through [pythia-160m-seed9](https://huggingface.co/EleutherAI/pythia-160m-seed9)
- [pythia-160m-data-seed1](https://huggingface.co/EleutherAI/pythia-160m-data-seed1) through [pythia-160m-data-seed3](https://huggingface.co/EleutherAI/pythia-160m-data-seed3)
- [pythia-160m-weight-seed1](https://huggingface.co/EleutherAI/pythia-160m-weight-seed1) through [pythia-160m-weight-seed3](https://huggingface.co/EleutherAI/pythia-160m-weight-seed3)

### 410M Parameter Models

- [pythia-410m-seed1](https://huggingface.co/EleutherAI/pythia-410m-seed1) through [pythia-410m-seed9](https://huggingface.co/EleutherAI/pythia-410m-seed9)

## Evaluation Results

Evaluation results for all models are available in the [polypythias-evals](https://huggingface.co/datasets/EleutherAI/polypythias-evals) dataset.

## Limitations

These models are released for research purposes only. They are **not** intended for deployment in production systems.

- **Not instruction-tuned**: These are base language models that predict the next token; they will not follow instructions the way chat-tuned assistants such as ChatGPT do
- **May generate harmful content**: The Pile contains diverse internet text that includes biased, offensive, and factually incorrect content
- **English only**: Models were trained primarily on English text
- **No safety filtering**: Outputs are not filtered for safety or accuracy

## License

Apache 2.0

## Contact

For questions about these models, please use:

- [EleutherAI Discord](https://discord.gg/eleutherai) - #release-discussion channel
- [GitHub Issues](https://github.com/EleutherAI/pythia/issues)

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vanderwal2025polypythias,
  title={PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs},
  author={van der Wal, Oskar and Lesci, Pietro and Muller-Eberstein, Max and Saphra, Naomi and Schoelkopf, Hailey and Zuidema, Willem and Biderman, Stella},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://arxiv.org/abs/2503.09543}
}
```