#
[Research] Linguistic RL: 3B Models Exceed 100B Performance Through Self-Reflection
We’ve discovered something surprising: **small models can exceed large model performance by learning from their reasoning process**.
## The Breakthrough
Using **Linguistic Reinforcement Learning** (LRL) + LoRA, we compressed Claude 3.5 Haiku’s (100B) scheduling algorithm into Qwen models - and the students exceeded the teacher:
| Model | Size | Baseline | After LRL+LoRA | vs Teacher |
|-------|------|----------|----------------|------------|
| **Qwen2.5-3B** | 3B | 12% | **86.0%**
| **+2.0pp better** |
| **Qwen2.5-1.5B** | 1.5B | ~8% | **82.7%** | **+1.4pp better** |
| Claude 3.5 Haiku | ~100B | 81.3% | 84.0% (teacher) | baseline |
**Key insight**: The students learned an O(n log n) sweep line algorithm from natural language instruction and outperformed their 67Ă— larger teacher!
## How It Works
**Stage 1: Teacher Self-Improvement**
Claude solves problems and reflects on mistakes through “journaling”:
```
“I need to check ALL overlaps, not just adjacent meetings…”
```
Accuracy improves from 81% → 84% through pure self-reflection (no gradient updates!).
**Stage 2: Knowledge Extraction**
Extract the learned strategy as natural language curriculum.
**Stage 3: Student Training**
Fine-tune small model (3B/1.5B) with LoRA on teacher’s reasoning process.
**Result**: 3B model achieves 96% on easy problems, 88% on medium, 74% on hard.
## Why This Matters
**Economic Impact**
- Training cost: <$10 in API calls
- Deployment: Free, runs locally forever
- 100-1000Ă— cost reduction vs. API calls
**Technical Achievement**
- 67× compression ratio (100B → 1.5B)
- Students exceed teacher performance
- Learned algorithmic reasoning (not pattern matching)
**Interpretability**
- Human-readable strategy evolution
- Auditable learning process
- No black-box distillation
## Validated Results
Fully reproducible with fixed seeds
Complete experiment logs included
150-problem test set validation
Published to Zenodo with DOI: [10.5281/zenodo.17585532]( Algorithmic Capability Extraction at Extreme Compression: A 1.5B Parameter Model Matches Frontier Performance )
## Code & Framework
We’ve released a **universal framework** - adapt it to ANY domain:
```python
from framework import run_knowledge_transfer
results = run_knowledge_transfer(
domain=YourCustomDomain(),
teacher_model=“claude-3-5-haiku-20241022”,
student_model=“Qwen/Qwen2.5-3B-Instruct”
)
```
**GitHub**: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments
## Try It Yourself
```bash
# Clone repo
git clone https://github.com/DRawson5570/linguistic-rl-scheduling-experiments
# Install
pip install transformers torch peft anthropic
# Run validated experiment
cd validated_results_qwen3b_claude35haiku
python run_validation.py
```
Requirements: 12GB GPU, Anthropic API key (~$5 to reproduce)
## Research Questions
This opens fascinating questions:
1. **Compression limits**: How small can we go?
2. **Knowledge transfer**: What kinds of reasoning compress well?
3. **Emergent capabilities**: What appears when models self-reflect?
4. **Safety**: Can we audit and verify learned strategies?
## Links
-
Paper (Zenodo): Algorithmic Capability Extraction at Extreme Compression: A 1.5B Parameter Model Matches Frontier Performance
-
Code: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments
-
3B Results: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments/tree/main/validated_results_qwen3b_claude35haiku
-
1.5B Results: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments/tree/main/validated_results_qwen1.5b_claude35haiku
-–
Would love to hear thoughts from the community! Has anyone tried similar approaches? What domains would benefit from this technique?