🚀 Linguistic RL: 3B Models Exceed 100B Performance (86% vs 81%)

# :rocket: [Research] Linguistic RL: 3B Models Exceed 100B Performance Through Self-Reflection

We’ve discovered something surprising: **small models can exceed large model performance by learning from their reasoning process**.

## The Breakthrough

Using **Linguistic Reinforcement Learning** (LRL) + LoRA, we compressed Claude 3.5 Haiku’s (100B) scheduling algorithm into Qwen models - and the students exceeded the teacher:

| Model | Size | Baseline | After LRL+LoRA | vs Teacher |

|-------|------|----------|----------------|------------|

| **Qwen2.5-3B** | 3B | 12% | **86.0%** :sparkles: | **+2.0pp better** |

| **Qwen2.5-1.5B** | 1.5B | ~8% | **82.7%** | **+1.4pp better** |

| Claude 3.5 Haiku | ~100B | 81.3% | 84.0% (teacher) | baseline |

**Key insight**: The students learned an O(n log n) sweep line algorithm from natural language instruction and outperformed their 67Ă— larger teacher!

## How It Works

**Stage 1: Teacher Self-Improvement**

Claude solves problems and reflects on mistakes through “journaling”:

```

“I need to check ALL overlaps, not just adjacent meetings…”

```

Accuracy improves from 81% → 84% through pure self-reflection (no gradient updates!).

**Stage 2: Knowledge Extraction**

Extract the learned strategy as natural language curriculum.

**Stage 3: Student Training**

Fine-tune small model (3B/1.5B) with LoRA on teacher’s reasoning process.

**Result**: 3B model achieves 96% on easy problems, 88% on medium, 74% on hard.

## Why This Matters

:bullseye: **Economic Impact**

- Training cost: <$10 in API calls

- Deployment: Free, runs locally forever

- 100-1000Ă— cost reduction vs. API calls

:brain: **Technical Achievement**

- 67× compression ratio (100B → 1.5B)

- Students exceed teacher performance

- Learned algorithmic reasoning (not pattern matching)

:magnifying_glass_tilted_left: **Interpretability**

- Human-readable strategy evolution

- Auditable learning process

- No black-box distillation

## Validated Results

:white_check_mark: Fully reproducible with fixed seeds

:white_check_mark: Complete experiment logs included

:white_check_mark: 150-problem test set validation

:white_check_mark: Published to Zenodo with DOI: [10.5281/zenodo.17585532]( Algorithmic Capability Extraction at Extreme Compression: A 1.5B Parameter Model Matches Frontier Performance )

## Code & Framework

We’ve released a **universal framework** - adapt it to ANY domain:

```python

from framework import run_knowledge_transfer

results = run_knowledge_transfer(

domain=YourCustomDomain(),

teacher_model=“claude-3-5-haiku-20241022”,

student_model=“Qwen/Qwen2.5-3B-Instruct”

)

```

**GitHub**: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments

## Try It Yourself

```bash

# Clone repo

git clone https://github.com/DRawson5570/linguistic-rl-scheduling-experiments

# Install

pip install transformers torch peft anthropic

# Run validated experiment

cd validated_results_qwen3b_claude35haiku

python run_validation.py

```

Requirements: 12GB GPU, Anthropic API key (~$5 to reproduce)

## Research Questions

This opens fascinating questions:

1. **Compression limits**: How small can we go?

2. **Knowledge transfer**: What kinds of reasoning compress well?

3. **Emergent capabilities**: What appears when models self-reflect?

4. **Safety**: Can we audit and verify learned strategies?

## Links

- :page_facing_up: Paper (Zenodo): Algorithmic Capability Extraction at Extreme Compression: A 1.5B Parameter Model Matches Frontier Performance

- :laptop: Code: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments

- :bar_chart: 3B Results: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments/tree/main/validated_results_qwen3b_claude35haiku

- :bar_chart: 1.5B Results: https://github.com/DRawson5570/linguistic-rl-scheduling-experiments/tree/main/validated_results_qwen1.5b_claude35haiku

-–

Would love to hear thoughts from the community! Has anyone tried similar approaches? What domains would benefit from this technique?

1 Like