Most agent debugging today is theatre: “trust me, it’s the same prompt” and “it worked on my run.” In production, runs diverge: sampling jitter, tool timing, memory writes, hidden state, flaky endpoints, and plain old nondeterminism. Benchmarks tell you that you failed. A single log tells you what happened once. What you actually need is a diff: where did the timelines first split, and what changed?
I built TimelineDiff (Differential Reproducibility) to do exactly that. Upload two DRP trace bundles (.zip) and it will:
• Align both timelines event-by-event
• Identify the first divergence step (the moment reality splits)
• Show the delta: missing events, changed tool outputs, memory mutations, control-flow differences
• Export a shareable evidence pack (so you can stop arguing and start fixing)
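The core idea behind the first two bullets can be sketched in a few lines. This is a minimal illustration, not TimelineDiff’s actual implementation: the event dicts, the `type`/`tool`/`output` keys, and the `first_divergence` helper are all hypothetical stand-ins for whatever the DRP bundle format really contains.

```python
from itertools import zip_longest

def first_divergence(run_a, run_b, keys=("type", "tool", "output")):
    """Walk two event timelines in lockstep and report where they first split.

    Returns (index, delta): delta maps each differing key to its
    (run_a, run_b) pair, or flags a missing event if one run is shorter.
    Returns (None, None) if the runs agree event-by-event.
    NOTE: the event schema here is a hypothetical example, not DRP's.
    """
    for i, (ea, eb) in enumerate(zip_longest(run_a, run_b)):
        if ea is None or eb is None:
            # One timeline ended early: the shorter run is missing this event.
            return i, {"missing_event_in": "run_a" if ea is None else "run_b"}
        changed = {k: (ea.get(k), eb.get(k)) for k in keys if ea.get(k) != eb.get(k)}
        if changed:
            return i, changed
    return None, None

# Two "identical" sessions that quietly diverge on a memory write:
run_a = [{"type": "tool_call", "tool": "search", "output": "ok"},
         {"type": "memory_write", "tool": None, "output": "cached"}]
run_b = [{"type": "tool_call", "tool": "search", "output": "ok"},
         {"type": "memory_write", "tool": None, "output": "stale"}]

idx, delta = first_divergence(run_a, run_b)
# idx == 1, delta == {"output": ("cached", "stale")}
```

Real traces need fuzzier alignment than strict lockstep (reordered or interleaved events), but even this naive version makes the point: the interesting artifact is the index and delta, not either log on its own.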
Space: TimelineDiff Differential Reproducibility - a Hugging Face Space by RFTSystems
If you’re shipping agents, eval tooling, or anything that relies on “reproducible” behaviour: run TimelineDiff on two sessions you swear are the same. You’ll find the split fast, and you’ll have receipts you can hand to a teammate, a reviewer, or a client.
RFTSystems, Liam
⸻
#reproducibility #mlops #observability #agenticAI #aiSafety #evals #debugging #forensics #traceability #llm #RFTSystems #TrustStack #verification