Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces • Paper • 2601.11868 • Published 7 days ago • 17
mlx-community/XortronCriminalComputingConfig-mlx-8Bit • Text Generation • Updated Jun 19, 2025 • 17 • 3
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development • Paper • 2601.11077 • Published 8 days ago • 62
Deriving Character Logic from Storyline as Codified Decision Trees • Paper • 2601.10080 • Published 9 days ago • 6
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors • Paper • 2601.07226 • Published 12 days ago • 30
OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs • Paper • 2601.01592 • Published 19 days ago • 12