DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
Abstract
DSGym is a standardized framework for evaluating and training data science agents, combining comprehensive task suites with execution-verified training capabilities.
Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short: fragmented evaluation interfaces make cross-benchmark comparison difficult, task coverage is narrow, and rigorous data grounding is lacking. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio, expert-derived bioinformatics tasks grounded in the literature, and (2) DSPredict, challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
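To make the "modular architecture" claim concrete, the following is a minimal Python sketch of what a task registry and agent-evaluation loop in a DSGym-style harness could look like. The names (`Task`, `register_task`, `run_agent`) and their signatures are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of a DSGym-style modular interface.
# `Task`, `register_task`, and `run_agent` are illustrative assumptions,
# not the framework's documented API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    """A self-contained data analysis task: data files, a prompt, and a grader."""
    name: str
    data_files: List[str]
    prompt: str
    grade: Callable[[str], float]  # maps an agent's final answer to a score

TASK_REGISTRY: Dict[str, Task] = {}

def register_task(task: Task) -> None:
    """Add a task to the suite; new benchmarks plug in without touching the harness."""
    TASK_REGISTRY[task.name] = task

def run_agent(agent: Callable[[Task], str], task_name: str) -> float:
    """Run an agent scaffold on one task in its own environment and score the result."""
    task = TASK_REGISTRY[task_name]
    answer = agent(task)       # the scaffold plans, writes, and executes code over task.data_files
    return task.grade(answer)  # scoring against ground truth, verified by execution
```

Under this kind of interface, adding a new benchmark amounts to registering its tasks, and swapping agent scaffolds or tools only changes the callable passed to `run_agent`.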
Community
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper.
- DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems (2026)
- SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence (2025)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories (2025)
- DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows (2025)
- HeurekaBench: A Benchmarking Framework for AI Co-scientist (2026)
- LongDA: Benchmarking LLM Agents for Long-Document Data Analysis (2026)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development (2026)