The Dialogue SWE-Bench setting for evaluating coding agents through dialogue. The user never interacts with code directly and performs real-world software engineering tasks through dialogue. Coding agents operate in both the dialogue and repository environments to achieve repository-level software engineering tasks. We evaluate both patch correctness (Task Resolution) and dialogue quality.
Abstract
AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.
[Figure: Coding Capability and Dialogue Capability]
Performance on SWE-Bench Verified (Chowdhury et al., 2024) vs. Dialogue SWE-Bench task resolution and our evaluation of Dialogue Quality. For SWE-Bench Verified, we used reported performance of coding agents using OpenHands (Wang et al., 2024), with the exception of Devstral 2, which reports a score with a custom agent (Rastogi et al., 2025). We find that Dialogue SWE-Bench is more challenging for coding models than SWE-Bench Verified (Chowdhury et al., 2024), and that stronger coding performance on SWE-Bench Verified does not necessarily translate to better dialogue performance on Dialogue SWE-Bench. In addition, Dialogue Quality, measured using the sum of ratings for dialogue Naturalness and Coherence (Kazi et al., 2024), is not perfectly correlated with task resolution performance, suggesting that task resolution alone is not sufficient for evaluating the usability of a coding agent.
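As a rough illustration of the two reported quantities, the sketch below computes a task resolution rate and a mean Dialogue Quality score, where Dialogue Quality is the sum of the Naturalness and Coherence ratings as described above. The `DialogueResult` record, its field names, the rating values, and the choice to average across dialogues are all assumptions made for illustration; they are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DialogueResult:
    """Hypothetical per-task record; field names are illustrative only."""
    resolved: bool        # did the final patch pass the task's tests?
    naturalness: float    # rating of dialogue Naturalness (scale assumed)
    coherence: float      # rating of dialogue Coherence (scale assumed)


def dialogue_quality(result: DialogueResult) -> float:
    # Dialogue Quality as the sum of the Naturalness and Coherence ratings.
    return result.naturalness + result.coherence


def summarize(results: list[DialogueResult]) -> dict[str, float]:
    # Averaging over dialogues is an assumption about how scores are aggregated.
    if not results:
        raise ValueError("need at least one result to summarize")
    return {
        "task_resolution": mean(1.0 if r.resolved else 0.0 for r in results),
        "dialogue_quality": mean(dialogue_quality(r) for r in results),
    }


if __name__ == "__main__":
    # Made-up example ratings, not benchmark data.
    demo = [DialogueResult(True, 4.0, 5.0), DialogueResult(False, 3.0, 2.0)]
    print(summarize(demo))
```

Keeping the two numbers separate reflects the point of the caption: an agent can score well on one and poorly on the other, so neither is collapsed into a single combined score here.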
Leaderboard
Leaderboard of coding agents on Dialogue SWE-Bench. ^ indicates results from the experiments in our paper, which considers three agents: OpenHands (Wang et al., 2024), OpenHands Interactive (Vijayvargiya et al., 2026), and our proposed Schema-Guided agent. To add a result, follow the instructions here.
BibTeX citation
@misc{king2026dialogueswebenchbenchmarkdialoguedriven,
  title={Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents},
  author={Brendan King and Jeffrey Flanigan},
  year={2026},
  eprint={2606.13995},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.13995},
}