Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

University of California, Santa Cruz

The Dialogue SWE-Bench setting for evaluating coding agents through dialogue. The user never interacts with code directly and instead carries out real-world software engineering tasks through dialogue with the agent. Coding agents operate in both the dialogue and repository environments to complete repository-level software engineering tasks. We evaluate both patch correctness (Task Resolution) and dialogue quality.

Abstract

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

Coding Capability $\neq$ Dialogue Capability

Figure: A parallel axes plot showing the performance of different coding agents across multiple evaluation metrics. Performance on SWE-Bench Verified (Chowdhury et al., 2024) vs. Dialogue SWE-Bench task resolution and our evaluation of Dialogue Quality. For SWE-Bench Verified, we used the reported performance of coding agents using OpenHands (Wang et al., 2024), with the exception of Devstral 2, which reports a score with a custom agent (Rastogi et al., 2025). We find that **Dialogue SWE-Bench** is more challenging for coding models than SWE-Bench Verified, and that stronger coding performance on SWE-Bench Verified does not necessarily translate to better dialogue performance on Dialogue SWE-Bench. In addition, Dialogue Quality, measured as the sum of ratings for dialogue Naturalness and Coherence (Kazi et al., 2024), is not perfectly correlated with task resolution performance, suggesting that task resolution alone is not sufficient for evaluating the usability of a coding agent.
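The relationship described above can be illustrated with a minimal sketch: Dialogue Quality is the sum of a Naturalness rating and a Coherence rating, and its agreement with task resolution can be checked with a correlation coefficient. The function names, the agent names, and every number below are hypothetical and chosen only for illustration; they are not results from the paper, and the rating scale is an assumption. Pearson correlation is used here as one plausible choice, not necessarily the analysis the authors performed.

```python
from statistics import mean


def dialogue_quality(naturalness: float, coherence: float) -> float:
    """Dialogue Quality as the sum of the two ratings (per the caption above)."""
    return naturalness + coherence


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical agents with made-up scores (NOT data from the paper).
agents = {
    "agent_a": {"resolved": 0.40, "naturalness": 3.0, "coherence": 4.0},
    "agent_b": {"resolved": 0.35, "naturalness": 4.5, "coherence": 4.0},
    "agent_c": {"resolved": 0.25, "naturalness": 3.5, "coherence": 3.0},
}

resolution = [a["resolved"] for a in agents.values()]
quality = [dialogue_quality(a["naturalness"], a["coherence"]) for a in agents.values()]

# A value well below 1.0 would indicate that ranking agents by task
# resolution and ranking them by dialogue quality can disagree.
correlation = pearson(resolution, quality)
print(f"Pearson r between task resolution and dialogue quality: {correlation:.2f}")
```

With only a handful of agents, any such coefficient is a rough indicator at best; a rank correlation over more agents would be a reasonable alternative.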

Leaderboard

Leaderboard of coding agents on Dialogue SWE-Bench. ^ indicates results from the experiments in our paper, which considers three agents: OpenHands (Wang et al., 2024), OpenHands Interactive (Vijayvargiya et al., 2026), and our proposed Schema-Guided agent. To add a result, follow the instructions here.

BibTeX citation

@misc{king2026dialogueswebenchbenchmarkdialoguedriven,
      title={Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents},
      author={Brendan King and Jeffrey Flanigan},
      year={2026},
      eprint={2606.13995},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.13995},
}

Bibliography

Chowdhury, N., Aung, J., Jun Shern, C., Jaffe, O., Sherburn, D., Starace, G., Mays, E., Dias, R., Aljubeh, M., Glaese, M., Jimenez, C. E., Yang, J., Ho, L., Patwardhan, T., Liu, K., & Madry, A. (2024). Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/

Kazi, T., Lyu, R., Zhou, S., Hakkani-Tur, D., & Tur, G. (2024). Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems. arXiv. https://doi.org/10.48550/arXiv.2411.09972

Rastogi, A., Yang, A., Jiang, A. Q., Liu, A. H., Sablayrolles, A., Héliou, A., Martin, A., Agarwal, A., Ehrenberg, A., Lo, A., Roux, A., Darcet, A., Mensch, A., Bout, B., Rozière, B., Monicault, B. D., Bamford, C., Wallenwein, C., Renaudin, C., … Tang, Y. (2025). Devstral: Fine-tuning Language Models for Coding Agent Applications. https://arxiv.org/abs/2509.25193

Vijayvargiya, S., Zhou, X., Yerukola, A., Sap, M., & Neubig, G. (2026). Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering. The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=X2yzXtH4wp

Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., Pan, J., Song, Y., Li, B., Singh, J., Tran, H. H., Li, F., Ma, R., Zheng, M., Qian, B., Shao, Y., Muennighoff, N., Zhang, Y., Hui, B., … Neubig, G. (2024). OpenHands: An Open Platform for AI Software Developers as Generalist Agents. arXiv. https://doi.org/10.48550/arXiv.2407.16741
