Theory 2 – Sebastian Schuster: “Can coding agents autonomously implement NLP research extensions?”

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress in developing agents that can perform parts of the research pipeline in machine learning, including NLP. However, the ability of these agents to reliably produce code that yields accurate research results has not yet been adequately assessed. In my talk, I will introduce a new benchmark called “REXBench” that evaluates the ability of LLM-based coding agents to autonomously implement novel research extensions. I will argue that research extensions are an ideal testing ground for evaluating such agents and explain how our benchmark circumvents common data contamination issues. I will also present results from evaluating thirteen recent LLM-based agents and discuss their implications for using LLM agents to write research code for NLP and machine learning.


Schedule
24 November 2025
16:15 - 16:45