Recent advancements in LM agents have shown promising potential for automating intricate real-world tasks. These agents typically operate by proposing and executing actions through APIs, supporting applications such as software engineering, robotics, and scientific experimentation. As these tasks become more complex, LM agent frameworks have evolved to include multiple agents, multi-step retrieval, and tailored scaffolding to optimize performance. A central challenge lies in effectively exploring and understanding the environment, which has prompted the development of engineered scaffolds using tools, memory mechanisms, and custom pipelines. However, most existing methods assume partial observability, requiring agents to collect observations incrementally. While this assumption holds in dynamic or unfamiliar environments, it is less applicable in fully observable settings like SWE-bench, where all relevant information is accessible from the start.
In software engineering, research on LM agents has focused on two main strategies: agent-based frameworks and structured pipelines. Agent-based systems, such as SWE-Agent and OpenHands CodeAct, let LMs interact autonomously with codebases, often through custom interfaces and retrieval tools. Other frameworks like Moatless and AutoCodeRover enhance localization through search techniques, while SpecRover refines scaffolding design. Alternatively, structured pipelines, such as Agentless and CodeMonkey, decompose tasks into sequential phases like localization, repair, and validation. While these approaches depend on engineered components for performance, the current study proposes leveraging Long-Context LMs (LCLMs) to directly interpret the entire task environment. Advances in LCLM architecture and infrastructure now allow these models to outperform retrieval-augmented systems in many contexts, reducing reliance on complex external scaffolding.
Researchers from Stanford, IBM, and the University of Toronto explored whether complex scaffolding is necessary for LM agents tackling tasks like SWE-bench. They show that simply using LCLMs, such as Gemini-1.5-Pro, with appropriate prompting and no scaffolding, can achieve competitive performance, reaching 38% on SWE-Bench-Verified. Gemini-2.5-Pro, using the same simple setup, reaches 50.8%. Their work suggests that many complex agentic designs could be replaced with a single powerful LCLM, simplifying architecture and training. Additionally, a hybrid two-stage approach using Gemini-1.5-Pro and Claude-3.7 achieves a 48.6% solve rate, further supporting this simplified direction.
Traditional LM agents rely on interactive exploration due to partial observability, but many tasks, like software debugging, permit full observability. The study proposes state-in-context agents that leverage LCLMs to directly process full or compressed environment states, bypassing the need for complex agentic scaffolding. For large codebases, ranking-based compression selects relevant files to fit within context limits. Two methods are introduced: DIRECTSOLVE, where LCLMs solve tasks using the full context; and SELECTSOLVE, where LCLMs localize relevant files for short-context LMs (SCLMs) to solve. Both use targeted patch formats and validation to ensure accuracy and reduce hallucination.
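A minimal sketch of how these two methods might fit together is shown below. The function names (`compress_codebase`, `direct_solve`, `select_solve`), the roughly four-characters-per-token heuristic, and the model-call interfaces are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of state-in-context solving under assumed interfaces.
# `lclm` and `sclm` stand in for long- and short-context model calls;
# all names here are illustrative, not the paper's implementation.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token.
    return len(text) // 4

def compress_codebase(files: dict[str, str], scores: dict[str, float],
                      budget: int) -> list[tuple[str, str]]:
    """Ranking-based compression: greedily pack the highest-ranked
    files until the context budget is exhausted."""
    packed, used = [], 0
    for path in sorted(files, key=lambda p: scores.get(p, 0.0), reverse=True):
        cost = estimate_tokens(files[path])
        if used + cost <= budget:
            packed.append((path, files[path]))
            used += cost
    return packed

def direct_solve(issue: str, files: dict, scores: dict, lclm,
                 budget: int = 900_000) -> str:
    # DIRECTSOLVE: one LCLM call over the (possibly compressed) repo state.
    context = "\n\n".join(f"### File: {p}\n{src}"
                          for p, src in compress_codebase(files, scores, budget))
    return lclm(f"{context}\n\n### Issue\n{issue}\n\nProduce a patch.")

def select_solve(issue: str, files: dict, scores: dict, lclm, sclm,
                 budget: int = 900_000, top_k: int = 5) -> str:
    # SELECTSOLVE: the LCLM localizes relevant files; a short-context
    # model then writes the patch from that smaller subset.
    context = "\n\n".join(f"### File: {p}\n{src}"
                          for p, src in compress_codebase(files, scores, budget))
    ranked = lclm(f"{context}\n\n### Issue\n{issue}\n\n"
                  f"Return the {top_k} most relevant file paths, one per line.")
    subset = [p for p in ranked.splitlines() if p in files][:top_k]
    patch_context = "\n\n".join(f"### File: {p}\n{files[p]}" for p in subset)
    return sclm(f"{patch_context}\n\n### Issue\n{issue}\n\nProduce a patch.")
```

The key design choice in both variants is that interactive exploration is replaced by a single ranking-and-packing pass: the agent never issues incremental observation actions.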
The experiments evaluate this simplified agent framework on the SWE-bench Verified benchmark, which includes 500 real-world software engineering tasks. The proposed methods, DIRECTSOLVE and SELECTSOLVE, use LCLMs like Gemini-1.5-Pro and Gemini-2.5-Pro, and in SELECTSOLVE, an additional SCLM (Claude-3.7-Sonnet) for patch generation. Results show that DIRECTSOLVE outperforms complex agentic approaches like Agentless and CodeAct with minimal engineering. SELECTSOLVE further improves accuracy by leveraging stronger models for patching. Ablation studies highlight the importance of CoT prompting, code restatement, and token-efficient context design. Additionally, positioning relevant files at the start of the prompt improves performance, underscoring limitations in long-context processing.
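These prompt-design findings lend themselves to a short illustration. The sketch below uses a hypothetical `build_prompt` helper; the instruction wording is invented, but the ordering (relevant files first, then the issue, then a restate-and-reason directive) follows the ablations described above.

```python
# Hypothetical prompt assembly reflecting the ablations: relevant files
# go at the START of the prompt, and the model is asked to restate the
# target code and reason step by step (CoT) before emitting a patch.
# The exact instruction wording is an assumption.

def build_prompt(ranked_files: list[tuple[str, str]], issue: str) -> str:
    parts = []
    # Finding: placing the most relevant files first improves performance.
    for path, source in ranked_files:
        parts.append(f"### File: {path}\n{source}")
    parts.append(f"### Issue\n{issue}")
    # Code restatement plus chain-of-thought before the final answer.
    parts.append(
        "First restate the code sections relevant to the issue, then "
        "reason step by step about the root cause, and finally output "
        "a patch in unified diff format."
    )
    return "\n\n".join(parts)
```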
In conclusion, the cost of LCLM-based methods is currently higher than that of existing approaches like Agentless and CodeAct, averaging $2.60 per instance compared to $0.25 and $0.87, respectively. However, rapid drops in inference costs and increasing context lengths are making LCLMs more practical. Techniques like KV caching significantly lower costs after initial runs, reducing the figure to about $0.725. Although slight codebase changes still limit caching benefits, further improvements could help. The study also suggests that LCLMs can handle long interaction histories, reducing the need for complex memory and retrieval mechanisms. Notably, unscaffolded LCLMs can perform competitively on SWE-bench tasks.
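For a side-by-side view of the per-instance figures quoted above, a quick comparison script; the dollar amounts come from the article, while the labels and ratios are just for display.

```python
# Per-instance costs as reported in the article, normalized to Agentless.
COST_PER_INSTANCE = {
    "Agentless": 0.25,
    "CodeAct": 0.87,
    "LCLM, no caching": 2.60,
    "LCLM, KV-cached": 0.725,
}

for method, cost in COST_PER_INSTANCE.items():
    ratio = cost / COST_PER_INSTANCE["Agentless"]
    print(f"{method:<18} ${cost:.3f}/instance  ({ratio:.1f}x Agentless)")
```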
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.