Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization


Equipping LLMs with external tools or functions has become popular, showing impressive performance across diverse domains. Existing research depends on synthesizing large volumes of tool-use trajectories through advanced language models and SFT to enhance LLMs' tool-calling capability. The critical limitation lies in the synthetic datasets' inability to capture explicit reasoning steps, resulting in superficial tool-call training. In many cases, reasoning is either entirely omitted during training or deferred to inference through prompting techniques. This results in pseudo-reasoning: models simply learn to mimic surface-level patterns without truly understanding the underlying decision-making process.

Existing research explores multiple approaches to enhance LLMs' tool-use capabilities. Previous methods have focused on two key strategies for improving tool learning. The first approach concentrated on dataset curation and model refinement, involving the creation of large-scale supervised datasets and the application of advanced training techniques such as SFT and DPO. LLMs are combined with various external tools, including search engines, calculators, vision tools, and Python interpreters, to expand their functional capabilities. The second approach targeted reasoning improvement, shifting from traditional train-time scaling to more complex test-time scaling strategies. Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.

Researchers from NVIDIA, Pennsylvania State University, and the University of Washington have proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods. It diverges from traditional SFT and reasoning-trace distillation techniques by implementing a unique RL paradigm. Drawing inspiration from DeepSeek-R1's success, a lightweight supervision method has been developed that focuses only on the structural validity and functional correctness of tool invocations. The Nemotron-Research-Tool-N1 models employ a binary reward that enables the model to autonomously develop reasoning strategies without relying on explicitly annotated reasoning trajectories.
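
To make this training signal concrete, here is a minimal Python sketch of such a binary, rule-based reward, under the assumption that a valid completion places its reasoning inside <think> tags and a JSON tool call inside <tool_call> tags; the names binary_reward and gold_call are illustrative assumptions, not the authors' implementation.

```python
import json
import re

# Hypothetical sketch of an R1-style binary reward for tool calling.
THINK_CALL_PATTERN = re.compile(
    r"<think>(.+?)</think>\s*<tool_call>(.+?)</tool_call>", re.DOTALL
)

def binary_reward(completion: str, gold_call: dict) -> float:
    """Grant 1.0 only if the output is structurally valid AND functionally correct."""
    match = THINK_CALL_PATTERN.search(completion)
    if match is None:
        return 0.0  # structural check failed: missing or malformed tags
    try:
        predicted_call = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0  # tool-call section is not valid JSON
    # Functional check: predicted call must match the ground-truth call exactly.
    return 1.0 if predicted_call == gold_call else 0.0

# Example: a well-formed, correct completion earns reward 1.0.
completion = (
    '<think>The user asked for weather, so call get_weather.</think>'
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Seattle"}}</tool_call>'
)
gold = {"name": "get_weather", "arguments": {"city": "Seattle"}}
print(binary_reward(completion, gold))  # -> 1.0
```

Because the reward checks only structure and final correctness, the model is free to discover its own reasoning style inside the <think> span rather than imitating distilled traces.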

The researchers unify and preprocess data from existing tool-calling datasets, xLAM and a subset of ToolACE, which provide single-turn and multi-turn synthetic tool-calling trajectories. A lightweight prompting template is created to guide tool-call generation, featuring explicit instructions for intermediate reasoning within <think>…</think> tags and tool invocations enclosed in <tool_call>…</tool_call>. The template helps minimize rigid formatting constraints and reduce the risk of overfitting to specific prompt patterns. The primary backbone model used is Qwen2.5-7B/14B-Instruct, and to evaluate the generalization ability of the proposed method, evaluations are also performed on alternative backbone models, including multiple variants from the LLaMA family.
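
As an illustration of how such a lightweight template might be assembled, here is a hypothetical Python sketch; the wording of SYSTEM_TEMPLATE and the helper build_prompt are assumptions for demonstration, not the paper's verbatim prompt.

```python
import json

# Hypothetical lightweight tool-calling template; wording is an assumption.
SYSTEM_TEMPLATE = (
    "You are a helpful assistant with access to these tools:\n"
    "{tool_schemas}\n\n"
    "First reason step by step inside <think>...</think> tags, then output "
    "the chosen call as JSON inside <tool_call>...</tool_call> tags."
)

def build_prompt(tool_schemas: list, user_query: str) -> list:
    """Assemble a chat-style prompt pairing the template with a user query."""
    return [
        {"role": "system",
         "content": SYSTEM_TEMPLATE.format(
             tool_schemas=json.dumps(tool_schemas, indent=2))},
        {"role": "user", "content": user_query},
    ]

# Example usage with a toy weather tool:
tools = [{"name": "get_weather",
          "parameters": {"city": {"type": "string"}}}]
print(build_prompt(tools, "What's the weather in Seattle?")[0]["content"])
```

Keeping the template this sparse matters: the fewer hard-coded formatting rules the model sees, the less likely it is to overfit to one prompt pattern instead of learning the underlying tool-use behavior.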

Results on the BFCL and API-Bank benchmarks demonstrate the Nemotron-Research-Tool-N1 models' superior performance. On the BFCL benchmark, the Tool-N1-7B/14B models outperform closed-source models like GPT-4o and specialized fine-tuned models such as xLAM-2-70B and ToolACE-8B. The models also surpass SFT baselines trained on identical data sources, highlighting the effectiveness of the R1-style RL approach. Further, the API-Bank benchmark validates these findings, with Tool-N1-7B/14B achieving 4.12% and 5.03% higher accuracy than GPT-4o. These results demonstrate the potential of the proposed method to enhance large language models' tool-calling capabilities through a novel reinforcement learning paradigm.

In conclusion, the researchers introduced Nemotron-Research-Tool-N1, a significant advancement in LLM tool-use capabilities. The work marks a paradigm shift from traditional SFT methodologies by introducing a novel rule-based RL approach. The proposed method enables models to develop sophisticated reasoning strategies without relying on explicitly annotated reasoning trajectories. Benchmark evaluations across BFCL and API-Bank consistently validate the approach's effectiveness, showing significant performance improvements over existing baselines. The findings open new avenues for developing more adaptable and intelligent language models that can autonomously generate reasoning strategies.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90k+ ML SubReddit.

Here's a brief overview of what we're building at Marktechpost:

  • ML News Community – r/machinelearningnews (92k+ members)
  • Newsletter – airesearchinsights.com (30k+ subscribers)
  • miniCON AI Events – minicon.marktechpost.com
  • AI Reports & Magazines – magazine.marktechpost.com
  • AI Dev & Research News – marktechpost.com (1M+ monthly readers)
  • Partner with us

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
