Advanced large language models are emerging rapidly, but their performance in real-world applications is often constrained by users' inability to craft high-quality prompts, especially for complex tasks. Prompt-R1 addresses this challenge with an end-to-end reinforcement learning framework in which a small-scale LLM collaborates with large-scale LLMs through multi-turn prompt interactions.
The small model acts as a prompt agent that reasons about the task and generates prompts, while the large model performs the complex reasoning and produces the task outputs. A dual-constrained reward jointly optimizes answer correctness, prompt generation quality, and reasoning accuracy. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms strong baselines across tasks, demonstrating the effectiveness of collaborative automatic prompting.
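To make the collaboration loop concrete, here is a minimal sketch of how a prompt agent and a large solver model could interact over multiple turns, with a dual-constrained reward combining answer correctness and prompt-side quality. Everything here is an illustrative assumption, not the authors' implementation: the function names (`prompt_agent`, `solver_llm`, `prompt_quality`, `answer_score`), the `<final>` stop signal, the turn limit, and the reward weights are all hypothetical.

```python
# Minimal sketch of the multi-turn prompt-agent / solver loop described above.
# Names, the stop signal, and reward weights are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Turn:
    prompt: str   # prompt written by the small-scale agent
    answer: str   # response returned by the large-scale LLM


@dataclass
class Episode:
    question: str
    turns: List[Turn] = field(default_factory=list)
    final_answer: str = ""


def run_episode(
    question: str,
    prompt_agent: Callable[[str, List[Turn]], str],   # small LLM: writes the next prompt
    solver_llm: Callable[[str], str],                 # large LLM: answers the prompt
    max_turns: int = 3,
) -> Episode:
    """Multi-turn interaction: the agent refines prompts, the solver answers them."""
    episode = Episode(question=question)
    for _ in range(max_turns):
        prompt = prompt_agent(question, episode.turns)
        answer = solver_llm(prompt)
        episode.turns.append(Turn(prompt=prompt, answer=answer))
        if "<final>" in answer:  # assumed stop signal emitted by the solver
            break
    episode.final_answer = episode.turns[-1].answer if episode.turns else ""
    return episode


def dual_constrained_reward(
    episode: Episode,
    gold_answer: str,
    prompt_quality: Callable[[str], float],    # e.g. format/validity score in [0, 1]
    answer_score: Callable[[str, str], float]  # e.g. exact match or F1 in [0, 1]
) -> float:
    """Assumed form of a dual-constrained reward: answer correctness combined with
    a prompt-quality term, so malformed prompts cannot earn full credit."""
    correctness = answer_score(episode.final_answer, gold_answer)
    n_turns = max(len(episode.turns), 1)
    prompt_term = sum(prompt_quality(t.prompt) for t in episode.turns) / n_turns
    return 0.8 * correctness + 0.2 * prompt_term  # weights are illustrative
```

Under this reading of the framework, the reward would drive policy-gradient updates of the small prompt agent only, while the large solver model is treated as a fixed black box that the agent learns to prompt effectively.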
@misc{liu2025promptr1collaborativeautomaticprompting,
      title         = {Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning},
      author        = {Wenjin Liu and Haoran Luo and Xueyuan Lin and Haoming Liu and Tiesunlong Shen and Jiapu Wang and Rui Mao and Erik Cambria},
      year          = {2025},
      eprint        = {2511.01016},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2511.01016}
}