Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulties in resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy, integrating the complementary strengths of large language models and vision-language models in evaluating robot behaviors for more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on 2 locomotion and 6 manipulation tasks across various benchmarks, demonstrating superior performance over FM-based and scripted baselines.
Preference-based RL offers an alternative to hand-designed rewards by learning from comparative feedback, yet its reliance on human labels limits scalability. Our goal is to enable scalable, zero-shot PbRL with foundation models (FMs). We identify three critical challenges in existing preference-based RL frameworks:
- Single-modality FM evaluation: Vision-based reasoning offers reliable spatial grounding and goal-state assessment, but has limited ability to interpret temporal progression or subtle motion dynamics; text-centric analysis provides strong temporal and logical reasoning, but often hallucinates or misses fine-grained spatial interactions and key events.
- Early-stage query ambiguity: Early trajectories from random policies are uniformly low quality and lack meaningful task variation, so they cannot provide informative comparisons.
- Trajectory-level credit assignment: Preferences are given at the trajectory level, but reward models operate at the state-action level, making it hard to determine which steps caused the preference (see the sketch after this list).
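To make the credit-assignment gap concrete, below is a minimal sketch of the standard Bradley-Terry preference loss used in most PbRL pipelines; this is generic PbRL code, not PRIMT's implementation, and the `reward_model` interface is an assumption for illustration. The learned reward scores individual state-action pairs, yet the preference label only constrains the summed return of each segment, so the gradient spreads credit uniformly over all steps.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, label):
    """Generic Bradley-Terry PbRL loss (not PRIMT-specific).

    seg_a, seg_b: tensors of shape (T, obs_dim + act_dim) holding the
    state-action pairs of two trajectory segments.
    label: 1.0 if segment A is preferred, 0.0 if segment B is preferred,
    0.5 for an indecisive label.
    """
    # Per-step rewards are collapsed into one trajectory-level score;
    # this summation is where per-step credit information is lost.
    ret_a = reward_model(seg_a).sum()
    ret_b = reward_model(seg_b).sum()

    # P(A preferred over B) under the Bradley-Terry model.
    log_probs = F.log_softmax(torch.stack([ret_a, ret_b]), dim=0)

    # Cross-entropy against the (possibly soft) preference label.
    target = torch.tensor([label, 1.0 - label])
    return -(target * log_probs).sum()
```

Because the loss only constrains segment returns, two segments that differ in a single decisive step push the same gradient signal onto every step; PRIMT's hindsight trajectory augmentation and causal auxiliary loss target exactly this weakness.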
Overview of PRIMT, which comprises two synergistic modules: 1) Hierarchical neuro-symbolic preference fusion improves the quality and reliability of synthetic feedback by leveraging the complementary collective intelligence of VLMs and LLMs for multimodal evaluation of robot behaviors; and 2) Bidirectional trajectory synthesis consists of foresight trajectory generation, which bootstraps the trajectory buffer to mitigate early-stage query ambiguity, and hindsight trajectory augmentation, which applies SCM-based counterfactual reasoning to improve reward learning with a causal auxiliary loss that enables fine-grained credit assignment.
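To illustrate how multimodal judgments could be combined, here is a minimal sketch of a confidence-aware fusion step, assuming hypothetical `query_vlm` / `query_llm` helpers and a simple agreement rule; PRIMT's actual hierarchical neuro-symbolic fusion uses its own prompts and symbolic rules described in the paper, so treat this only as an illustration of the idea.

```python
from dataclasses import dataclass

@dataclass
class FMJudgment:
    preferred: int       # 0 -> segment A, 1 -> segment B, -1 -> indecisive
    confidence: float    # self-reported confidence in [0, 1]

def query_vlm(frames_a, frames_b, task_desc) -> FMJudgment:
    """Hypothetical VLM call: judges spatial grounding and goal states from frames."""
    raise NotImplementedError

def query_llm(text_a, text_b, task_desc) -> FMJudgment:
    """Hypothetical LLM call: judges temporal/logical progress from textualized trajectories."""
    raise NotImplementedError

def fuse_preferences(vlm: FMJudgment, llm: FMJudgment,
                     conf_threshold: float = 0.6) -> int:
    """Illustrative fusion rule (an assumption, not the paper's exact logic).

    Returns 0 (prefer A), 1 (prefer B), or -1 (discard the query as indecisive).
    """
    # Agreement across modalities -> accept the shared label.
    if vlm.preferred == llm.preferred:
        return vlm.preferred
    # Disagreement -> defer to the single modality that is confident enough;
    # otherwise mark the query indecisive rather than injecting a noisy label.
    confident = [j for j in (vlm, llm) if j.confidence >= conf_threshold]
    if len(confident) == 1:
        return confident[0].preferred
    return -1
```

Keeping an explicit indecisive outcome rather than forcing a binary label is consistent with the label-distribution analysis below, which separates correct, incorrect, and indecisive labels.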
- Better Task Performance
- Improved Synthetic Feedback Quality
- Enhanced Credit Assignment
- Better Cost-Performance Efficiency
Learning curves of PRIMT and baseline methods across all tasks.
Ablation study on FM backbone selection.
Distribution of preference labels, showing the proportion of correct, incorrect, and indecisive labels across different methods.
Reward alignment analysis, comparing the learned reward outputs of PRIMT, ablations, and baselines against ground-truth rewards.
Cost-performance trade-off comparison of PRIMT against baseline methods.