Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback

ICLR 2024

Tianjin University; Huawei Noah's Ark Lab

Abstract

Reinforcement Learning with Human Feedback (RLHF) has received significant attention for enabling agents to perform tasks without costly manual reward design by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods across different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF. It aims to provide a complete workflow for learning from real human feedback, fostering progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF develops a user-friendly annotation interface tailored to various feedback types, compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotations, resulting in large-scale annotated datasets comprising more than 15 million steps across 32 popular tasks. Through extensive experiments, results on the collected datasets demonstrate competitive performance compared to those from well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We hope these open-source platforms, datasets, and baselines will facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback.

Universal Platform for Reinforcement Learning with Diverse Feedback Types

To align RLHF methodologies with practical problems and cater to researchers’ needs for systematic studies of various feedback types within a unified context, we introduce the Uni-RLHF system. We start with the universal annotation platform, which supports various types of human feedback along with a standardized encoding format for them. Using this platform, we establish a pipeline for crowdsourced feedback collection and filtering, amassing large-scale crowdsourced labeled datasets and setting a unified research benchmark.



Implementation of the Multi-feedback Annotation Platform


Overview of the Uni-RLHF system. Uni-RLHF consists of three components: the platform, the datasets, and the offline RLHF baselines. Uni-RLHF packages up abstractions for the RLHF annotation workflow, where the essentials include: ① interfaces supporting a wide range of online environments and offline datasets, ② a query sampler that determines which data to display, ③ an interactive user interface enabling annotators to view available trajectory segments and provide feedback responses, and ④ a feedback translator that transforms diverse feedback labels into a standardized format.
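As a minimal sketch of how the query sampler (②) and feedback translator (④) abstractions might compose, the code below uses illustrative class and method names that are assumptions, not the platform's actual API.

```python
# Hypothetical sketch of the annotation workflow abstractions; names are illustrative.
import random
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class QuerySampler(ABC):
    """Decides which trajectory segments are shown to annotators (component ②)."""

    @abstractmethod
    def sample(self, dataset: Dict[str, Any], batch_size: int) -> List[Tuple[Any, Any]]:
        ...


class FeedbackTranslator(ABC):
    """Converts raw UI responses into a standardized label format (component ④)."""

    @abstractmethod
    def translate(self, raw_feedback: Dict[str, Any]) -> Dict[str, Any]:
        ...


class UniformQuerySampler(QuerySampler):
    def sample(self, dataset, batch_size):
        # Uniformly sample pairs of segments to display side by side.
        segments = dataset["segments"]
        return [tuple(random.sample(segments, 2)) for _ in range(batch_size)]


class ComparativeTranslator(FeedbackTranslator):
    def translate(self, raw_feedback):
        # Map a UI choice ("left", "right", "equal") to a scalar preference label,
        # where 1.0 means the left (first) segment is preferred.
        mapping = {"left": 1.0, "right": 0.0, "equal": 0.5}
        return {"query_id": raw_feedback["query_id"],
                "label": mapping[raw_feedback["choice"]]}
```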


Uni-RLHF supports both online and offline training modes. Some representative tasks from the supported environments are visualized above. Furthermore, Uni-RLHF allows easy customization and integration of new offline datasets by simply adding three functions, as sketched below.
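The exact hook names are platform-specific; the following sketch only illustrates the idea of wiring in a new offline dataset via three small functions (loading, segment sampling, and rendering), all of which are hypothetical names.

```python
# Illustrative only: three integration hooks a new offline dataset might provide.
# Function names are hypothetical, not the platform's actual API.
import numpy as np


def load_dataset(path: str) -> dict:
    """Load transitions into a dict of arrays, e.g. from a .npy file holding a pickled dict."""
    data = np.load(path, allow_pickle=True).item()
    return {k: np.asarray(v) for k, v in data.items()}


def sample_segment(dataset: dict, length: int = 50) -> dict:
    """Cut a fixed-length trajectory segment to be shown in the annotation UI."""
    start = np.random.randint(0, len(dataset["observations"]) - length)
    return {k: v[start:start + length] for k, v in dataset.items()}


def render_segment(segment: dict) -> list:
    """Return a list of frames (e.g., RGB arrays) for the front-end video player."""
    return list(segment.get("renders", segment["observations"]))
```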

Standardized Feedback Encoding Format for Reinforcement Learning


The five supported feedback types: comparative feedback, attribute feedback, evaluative feedback, keypoint feedback, and visual feedback.

Standardized Feedback Encoding Format for Reinforcement Learning. To capture and utilize diverse and heterogeneous feedback labels from annotators, we analyze a range of research and propose a standardized feedback encoding format along with possible training methodologies.
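For illustration, a few labels in a possible standardized encoding are sketched below; the field names and schema are assumptions and may differ from the platform's actual format.

```python
# Illustrative encodings only; the exact schema in Uni-RLHF may differ.

# Comparative feedback: a preference over a pair of segments,
# with label 1.0 meaning the first listed segment is preferred,
# 0.0 the second, and 0.5 meaning "equally preferable".
comparative_label = {
    "type": "comparative",
    "segments": ["seg_00123", "seg_00456"],
    "label": 1.0,
}

# Attribute feedback: relative judgments along named behavioral attributes,
# one entry per attribute instead of a single scalar preference.
attribute_label = {
    "type": "attribute",
    "segments": ["seg_00123", "seg_00456"],
    "label": {"speed": 0.0, "torso_height": 1.0},
}

# Evaluative feedback: an absolute rating of a single segment on a discrete scale.
evaluative_label = {
    "type": "evaluative",
    "segment": "seg_00789",
    "label": 4,  # e.g., a 1-5 rating
}
```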

Large-scale Crowdsourced Annotation Pipeline

To validate the usability of the various components of the Uni-RLHF platform, we carried out large-scale crowdsourced annotation tasks for feedback label collection using widely recognized offline RL datasets. After completing data collection, we conducted two rounds of data filtering to minimize the amount of noisy crowdsourced data. This systematic pipeline of crowdsourced annotation yields large-scale annotated datasets comprising more than 15 million steps across 32 popular tasks. Our goal is to build crowdsourced data annotation pipelines around Uni-RLHF, facilitating the creation of large-scale annotated datasets via parallel crowdsourced data annotation and filtering.


The effectiveness of each component in the annotation pipeline. We initially sampled 300 trajectory segments of the Left_c task in SMARTS for expert annotation, referred to as Oracle. Five crowdsourced annotators each annotated 100 trajectories under three distinct settings: naive means annotators see only the task description, +example additionally provides five expert-annotated samples with detailed analysis, and +filter adds label filtering on top of the previous conditions. The results above show that each component significantly improved annotation accuracy, ultimately reaching a 98% agreement rate with the expert annotations.
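For reference, agreement with the Oracle labels can be measured as the fraction of queries on which a crowdsourced label matches the expert label; the small sketch below assumes binary labels stored as arrays and is purely illustrative.

```python
# A minimal sketch of computing the agreement rate with expert (Oracle) labels.
import numpy as np


def agreement_rate(crowd_labels: np.ndarray, oracle_labels: np.ndarray) -> float:
    """Fraction of queries where the crowdsourced label matches the expert label."""
    assert crowd_labels.shape == oracle_labels.shape
    return float(np.mean(crowd_labels == oracle_labels))


# Toy example: 100 queries, two of which disagree with the expert labels.
rng = np.random.default_rng(0)
oracle = rng.integers(0, 2, size=100).astype(float)
crowd = oracle.copy()
flip = rng.choice(100, size=2, replace=False)
crowd[flip] = 1.0 - crowd[flip]
print(f"agreement: {agreement_rate(crowd, oracle):.0%}")  # -> 98%
```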

Evaluating Benchmarks for Offline RLHF

Finally, we conducted numerous experiments on downstream decision-making tasks, utilizing the collected crowdsourced feedback datasets to verify the reliability of the Uni-RLHF system.
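The offline RLHF baselines evaluated here generally follow a two-step recipe: learn a reward model from the collected feedback labels, then train a standard offline RL algorithm on the relabeled dataset. As a reference point for the comparative-feedback case, here is a minimal sketch of Bradley-Terry style reward learning over segment pairs; the architecture, hyperparameters, and label convention (1.0 means the first segment is preferred) are assumptions rather than the paper's exact configuration.

```python
# Minimal, illustrative sketch of comparative-feedback reward learning.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Per-step reward model whose segment returns parameterize a Bradley-Terry model."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (B, T, obs_dim), act: (B, T, act_dim) -> predicted segment return (B,)
        r = self.net(torch.cat([obs, act], dim=-1))
        return r.squeeze(-1).sum(dim=1)


def preference_loss(model, seg0, seg1, labels):
    """Cross-entropy between predicted preference probabilities and labels,
    where label 1.0 means the first segment (seg0) is preferred."""
    logits = model(*seg0) - model(*seg1)  # P(seg0 preferred) = sigmoid(ret0 - ret1)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)


# Toy usage: 8 pairs of 50-step segments with random data and labels.
B, T, obs_dim, act_dim = 8, 50, 17, 6
model = RewardModel(obs_dim, act_dim)
seg0 = (torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
seg1 = (torch.randn(B, T, obs_dim), torch.randn(B, T, act_dim))
labels = torch.randint(0, 2, (B,)).float()
preference_loss(model, seg0, seg1, labels).backward()
```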


Evaluating Offline RL with Comparative Feedback

D4RL Results. We use Oracle to denote models trained with hand-designed task rewards. In addition, we assessed two different methods of acquiring labels: crowdsourced labels collected through the Uni-RLHF system, denoted CS, and synthetic labels generated by scripted teachers based on ground-truth task rewards, which can be regarded as expert labels, denoted ST.
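For context, ST labels can be produced by comparing the ground-truth returns of the two segments in each query; the sketch below assumes a simple highest-return rule with an optional margin for ties, which may differ from the exact scripted teacher used here.

```python
# Illustrative scripted-teacher (ST) labeling from ground-truth per-step rewards.
import numpy as np


def scripted_teacher_label(rewards_0: np.ndarray, rewards_1: np.ndarray,
                           equal_margin: float = 0.0) -> float:
    """Return 1.0 if segment 0 has the higher return, 0.0 if segment 1 does,
    and 0.5 when the returns are within `equal_margin` of each other."""
    ret0, ret1 = rewards_0.sum(), rewards_1.sum()
    if abs(ret0 - ret1) <= equal_margin:
        return 0.5
    return 1.0 if ret0 > ret1 else 0.0


# Example: two 50-step segments with per-step ground-truth rewards.
rng = np.random.default_rng(0)
label = scripted_teacher_label(rng.normal(1.0, 0.5, 50), rng.normal(0.8, 0.5, 50))
```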


SMARTS Results. We studied three typical autonomous driving scenarios, where reward function design is particularly complex because it requires extensive domain expertise and the balancing of multiple factors. We empirically demonstrate that crowdsourced annotations alone achieve competitive performance compared to carefully designed reward functions or scripted teachers.


Oracle model (left) and our CS model (right) for the three autonomous driving scenarios: Left_c, Cutin, and Cruise.

The Left_c video presents an instance of a left-turn-cross scenario in which the CS model succeeds while the Oracle model fails due to a collision. In this scenario, the ego vehicle must start from a single-lane straight road, make a left turn at the cross intersection, proceed onto another two-lane straight road, and ultimately reach a goal located at the end of one of these two lanes. The video reveals that the Oracle model fails to decelerate the ego vehicle in time, resulting in a collision with the vehicle ahead, whereas the CS model stops to wait for the vehicle in front and eventually passes through the intersection smoothly.

Evaluating Offline RL with Attribute Feedback

Trained model’s ability to switch behavior. We ran the model for 1000 steps and visualized behavior switching by adjusting the target attributes every 200 steps, recording the walker’s speed and torso height. The target attribute values for speed were [0.1, 1.0, 0.5, 0.1, 1.0], and for height they were [1.0, 0.6, 1.0, 0.1, 1.0]. The corresponding changes in behavior can be clearly observed in the curves and the accompanying video.
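A rollout loop of this kind might look like the sketch below, where `env`, `policy`, and the logged info keys are placeholders (a Gymnasium-style step API is assumed) and the attribute-conditioning interface is an assumption.

```python
# Illustrative rollout loop for switching target attributes every 200 steps.
import numpy as np

speed_targets = [0.1, 1.0, 0.5, 0.1, 1.0]
height_targets = [1.0, 0.6, 1.0, 0.1, 1.0]


def rollout(env, policy, total_steps: int = 1000, switch_every: int = 200):
    obs, _ = env.reset()
    logs = []
    for t in range(total_steps):
        phase = t // switch_every
        target = np.array([speed_targets[phase], height_targets[phase]])
        # Attribute-conditioned policy: the action depends on both the observation
        # and the current target attribute vector (placeholder interface).
        action = policy(obs, target)
        obs, reward, terminated, truncated, info = env.step(action)
        logs.append({"step": t,
                     "speed": info.get("speed"),           # hypothetical info keys
                     "torso_height": info.get("torso_height")})
        if terminated or truncated:
            obs, _ = env.reset()
    return logs
```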

Human Evaluation. We conducted human evaluation experiments for the agents trained with attribute feedback. We first collected three trajectories with different humanness attribute values, set to [0.1, 0.6, 1.0], and invited five human evaluators. The evaluators performed a blind selection to judge which video is the most human-like and which is the least human-like.



In the end, all five evaluators correctly identified the highest-humanness trajectory, and only one evaluator misidentified the lowest-humanness trajectory. These results confirm that agents trained via RLHF are able to follow the abstract attribute metrics well.

Online Experiments

We validate the effectiveness of the online mode of Uni-RLHF, which allows agents to learn novel behaviors for which a suitable reward function is difficult to design. In total, we provide 200 queries of human feedback for the walker front-flip experiment and observe that the walker masters continuous multiple front flips fluently.
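At a high level, the online mode interleaves policy training with rounds of human queries; the loop below is a schematic sketch in which every object and method name (`agent`, `annotation_server`, and so on) is a hypothetical placeholder rather than the platform's real interface.

```python
# Schematic sketch of an online preference-feedback loop; all interfaces are placeholders.
def online_rlhf_loop(agent, env, reward_model, annotation_server,
                     total_queries: int = 200, queries_per_session: int = 10):
    queries_used = 0
    while queries_used < total_queries:
        # 1. Collect fresh experience with the current policy.
        segments = agent.collect_segments(env)
        # 2. Ask human annotators for preferences over sampled segment pairs.
        pairs = annotation_server.sample_pairs(segments, queries_per_session)
        labels = annotation_server.request_feedback(pairs)  # blocks on the web UI
        queries_used += len(labels)
        # 3. Update the learned reward model, then relabel the replay buffer.
        reward_model.update(pairs, labels)
        agent.relabel_replay_buffer(reward_model)
        # 4. Continue policy optimization against the learned reward.
        agent.train_steps()
```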



Visualization of the Platform

Annotation

Create Task

BibTeX

@inproceedings{yuan2023unirlhf,
    title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
    author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
    booktitle={The Twelfth International Conference on Learning Representations, ICLR},
    year={2024},
    url={https://openreview.net/forum?id=WesY0H9ghM},
}