Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
In this blog, we will explore Fine-Grained Reinforcement Learning from Human Feedback (Fine-Grained RLHF), an approach to improving language model training by providing more detailed, localized feedback. We’ll discuss how it addresses the limitations of traditional RLHF, its applications in areas like detoxification and long-form question answering, and the broader implications for building safer, more aligned AI systems.
