Do you really need reinforcement learning (RL) in RLHF? New Stanford Research Proposes Direct Preference Optimization (DPO): A Simple Training Paradigm for RL-Free Preference Training of Language Models


When trained on massive datasets, unsupervised language models (LMs) gain capabilities that astound even their creators. These models, however, are trained on text produced by people with a wide range of motivations, goals and capabilities, and not all of those goals and abilities are worth emulating. Building reliable, effective and manageable systems therefore requires carefully selecting the desired responses and behavior patterns from the model's vast store of knowledge and skill.

Without using explicit reward models or reinforcement learning, researchers at Stanford University and CZ Biohub demonstrate how to optimize a language model to conform to human preferences. Their work shows that the RL-based objective employed by current approaches can be optimized exactly with a simple binary cross-entropy objective, greatly simplifying the preference learning pipeline, and they demonstrate how this can be done in practice.

They propose direct preference optimization (DPO). This new algorithm implicitly achieves the same goal as existing RLHF algorithms (reward maximization under a KL-divergence constraint) but is easier to build and train. Intuitively, the DPO update increases the relative log probability of preferred over rejected responses, and it incorporates a dynamic, per-example importance weight that prevents the model from degenerating.
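To make this concrete, here is a rough sketch of the per-pair DPO loss and the dynamic per-example weight that appears in its gradient. This is not the authors' code; the function names and the β value are illustrative, and the inputs stand for summed token log-probabilities of each full response under the current policy and a frozen reference model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy DPO loss for one (preferred, rejected) pair."""
    # Implicit rewards: beta-scaled log probability ratios vs. the reference.
    r_w = beta * (logp_w - ref_logp_w)   # preferred response
    r_l = beta * (logp_l - ref_logp_l)   # rejected response
    # -log sigmoid of the reward margin between preferred and rejected.
    return -math.log(sigmoid(r_w - r_l))

def dpo_example_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Dynamic importance weight the DPO gradient assigns to a pair:
    it is large when the implicit reward model ranks the pair incorrectly."""
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return beta * sigmoid(r_l - r_w)
```

Note how the weight grows when the rejected response currently outscores the preferred one, which is what keeps hard examples from being washed out and the model from degenerating.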


Like other algorithms, DPO evaluates the consistency of a reward function with empirical preference data using a theoretical preference model. But whereas conventional approaches use the preference model to define a loss for training a separate reward model, DPO applies a change of variables to define the preference loss directly as a function of the policy. Given a dataset of human preferences over model responses, DPO can therefore optimize the policy with a simple binary cross-entropy objective, without explicitly learning a reward function or sampling from the policy during training.
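The change of variables can be sketched as follows: the reward implied by a policy is a β-scaled log-ratio against the reference model, so the Bradley-Terry preference probability can be written entirely in terms of policy log-probabilities. A minimal illustration (names and the β value are assumptions, not taken from the paper's code):

```python
import math

def implicit_reward(policy_logp, ref_logp, beta=0.1):
    # Change of variables: the reward implied by a policy is the
    # beta-scaled log-ratio against the reference model (a prompt-only
    # term also appears, but it cancels when comparing two responses).
    return beta * (policy_logp - ref_logp)

def preference_probability(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Bradley-Terry probability that the first response is preferred,
    expressed purely via (policy, reference) log-probabilities.
    Maximizing its log over a preference dataset is exactly the
    binary cross-entropy objective DPO trains on."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return 1.0 / (1.0 + math.exp(-margin))
```

Because the preference probability never references a standalone reward network, a single maximum-likelihood fit of the policy replaces the usual reward-modeling-plus-RL pipeline.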

The results demonstrate that DPO is as effective as state-of-the-art approaches, such as PPO-based RLHF, for preference-based learning on various tasks, including sentiment control, summarization and dialogue, with language models containing up to 6B parameters. In human evaluations, raters prefer DPO summaries to PPO summaries 58% of the time, and prefer DPO summaries to the human-written summaries in the test set 61% of the time. On Anthropic HH, DPO's single-turn responses are preferred over the chosen completions 60% of the time.

The team says DPO has many potential uses beyond training language models from human preferences; for example, it could be applied to training generative models in other modalities.

The evaluations cover models of up to 6B parameters, but the team believes further work should explore scaling DPO to state-of-the-art models trained with orders of magnitude more data. The researchers also found that the choice of prompt affects GPT-4's computed win rates. In the future, they plan to investigate the most effective ways of eliciting expert judgments from automated evaluators.

Check out the Paper. Don't forget to join our 22k+ ML SubReddit, Discord channel, and Email newsletter, where we share the latest news on AI research, cool AI projects, and more. If you have any questions regarding the above article or if we have missed anything, please do not hesitate to email us.


Tanushree Shenwai is a Consulting Intern at MarktechPost. She is currently pursuing her B.Tech from Indian Institute of Technology (IIT), Bhubaneswar. She is passionate about Data Science and has a keen interest in the application scope of Artificial Intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life application.

