A study from AI researchers at OpenAI demonstrates how large language models, such as those that power chatbots, can be adapted to reflect specific societal values through a relatively simple fine-tuning process, giving developers a way to steer model behavior toward chosen ethical priorities.
Language models trained on vast datasets can sometimes generate toxic or biased outputs. However, the researchers found that fine-tuning (further training on a small, hand-curated dataset) can significantly improve a model's alignment with a predetermined set of values.
For instance, they tuned GPT-3 to oppose unhealthy beauty standards and to avoid slurs when describing people. With just 80 text samples written to exemplify the desired values, the fine-tuned models scored far better in human evaluations of value alignment.
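The paper's experiments fine-tuned GPT-3 on OpenAI's own infrastructure, which cannot be reproduced line for line. As a rough illustration of the general technique rather than the authors' actual setup, the sketch below fine-tunes an open-source stand-in model (GPT-2) on a small curated dataset using the Hugging Face transformers and datasets libraries; the file name values_targeted.jsonl and the hyperparameters are assumptions for illustration, not values from the study.

```python
# Minimal sketch: fine-tune a small causal language model on a hand-curated,
# values-targeted dataset. GPT-2 stands in for GPT-3, and the file name
# "values_targeted.jsonl" (one {"text": ...} object per line) is hypothetical.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# ~80 curated samples, each written to exemplify the chosen values.
dataset = load_dataset("json", data_files="values_targeted.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="values-tuned-gpt2",
        num_train_epochs=3,                # the dataset is tiny; a few epochs suffice
        per_device_train_batch_size=2,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    # The collator builds labels for standard next-token prediction (mlm=False).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("values-tuned-gpt2")
```

Because the dataset is so small, a few epochs at a modest learning rate are typically enough; the leverage in this technique comes from the quality of the curated examples, not the volume of training.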
The researchers propose an iterative process called PALMS (Process for Adapting Language Models to Society) for crafting targeted datasets that encode chosen values. Practitioners write examples that demonstrate the values, fine-tune the model on them, then review the model's outputs and add further examples addressing any remaining shortcomings. After fine-tuning on the resulting dataset, alignment improved across the evaluation metrics.
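The sketch below is a schematic paraphrase of that iterative loop, not code from the paper. The human and training steps are passed in as caller-supplied functions; fine_tune, generate_outputs, and review are hypothetical placeholders for training on the curated data, probing the model on sensitive topics, and human evaluation plus writing new examples.

```python
# Schematic of a PALMS-style iteration: fine-tune, review outputs, grow the dataset.
# The three callables are hypothetical placeholders for the human and training steps.
from typing import Callable, List, Tuple


def palms_iteration(
    fine_tune: Callable[[List[str]], object],            # trains a model on the dataset
    generate_outputs: Callable[[object], List[str]],     # probes the model on sensitive topics
    review: Callable[[List[str]], Tuple[float, List[str]]],  # human score + new examples
    seed_dataset: List[str],
    max_rounds: int = 3,
):
    """Grow a values-targeted dataset until human evaluations stop improving."""
    dataset = list(seed_dataset)
    best_model, best_score = None, None

    for _ in range(max_rounds):
        model = fine_tune(dataset)
        score, new_examples = review(generate_outputs(model))

        if best_score is not None and score <= best_score:
            break  # evaluations stopped improving; keep the previous best model
        best_model, best_score = model, score

        # Add hand-written examples targeting the shortcomings observed in review.
        dataset.extend(new_examples)

    return best_model, dataset
```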
For business leaders deploying AI systems, the work underscores the importance of human review and oversight in shaping model behavior, rather than relying solely on whatever large datasets are available. Responsible AI requires deliberately encoding societal values into systems, and the study shows that even a small, values-guided dataset can significantly improve model outputs.
As chatbots and other AI systems proliferate, this approach provides a template for aligning them with ethical priorities, whether by fine-tuning existing models or training new ones. Rather than treating deployed systems as fixed black boxes, the research shows that model behavior can be deliberately adjusted toward human values.
Sources:
Irene Solaiman and Christy Dennison, "Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets," OpenAI.