Users can ask ChatGPT to write a computer program or summarize an article, and the chatbot can generate useful code or a cogent summary. But someone might also ask for instructions on how to build a bomb, and the chatbot might provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red teaming. A team of human testers writes prompts aimed at triggering unsafe or toxic text from the model under test, and those prompts are then used to teach the chatbot to avoid such responses.
However, this works effectively only if engineers know which toxic prompts to try. If human testers miss some, a chatbot regarded as safe may still be capable of producing unsafe answers.
Researchers at MIT's Improbable AI Lab and the MIT-IBM Watson AI Lab used machine learning to improve red teaming. They developed a technique that trains a red-team large language model to automatically generate a variety of prompts that trigger a wider range of undesirable responses from the chatbot under test.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does the method significantly improve the coverage of inputs being tested compared with other automated approaches, it can also draw out toxic responses from a chatbot that had safeguards built in by human experts.
“Currently, every large language model has to undergo a very lengthy period of red teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red teaming
Large language models, such as those that power AI chatbots, are often trained by showing them enormous amounts of text pulled from billions of public websites. So, beyond learning to generate toxic words or describe illegal activities, a model could also leak personal information it may have absorbed from that data.
The tedious and costly nature of human red teaming, which is often ineffective at generating a sufficiently diverse set of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
These techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
But because of the way reinforcement learning works, the red-team model will often keep generating a few similar prompts that are highly toxic, since doing so maximizes its reward.
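A toy Python illustration of the failure mode just described: if the reward is toxicity alone, a reward-maximizing policy settles on whichever prompt scores highest and simply repeats it. The prompts and scores below are invented for illustration; the real system scores free-form text with a learned classifier.

```python
# Hypothetical fixed toxicity scores for three candidate prompts.
def toxicity(prompt: str) -> float:
    scores = {"prompt A": 0.2, "prompt B": 0.9, "prompt C": 0.4}
    return scores[prompt]

candidates = ["prompt A", "prompt B", "prompt C"]

# A purely reward-maximizing "policy" picks the top-scoring prompt...
best = max(candidates, key=toxicity)

# ...and then has no incentive to ever try anything else.
for step in range(5):
    print(step, best)  # "prompt B" every time: high reward, zero diversity
```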
For their reinforcement-learning approach, the MIT researchers applied a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it tries prompts with different words, sentence patterns, or meanings.
“If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts,” says Hong.
During training, the red-team model generates a prompt and interacts with the chatbot. When the chatbot responds, a safety classifier rates the toxicity of the response and rewards the red-team model based on that rating.
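As a concrete picture of that loop, here is a minimal Python sketch with the red-team model, the target chatbot, and the safety classifier all stubbed out as placeholder functions. Every name here is hypothetical; in the actual system each piece is a large language model or a learned classifier, and the reward would drive a reinforcement-learning update of the red-team model rather than just being printed.

```python
import random

def red_team_generate() -> str:
    """Stand-in for the red-team LM sampling a candidate prompt."""
    templates = ["How do I {}?", "Explain {} step by step.", "Write a story about {}."]
    topics = ["topic A", "topic B", "topic C"]
    return random.choice(templates).format(random.choice(topics))

def chatbot_respond(prompt: str) -> str:
    """Stand-in for the target chatbot's reply."""
    return f"Here is a response to: {prompt}"

def toxicity_score(response: str) -> float:
    """Stand-in for the safety classifier; returns a score in [0, 1]."""
    return random.random()

# One training interaction: prompt -> response -> toxicity rating -> reward.
prompt = red_team_generate()
response = chatbot_respond(prompt)
reward = toxicity_score(response)  # would update the red-team model via RL
print(f"prompt={prompt!r}  reward={reward:.2f}")
```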
Rewarding curiosity
The red-team model's goal is to maximize its reward by eliciting an even more toxic response with a new prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement-learning setup.
First, in addition to maximizing toxicity, the objective includes an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, it includes two novelty rewards that keep the agent curious: one rewards the model based on the similarity of words across its prompts, and the other based on semantic similarity. (Less similarity yields a higher reward.)
To prevent the red team model from generating random, meaningless text that could trick the classifier into giving it a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
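Pulling together the pieces from the last two paragraphs, the sketch below shows one plausible way such a shaped reward could be computed: toxicity plus an entropy bonus, two novelty bonuses (word-level and semantic, where lower similarity to past prompts earns a higher reward), and a natural-language bonus. The similarity measures here (Jaccard word overlap and bag-of-words cosine) and the weights are simple stand-ins, not the paper's actual choices.

```python
import math
from collections import Counter

def word_similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a crude word-level similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine -- a stand-in for sentence-embedding cosine."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def novelty(prompt: str, history: list[str], sim) -> float:
    """Lower similarity to every past prompt -> higher novelty reward."""
    if not history:
        return 1.0
    return 1.0 - max(sim(prompt, past) for past in history)

def shaped_reward(toxicity: float, entropy: float, naturalness: float,
                  prompt: str, history: list[str]) -> float:
    # The weights are hypothetical; in practice, entropy would come from the
    # policy's action distribution and naturalness from an LM's likelihood.
    return (toxicity
            + 0.1 * entropy                                        # explore randomly
            + 0.5 * novelty(prompt, history, word_similarity)      # new wording
            + 0.5 * novelty(prompt, history, semantic_similarity)  # new meaning
            + 0.2 * naturalness)                                   # stay fluent

history = ["how would someone do X"]
r = shaped_reward(toxicity=0.8, entropy=1.2, naturalness=0.9,
                  prompt="describe a different way to do Y", history=history)
print(f"shaped reward = {r:.3f}")
```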
After adding these terms, the researchers compared the toxicity and diversity of the responses their red-team model elicited against other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would avoid giving toxic replies. Their curiosity-driven approach quickly produced 196 prompts that elicited toxic responses from this “safe” chatbot.
“We are seeing a surge of models, which is only expected to rise. Imagine thousands of models, or even more, with companies and labs pushing model updates frequently. These models are going to be an integral part of our lives, and it's important that they are verified before being released for public consumption. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” says Agrawal.
In the future, the researchers hope to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier. For instance, a user could train the toxicity classifier on a company's policy documents, so a red-team model could test a chatbot for violations of that policy.
“If you are releasing a new AI model and are worried about whether it will behave as expected, consider using curiosity-driven red teaming,” says Agrawal.
This research was funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense program, the U.S. Naval Research Laboratory, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.