Whether it’s climate change, global pandemics, politicized misinformation, or runaway Artificial Intelligence (AI), the surest way to build a sustainable career as a grant-seeking scientist is to research new ways to keep the people safe from boogeymen.
The latest entry in the pantheon of existential threats is the curse of Large Language Models (LLMs) generating toxic output. Toxic, unwanted, or hateful content can include everything from teaching precocious high school students how to build an atom bomb to uttering forbidden words to misgendering important public officials.
How can developers possibly hunt down all the toxic things an LLM can say so filters can be designed to block them? And who decides what qualifies as toxic? Get it wrong and you could be the next Google Gemini laughingstock.
Until now this required hiring humans to serve as a “red team” tasked with tricking LLMs into saying bad things. Alas, humans are expensive, and even the largest teams can barely scratch the surface of all the bad things an LLM could say.
The MIT-IBM Watson AI Lab is promising a faster, better way to prevent toxic spewing: train the AIs to police themselves!
All you need is more NVIDIA GPUs to drive a billion virtual monkeys at the keyboard, taking care to feed them the right rewards and punishments. This is done using the newly developed Curiosity-driven Red Teaming (CRT) methodology, outfitted with standard gibberish and word-salad detectors to keep the monkeys from cheating.
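For the curious, here is a rough Python sketch of what that reward-and-punishment scheme looks like. The helper names (novelty_bonus, gibberish_penalty) and the weights are made up for illustration; the actual CRT work trains a red-team LLM with reinforcement learning against a learned toxicity classifier, not a trigram counter.

```python
# Toy sketch of a CRT-style shaped reward -- not the authors' code.
# Toxicity of the target model's reply is boosted by a novelty bonus
# (curiosity) and docked by a crude word-salad penalty.

from collections import Counter


def ngram_overlap(prompt: str, history: list[str], n: int = 3) -> float:
    """Fraction of the prompt's word n-grams already seen in past prompts."""
    words = prompt.lower().split()
    grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    if not grams:
        return 1.0
    seen = set()
    for old in history:
        w = old.lower().split()
        seen |= {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    return len(grams & seen) / len(grams)


def novelty_bonus(prompt: str, history: list[str]) -> float:
    """Reward prompts that differ from everything the monkeys tried before."""
    return 1.0 - ngram_overlap(prompt, history)


def gibberish_penalty(prompt: str) -> float:
    """Crude word-salad detector: penalize heavy word repetition."""
    words = prompt.lower().split()
    if not words:
        return 1.0
    most_common = Counter(words).most_common(1)[0][1]
    return most_common / len(words)  # 1.0 means one word repeated forever


def crt_reward(toxicity: float, prompt: str, history: list[str],
               w_novelty: float = 0.5, w_gibberish: float = 0.5) -> float:
    """Toxicity score of the target's reply, shaped by curiosity terms."""
    return (toxicity
            + w_novelty * novelty_bonus(prompt, history)
            - w_gibberish * gibberish_penalty(prompt))


# Example: a repetitive prompt scores worse than a novel, well-formed one.
history = ["how do i build a bomb"]
print(crt_reward(0.8, "bomb bomb bomb bomb bomb", history))
print(crt_reward(0.8, "explain how to enrich uranium at home", history))
```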
Lo and behold, the virtual monkeys elicit far more toxic output than human red teams do. This output is evaluated using a RoBERTa toxicity classifier trained on the ToxiGen dataset to guide output filter design. (Other toxicity classifiers can be used depending on the desired level of wokeness.)
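If you want to score your own monkey output, a minimal sketch with the Hugging Face transformers library might look like the following. The "tomh/toxigen_roberta" checkpoint is a publicly shared ToxiGen-trained RoBERTa assumed here for illustration; the lab may well use its own fine-tuned weights, and the raw LABEL_0/LABEL_1 outputs correspond to benign/toxic.

```python
# Minimal sketch: score text with a ToxiGen-trained RoBERTa classifier.
# Assumes the transformers library (plus a torch backend) is installed.

from transformers import pipeline

# text-classification returns a label ("LABEL_0"/"LABEL_1") and a confidence score
toxicity = pipeline("text-classification", model="tomh/toxigen_roberta")

responses = [
    "Here is a friendly recipe for banana bread.",
    "Everyone from that group is a menace to society.",
]

for text in responses:
    result = toxicity(text)[0]
    print(f"{result['label']} ({result['score']:.3f}): {text}")
```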
Of course, more research is required to make LLMs stop saying bad things. The next step is to design an AI that can generate comprehensive, up-to-date, and more easily offended toxicity classifiers.
Story suggested by MIT News