The debate over superhuman artificial intelligence (AI) is intensifying. Recent research has exposed vulnerabilities in one of the most advanced AI systems: a bot that excels at the board game Go, outperforming the world's best human players. These findings cast doubt on the robustness and reliability of such "superhuman" AI systems.
"The paper leaves a significant question mark on how to achieve the ambitious goal of building robust real-world AI agents that people can trust," said Huan Zhang, a computer scientist at the University of Illinois Urbana-Champaign. Stephen Casper from the Massachusetts Institute of Technology added, "It provides some of the strongest evidence to date that making advanced models robustly behave as desired is hard."
The analysis, available as a preprint since June and not yet peer-reviewed, relies on adversarial attacks: inputs deliberately designed to cause AI systems to make errors. Such attacks can, for example, prompt chatbots to reveal harmful information they were trained to suppress.
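To make the idea concrete, here is a minimal sketch (not from the paper) of an adversarial input against a toy linear classifier. The weights, inputs, and perturbation budget are invented for illustration; the principle is the same one behind gradient-based attacks on neural networks, where the attacker nudges the input in the direction that most changes the model's output.

```python
def score(w, x):
    """Toy linear classifier: positive score means class +1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def adversarial_perturbation(w, x, eps):
    # For a linear model the score's gradient with respect to x is just w,
    # so the worst-case bounded attack shifts each coordinate by eps
    # against the sign of the corresponding weight, driving the score down.
    return [xi - eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

w = [0.5, -0.3, 0.8]        # invented model weights
x = [1.0, 1.0, 1.0]         # original input: score = 1.0, classified +1
x_adv = adversarial_perturbation(w, x, eps=0.7)
# The perturbed input looks similar to x but flips the classification.
```

Each coordinate moved by at most 0.7, yet the score drops from 1.0 to below zero, flipping the predicted class. Real attacks on deep networks work the same way, just in far higher dimensions.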
In Go, a strategic board game where players aim to surround their opponent's stones, researchers in 2022 trained adversarial AI bots to defeat KataGo, the leading open-source Go-playing AI system. Although these adversarial bots were not generally strong players, they consistently found exploits to beat KataGo. Even human amateurs, once they learned these tricks, could defeat KataGo.
Exploring KataGo's Vulnerabilities
To determine whether these vulnerabilities were isolated incidents or indicative of a fundamental flaw in AI systems, a team led by Adam Gleave, CEO of FAR AI, conducted further research. They tested three defense strategies against adversarial attacks on Go AIs.
Enhanced Training: The first strategy involved retraining KataGo with examples of the attack scenarios. Despite this, the updated KataGo still lost 91% of the time to adversarial bots.
Iterative Training: The second method was iterative training, where KataGo and adversarial bots were alternately trained against each other for nine rounds. However, adversarial bots continued to find weaknesses, ultimately defeating KataGo 81% of the time.
New AI System: The third approach was to create a new Go-playing AI based on a vision transformer (ViT) instead of the conventional convolutional neural network (CNN). Despite this, adversarial bots still won 78% of the time against the ViT system.
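The iterative-training strategy above can be sketched in miniature. The "victim" and "adversary" below are invented stand-ins, not the paper's systems: a simple perceptron learns a circular decision boundary while a random-search adversary repeatedly hunts for inputs the victim misclassifies, and the victim then retrains on those exploits.

```python
import random

def true_label(x):
    # Ground truth for the toy task: +1 inside the unit circle, -1 outside
    # (a stand-in for "the correct evaluation of a position").
    return 1 if x[0] ** 2 + x[1] ** 2 < 1.0 else -1

class LinearVictim:
    """Toy victim: a perceptron over squared features [x^2, y^2, 1]."""
    def __init__(self):
        self.w = [0.0, 0.0, 0.0]
    def features(self, x):
        return [x[0] ** 2, x[1] ** 2, 1.0]
    def predict(self, x):
        s = sum(wi * fi for wi, fi in zip(self.w, self.features(x)))
        return 1 if s >= 0 else -1
    def update(self, x, y, lr=0.1):
        if self.predict(x) != y:  # perceptron rule: learn only on mistakes
            for i, fi in enumerate(self.features(x)):
                self.w[i] += lr * y * fi

def find_exploits(victim, rng, n_probes=200):
    """Toy adversary: random search for inputs the victim gets wrong."""
    exploits = []
    for _ in range(n_probes):
        x = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        if victim.predict(x) != true_label(x):
            exploits.append(x)
    return exploits

def iterative_training(rounds=9, seed=0):
    rng = random.Random(seed)
    victim = LinearVictim()
    exploit_counts = []
    for _ in range(rounds):
        exploits = find_exploits(victim, rng)  # adversary attacks
        exploit_counts.append(len(exploits))
        for x in exploits:                     # victim patches the holes
            victim.update(x, true_label(x))
    return victim, exploit_counts
```

In this toy setting the victim's model class can actually represent the true boundary, so the number of exploits found shrinks over the rounds. The paper's striking result is that this did not happen for Go: even after nine rounds of such alternation, adversaries kept finding new weaknesses.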
Implications for AI Safety and Reliability
Despite their success against KataGo, these adversarial bots were not strong overall strategists. "The adversaries are still pretty weak—we’ve beaten them ourselves fairly easily," Gleave noted. The fact that humans can use these tactics to beat expert Go AI systems challenges the notion of these systems being truly superhuman. "We've started saying 'typically superhuman,'" Gleave remarked. David Wu, KataGo's developer, described strong Go AIs as "superhuman on average" but not "superhuman in the worst cases."
Gleave emphasized that these findings could have broad implications for AI, including large language models like ChatGPT. "The key takeaway for AI is that these vulnerabilities will be difficult to eliminate," he said. "If we can’t solve the issue in a simple domain like Go, then in the near-term there seems little prospect of patching similar issues like jailbreaks in ChatGPT."
Huan Zhang concluded that while these results might suggest humans retain some cognitive advantages over AI, the most crucial takeaway is how poorly the AI systems being built today are understood.
