A team of researchers at the A.I. company Anthropic has made a major breakthrough in understanding how large language models, such as the one that powers ChatGPT, actually work. These models have long been a mystery, with even their creators unable to fully explain their behavior. But now, using a technique called “dictionary learning,” the researchers have uncovered recurring patterns in the combinations of neurons inside the A.I. model that activate in response to different prompts.
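In rough terms, dictionary learning decomposes each internal activation vector into a small set of reusable directions, the “features” described below. The sketch that follows illustrates the general idea on synthetic data using scikit-learn's DictionaryLearning; the prompt counts, neuron counts, and feature counts are invented for illustration, and the actual research trained far larger sparse autoencoders (a form of dictionary learning) on a production model's activations rather than anything this small.

```python
# A minimal, illustrative sketch of dictionary learning on synthetic
# "neuron activations." All sizes and names here are hypothetical.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Synthetic activations: each "prompt" activates a few hidden ground-truth
# directions, mimicking the tangled superposition the research tries to undo.
n_prompts, n_neurons, n_true = 200, 32, 64
true_directions = rng.normal(size=(n_true, n_neurons))
sparse_mix = rng.random((n_prompts, n_true)) * (rng.random((n_prompts, n_true)) < 0.05)
activations = sparse_mix @ true_directions + 0.01 * rng.normal(size=(n_prompts, n_neurons))

# Learn an overcomplete dictionary of candidate "features": each feature is a
# direction in activation space, and each prompt is explained by a few of them.
dict_learner = DictionaryLearning(
    n_components=64,                 # number of candidate features to learn
    alpha=1.0,                       # sparsity penalty: few features per prompt
    transform_algorithm="lasso_lars",
    random_state=0,
)
feature_activity = dict_learner.fit_transform(activations)  # (prompts, features)
features = dict_learner.components_                          # (features, neurons)

# For any prompt, the nonzero entries say which features "fired" and how hard.
active = np.nonzero(feature_activity[0])[0]
print(f"Prompt 0: {len(active)} active features -> {active[:10]}")
```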
The team identified roughly 10 million of these patterns, which they call “features.” Individual features were linked to concrete topics, such as San Francisco or immunology, as well as to more abstract concepts, such as deception or gender bias. By manually turning certain features on or off, the researchers were able to change how the A.I. system behaved, or even get it to break its own rules.
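To give a concrete sense of what “turning a feature on or off” can mean, here is a minimal sketch of that kind of intervention: once a feature corresponds to a direction in the model's activation space, its strength can be clamped up or down before the activations flow onward. The vectors, the steer function, and the numbers below are all illustrative assumptions, not Anthropic's actual code or feature values.

```python
# A sketch of "feature steering": nudging an internal activation vector along
# a learned feature direction, or zeroing that direction out. Illustrative
# only; a real intervention happens inside a running model.
import numpy as np

rng = np.random.default_rng(1)

n_neurons = 64
activation = rng.normal(size=n_neurons)         # internal state for one token
feature_direction = rng.normal(size=n_neurons)  # stand-in for a learned feature
feature_direction /= np.linalg.norm(feature_direction)

def steer(act, direction, strength):
    """Return a copy of the activation with the feature clamped to `strength`.

    Removing the current component first means strength=0 switches the
    feature off, while a large strength forces it strongly on.
    """
    current = act @ direction                   # how active the feature is now
    return act + (strength - current) * direction

boosted = steer(activation, feature_direction, strength=10.0)    # feature "on"
suppressed = steer(activation, feature_direction, strength=0.0)  # feature "off"

print("before:    ", activation @ feature_direction)
print("boosted:   ", boosted @ feature_direction)
print("suppressed:", suppressed @ feature_direction)
```

Clamping a feature well above or below its usual range is, in spirit, how turning a feature up or down can change the model's behavior or even push it past its own rules.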
Chris Olah, who led the Anthropic interpretability research team, believes that these findings could allow A.I. companies to control their models more effectively and address concerns about bias, safety risks, and autonomy. While this research represents an important step forward, Olah acknowledges that A.I. interpretability is still a complex and ongoing challenge.
Despite the progress made by Anthropic, there is still much work to be done in understanding and regulating large language models. However, this breakthrough offers hope that with continued research and development, we may be able to unlock the mysteries of A.I. systems and ensure they can be used safely and responsibly.