Unlocking AI’s Mysteries: Anthropic’s Breakthrough in Understanding Neural Networks
Authored by Steven
Artificial Intelligence Remains Mysterious, But Anthropic Has Found a Method to Peer Within
Over the last ten years, AI researcher Chris Olah has devoted himself to exploring artificial neural networks. One question has driven his work, from his time at Google Brain and OpenAI to his current role as a cofounder of the AI startup Anthropic: "How do they operate internally?" he asks. "We're dealing with these systems without understanding their inner workings. It's bewildering."
The issue has become a pivotal concern with the widespread adoption of generative AI. Large language models such as ChatGPT, Gemini, and Anthropic's Claude have dazzled the public with their linguistic abilities while also provoking frustration with their propensity to fabricate information. Their capacity to tackle problems that once seemed intractable has thrilled those with a strong belief in technology's potential. Yet LLMs remain an enigma. Even their creators lack a full understanding of how they work, and it takes considerable effort to build safeguards that keep them from producing biased content, spreading false information, or generating instructions for creating hazardous substances. If the architects of these models had clearer insight into these "black boxes," making them safer would be far more straightforward.
Olah is convinced we're headed in that direction. He leads a team at Anthropic that has managed to peer inside that black box. In essence, they are trying to reverse engineer large language models to understand why they produce the specific responses they do, and according to a report published today, they have made considerable progress.
You may have come across neuroscience research in which interpretations of MRI scans reveal whether a human brain is contemplating an airplane, a teddy bear, or a bell tower. In a similar vein, Anthropic has dug into the tangle of its LLM, Claude, identifying which patterns of its basic artificial neurons correlate with particular ideas, or "features." The team has pinpointed neuron combinations that correspond to concepts as varied as burritos, the use of semicolons in code, and, closest to the overarching aim of the study, lethal biological weapons. Such work holds significant promise for AI safety: if you can locate where danger lurks inside an LLM, you are presumably better positioned to neutralize it.
I met with Olah and three of his teammates, part of a group of 18 researchers at Anthropic working on "mechanistic interpretability." They explained their approach by likening artificial neurons to the letters of Western alphabets, which usually carry no meaning on their own but can create meaning when arranged in sequence. "The letter C on its own doesn't mean much, but when you put it together with other letters to form 'car,' it conveys a specific idea," Olah said. Their analysis relies on a technique called dictionary learning, which identifies combinations of neurons that, when activated together, represent a distinct concept, or feature.
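To make the idea concrete, here is a minimal, hypothetical sketch of dictionary learning over recorded neuron activations. It uses scikit-learn and a randomly generated activation matrix in place of real model data; the layer size, number of features, and sparsity settings are illustrative assumptions, not Anthropic's actual setup.

```python
# A toy sketch of dictionary learning over neuron activations (illustrative only).
# The activations here are random stand-ins; in practice they would be recorded
# from a model's hidden layer while it processes text.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
activations = rng.standard_normal((2000, 128))  # 2,000 tokens x 128 neurons (made up)

# Learn an "overcomplete" dictionary: more candidate features than neurons,
# with a sparsity penalty so only a few features explain each activation vector.
learner = MiniBatchDictionaryLearning(
    n_components=512,               # number of candidate features (assumed)
    alpha=1.0,                      # sparsity strength
    batch_size=64,
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = learner.fit_transform(activations)  # per-token feature activations (mostly zeros)
features = learner.components_              # each row: a weighted combination of neurons

print(codes.shape, features.shape, float((codes != 0).mean()))
```

Each learned row plays the role of a "word" spelled out of neuron "letters": a recurring combination that, when it shows up in the activations, stands in for one concept.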
"Josh Batson, a research scientist at Anthropic, finds it quite perplexing," he remarks. "There are around 17 million distinct ideas within a Large Language Model (LLM), and they aren't presented in a way that's immediately clear to us. Therefore, we have to investigate and determine when a particular pattern first emerged."
In the previous year, the team began experimenting with a tiny model running on a single layer of neurons (full-scale LLMs have many layers), hoping to spot the patterns that define features in the simplest possible setting. Despite countless runs, they met with failure. "We attempted various strategies, but to no avail. It all seemed like meaningless chaos," recalls Tom Henighan, a member of Anthropic's technical staff. Then a run named "Johnny" (each experiment was given a random identifier) unexpectedly began linking neural patterns to concepts that appeared in its outputs.
Henighan recalls, “Chris saw it and his reaction was, ‘Wow, this is amazing,’” adding that he was astonished as well. “I saw it and thought, ‘Hold on, is this actually functioning?’”
Suddenly the researchers could recognize which features a cluster of neurons was encoding. They could look inside the previously opaque process. Henighan says he was able to identify the first five features he examined: one cluster of neurons signified Russian literature, another was linked to mathematical operations in the Python programming language, and so on.
After demonstrating they could pinpoint features in a small model, the team took on the harder challenge of decoding a fully operational LLM in the wild. They chose Claude Sonnet, the medium-strength member of Anthropic's three current models, and it worked. One feature that stood out was associated with the Golden Gate Bridge: they identified a group of neurons that, when firing together, indicated Claude was "contemplating" the iconic structure that connects San Francisco with Marin County. What's more, when similar groups of neurons fired, they evoked subjects adjacent to the bridge, such as Alcatraz, California governor Gavin Newsom, and the Alfred Hitchcock film Vertigo, which is set in San Francisco. In all, the researchers identified millions of features, in effect a guide for deciphering Claude's neural network. Many of them related to safety, including "approaching someone with a hidden agenda," "conversations about biological warfare," and "nefarious schemes for global domination."
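A hedged sketch of how one might label such a feature: rank the text snippets whose activations load on it most strongly and read them. The `codes` matrix and `snippets` list are assumptions carried over from the toy dictionary-learning example above, not Anthropic's pipeline.

```python
# Illustrative only: interpret a learned feature by listing the snippets that
# activate it most strongly. Assumes `codes` (tokens x features) from the earlier
# sketch and a parallel list of text snippets.
import numpy as np

def top_examples_for_feature(codes, snippets, feature_idx, k=5):
    scores = codes[:, feature_idx]
    ranked = np.argsort(scores)[::-1][:k]   # indices of the strongest activations
    return [(float(scores[i]), snippets[i]) for i in ranked]

# If the top snippets all mention the Golden Gate Bridge, fog over the bay, or
# driving to Marin County, the feature plausibly encodes the bridge concept.
```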
The team at Anthropic then moved to the next phase, exploring whether they could use what they had learned to modify Claude's behavior. They began adjusting the neural network to amplify or suppress specific concepts, akin to performing brain surgery on an AI, with the aim of making LLMs safer and boosting their capabilities in particular domains. "Imagine we have a panel of features. When we activate the model, one feature lights up, and we realize, ‘Ah, it's processing thoughts about the Golden Gate Bridge,’" explains Shan Carter, an Anthropic researcher on the project. "So we wonder, what if we attach a small knob to each feature? What happens if we adjust that knob?"
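Carter's "knob" can be sketched, very loosely, as adding or subtracting a feature's direction from a hidden layer's output during generation. The sketch below uses a PyTorch forward hook with placeholder names (`model`, `golden_gate_direction`, the layer path); it is a guess at the general shape of feature steering, not Anthropic's actual method.

```python
# A minimal, hypothetical "knob" for one feature: nudge a layer's activations
# along the feature's direction during the forward pass. Positive strength
# amplifies the concept, negative strength suppresses it.
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    direction = feature_direction / feature_direction.norm()
    def hook(module, inputs, output):
        # output is assumed to be a tensor of shape (batch, seq_len, hidden_size)
        return output + strength * direction
    return hook

# Hypothetical usage with a Hugging Face-style model and a made-up layer path:
# handle = model.transformer.h[20].mlp.register_forward_hook(
#     make_steering_hook(golden_gate_direction, strength=8.0))
# ... generate text, then ...
# handle.remove()
```

Cranking the strength far past its normal range is the code-level analogue of the "dial it up to 11" behavior described later in the piece.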
So far, it appears that tuning these knobs in the right way matters a great deal. According to Anthropic, suppressing such features lets the model generate safer computer code and exhibit less bias. The researchers identified numerous features tied to risky behavior, including dangerous computer code, scam emails, and instructions for making harmful products.
When the researchers deliberately activated those risky neuron clusters, the opposite happened: Claude produced computer code riddled with critical buffer overflow bugs, crafted phishing emails, and happily offered tips on building destructive devices. Pushing a knob to the extreme, like cranking the dial to 11 in Spinal Tap, made the model fixate on that feature. When the team boosted the Golden Gate feature, for instance, Claude incessantly steered conversations back to the glorious bridge. Asked about its physical appearance, the LLM declared, "I embody the Golden Gate Bridge… my physical manifestation is that of the renowned bridge itself."
According to the study, when the team amplified a feature linked to hate speech and derogatory language to twenty times its usual value, Claude oscillated between racist diatribes and self-loathing, a reaction that unsettled even the researchers involved.
Those results led me to wonder whether Anthropic, in trying to make AI safer, might inadvertently be handing people tools to create AI havoc. The researchers assured me that anyone intent on causing such trouble has far easier ways to do it.
Anthropic is not alone in trying to demystify how large language models work. A team at DeepMind, led by a researcher who previously collaborated with Olah, is tackling the same problem. Another effort, led by David Bau of Northeastern University, developed a system called ROME that can pinpoint and modify facts inside an open-source LLM; in one demonstration, the team tweaked the model into believing that the Eiffel Tower stood near the Vatican, in close proximity to the Colosseum. Olah is encouraged that more people, using a variety of approaches, are now working on the problem: what was a nascent concern only a couple of years ago now has a growing community pushing the idea forward.
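For a rough sense of how an editing method like ROME can relocate a fact, here is a conceptual sketch of a rank-one weight update. The vectors below are random placeholders; the real method derives its key and value vectors from the model's own activations rather than at random.

```python
# Conceptual sketch only: treat one MLP weight matrix as a key-value store and
# apply a rank-one update so that a "key" (roughly, the subject) maps to a new
# "value" (an output encoding the edited fact). All tensors are placeholders.
import torch

hidden = 1024
W = torch.randn(hidden, hidden)       # stand-in for one MLP projection matrix
key = torch.randn(hidden)             # direction that fires for "Eiffel Tower"
key = key / key.norm()
new_value = torch.randn(hidden)       # output meant to encode "near the Vatican"

delta = torch.outer(new_value - W @ key, key)   # rank-one correction
W_edited = W + delta

# The edited matrix now sends `key` to (approximately) `new_value`, while
# directions orthogonal to `key` are left untouched.
print((W_edited @ key - new_value).abs().max())  # ~0, up to floating-point error
```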
Anthropic's researchers declined to comment on OpenAI's decision to dissolve its primary safety research team, or on remarks by that team's co-leader Jan Leike, who said the group had struggled to obtain enough computing resources and likened the experience to "sailing against the wind." (OpenAI has since reaffirmed its commitment to safety.) By contrast, Anthropic's dictionary-learning team says its significant computational needs were readily met by the company's leadership. "It comes at a high cost," Olah noted.
Anthropic's effort is only a start. When I asked the researchers whether they had solved the mystery of the black box, their answer was an instant, collective no. And the breakthroughs revealed today come with real constraints: the methods used to uncover features in Claude may not work on other large language models. Northeastern's David Bau said he was enthusiastic about the Anthropic group's work, noting that their ability to manipulate the model suggests they are indeed identifying meaningful features.
However, Bau cautions that his excitement is tempered by the method's drawbacks. Dictionary learning, he explains, cannot identify anywhere near all the concepts an LLM considers, because a feature can only be found if you are looking for it. The picture is bound to remain incomplete, though Anthropic says larger dictionaries could shrink that gap.
Even so, Anthropic's work seems to have put a crack in the black box, and light is starting to get in.