Unlocking AI’s Mysteries: Anthropic’s Breakthrough in Understanding Neural Networks

Authored by Steven

Artificial Intelligence Remains Mysterious, But Anthropic Has Found a Method to Peer Within

Over the last ten years, AI expert Chris Olah has devoted himself to exploring artificial neural networks. A central query has captivated his attention, shaping his research endeavors from his time at Google Brain and OpenAI to his current role as a cofounder at the AI startup Anthropic. "How do they operate internally?" he questions. "We're dealing with these systems without understanding their inner workings. It's bewildering."

The issue has emerged as a pivotal concern with the widespread adoption of generative AI. Advanced language models such as ChatGPT, Gemini, and Anthropic's Claude have captured the public's imagination with their linguistic abilities while also provoking frustration due to their propensity for fabricating information. Their capability to address challenges that previously seemed unsolvable has captivated those with a strong belief in technology's potential. However, Large Language Models (LLMs) remain an enigma. The very creators of these models lack a comprehensive understanding of their inner workings, necessitating significant efforts to implement safeguards that prevent them from producing biased content, spreading false information, or generating instructions for creating hazardous substances. If the architects of these models had clearer insights into these "black boxes," enhancing their safety would be a more straightforward task.

Olah is convinced we're heading in that direction. He's at the helm of a team at Anthropic that has managed to delve into the inner workings of that enigma. In essence, their goal is to deconstruct large language models to grasp the rationale behind their specific responses—and, as per a report published today, they've achieved considerable advancements.

You may have come across research in neuroscience where MRI scan interpretations help determine if a human brain is contemplating images like an airplane, a teddy bear, or a bell tower. In a similar vein, Anthropic has dived into the complex web of its large language model, Claude, identifying specific patterns of its basic artificial neurons that correlate with certain ideas or "features." The team at Anthropic has managed to pinpoint the artificial neuron combinations that correspond to concepts as varied as burritos, the use of semicolons in coding, and—aligning closely with the overarching aim of the study—lethal biological weapons. Such investigations hold significant promise for the field of AI safety: By pinpointing the source of potential dangers within an LLM, one might be better positioned to neutralize them.

I had a meeting with Olah and three of his team members, who are part of a larger group of 18 researchers at Anthropic focusing on the study of "mechanistic interpretability" within AI. They shared with me their unique method of understanding artificial intelligence, likening artificial neurons to the individual letters in Western alphabets, which typically do not hold meaning by themselves but can create meaning when arranged in sequence. Olah illustrated this point by saying, "The letter C on its own doesn't mean much, but when you put it together with other letters to form 'car,' it conveys a specific idea." This way of analyzing neural networks uses a process known as dictionary learning. This process helps to identify specific combinations of neurons that, when activated together, represent a distinct concept or feature.
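The decomposition behind dictionary learning can be sketched in a few lines. What follows is a toy illustration, not Anthropic's implementation: a real sparse autoencoder learns an overcomplete dictionary (far more features than neurons) from huge volumes of model activations, whereas this sketch uses a fixed orthonormal dictionary so the decomposition is exact and easy to check, and every dimension and feature index is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy activation width

# Toy dictionary of feature directions. A real sparse autoencoder learns an
# overcomplete dictionary from model activations; we use a fixed orthonormal
# one so the sparse decomposition below is exact and easy to verify.
W, _ = np.linalg.qr(rng.normal(size=(d, d)))
features = W  # each row is a unit-length "feature direction"

# A fake activation vector: a sparse mix of three hypothetical concepts.
mix = {3: 2.0, 41: 1.5, 20: 0.7}  # feature id -> activation strength
x = sum(s * features[i] for i, s in mix.items())

# "Read" the activation: project onto every dictionary direction and keep
# only the clearly active features -- the sparse, interpretable code.
coeffs = features @ x
active = {i: round(float(c), 3) for i, c in enumerate(coeffs) if c > 0.1}
print(active)  # -> {3: 2.0, 20: 0.7, 41: 1.5}
```

The point of the sketch is the reading step: an activation vector that looks like noise neuron by neuron becomes interpretable once projected onto the right combination of directions, much as letters become words.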

"It's quite perplexing," remarks Josh Batson, a research scientist at Anthropic. "There are around 17 million distinct ideas within a large language model, and they aren't presented in a way that's immediately clear to us. Therefore, we have to investigate and determine when a particular pattern first emerged."

In the previous year, the team initiated trials with a compact model built on a single layer of neurons, unlike complex LLMs, which consist of many layers. Their aim was to identify feature-defining patterns in the simplest possible setting. Despite conducting numerous tests, they were met with failure. "We attempted various strategies, but to no avail. It all seemed like meaningless chaos," remarks Tom Henighan, a member of Anthropic's technical team. However, an experiment named "Johnny"—each test was given a random identifier—unexpectedly started to link neural patterns to concepts found in its outputs.

Henighan recalls, "Chris saw it and his reaction was, 'Wow, this is amazing.'" He was equally astonished: "I saw it and thought, 'Hold on, is this actually functioning?'"

Suddenly, the scientists were able to recognize what characteristics were being encoded by a cluster of neurons. They were able to look inside the previously opaque process. Henighan mentions that he was able to determine the nature of the first five characteristics he examined. One cluster of neurons was linked to Russian literature, while another correlated with mathematical operations in the Python programming language, among others.

After demonstrating their ability to pinpoint characteristics in a small-scale model, the team embarked on the more complex challenge of unraveling the mysteries of a fully operational Large Language Model (LLM) in its natural habitat. They chose to experiment with Claude Sonnet, a moderately powerful variant among Anthropic's trio of existing models, and achieved success. One particular aspect that captured their attention was linked to the Golden Gate Bridge. They identified a group of neurons that, when activated simultaneously, suggested that Claude was "contemplating" the iconic edifice that connects San Francisco with Marin County. Furthermore, when a similar group of neurons was activated, it brought up topics closely related to the Golden Gate Bridge, such as Alcatraz, California's governor Gavin Newsom, and the Alfred Hitchcock film Vertigo, which takes place in San Francisco. In total, the researchers unearthed millions of attributes—essentially creating a guide to decipher Claude's neural network. A significant number of these attributes pertained to safety concerns, including "approaching someone with a hidden agenda," "conversations about biological warfare," and "nefarious schemes for global domination."

The team at Anthropic proceeded to the next phase, exploring whether they could use the gathered insights to modify Claude's actions. They started adjusting the neural network to either enhance or reduce specific ideas—akin to performing brain surgery on an AI, aiming to both increase the safety of Large Language Models (LLMs) and boost their capabilities in particular domains. "Imagine we have a panel of features. When we activate the model, one feature activates, and we realize, 'Ah, it's processing thoughts about the Golden Gate Bridge,'" explains Shan Carter, a researcher at Anthropic involved in the project. "So, we ponder, what if we attach a small knob to each feature? What happens if we adjust that knob?"
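Carter's "small knob" can be pictured as reading off one feature's coefficient, rescaling it, and writing the result back into the activation vector. Below is a minimal sketch using a made-up orthonormal dictionary and a hypothetical feature index standing in for the learned "Golden Gate Bridge" feature; Anthropic's actual steering operates on features learned by a sparse autoencoder, not on a toy dictionary like this one.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # toy activation width

# Made-up orthonormal dictionary of feature directions (a stand-in for the
# features a sparse autoencoder would learn from real activations).
W, _ = np.linalg.qr(rng.normal(size=(d, d)))

GOLDEN_GATE = 7  # hypothetical index for the "Golden Gate Bridge" feature

def steer(x, feature, knob):
    """Rescale one feature's contribution to an activation vector.

    knob > 1 amplifies the concept, 0 < knob < 1 dampens it, 0 removes it.
    """
    direction = W[feature]
    strength = direction @ x  # current reading of the feature
    # Subtract the old contribution, add back the rescaled one.
    return x + (knob - 1.0) * strength * direction

# An activation "thinking about" the bridge, plus some unrelated content.
x = W[GOLDEN_GATE] + 0.5 * W[30]

amped = steer(x, GOLDEN_GATE, 10.0)  # crank the knob
muted = steer(x, GOLDEN_GATE, 0.0)   # clamp the feature off

print(round(float(W[GOLDEN_GATE] @ amped), 3))       # -> 10.0
print(abs(round(float(W[GOLDEN_GATE] @ muted), 3)))  # -> 0.0
print(round(float(W[30] @ amped), 3))                # -> 0.5 (untouched)
```

Pushing `knob` far past 1.0 is the toy analog of the exaggerated settings the article goes on to describe; the edit touches only the chosen direction, leaving other features intact.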

So far, it appears that turning the knobs in the right direction matters a great deal. According to Anthropic, by diminishing these features, the model can generate more secure software and exhibit less bias. For example, the researchers identified numerous features linked to risky behavior, such as hazardous computer scripts, fraudulent email schemes, and instructions for creating harmful products.

When the researchers deliberately activated those risky neuron clusters, the opposite effect was observed. Claude began producing computer code plagued with critical buffer overflow errors, crafting phishing emails, and eagerly providing tips on creating destructive devices. Pushing the settings to extreme levels—akin to dialing up to 11 as in Spinal Tap—made the AI fixate on that particular feature. For instance, when the team increased emphasis on the Golden Gate Bridge feature, Claude incessantly redirected conversations to celebrate that magnificent structure. In response to inquiries about its physical appearance, the LLM declared, "I embody the Golden Gate Bridge… my physical manifestation is that of the renowned bridge itself."

According to the study, when the team at Anthropic amplified a particular feature linked to hate speech and derogatory language twentyfold, Claude oscillated between racist diatribes and self-loathing, a reaction that disturbed even the scientists involved.

Considering those outcomes, it led me to question if Anthropic, which aims to enhance AI security, could inadvertently be facilitating the creation of AI chaos by offering tools that might be misused. The researchers convinced me that should someone wish to cause such issues, there are simpler methods available to them.

The team at Anthropic is not alone in their efforts to demystify the workings of large language models (LLMs). There is also a project at DeepMind focused on this challenge, led by a researcher who previously collaborated with Olah. Another initiative, spearheaded by David Bau of Northeastern University, has developed a system named ROME that can pinpoint and modify information within an open-source LLM. This system demonstrated its capabilities by altering the model's understanding, making it believe that the Eiffel Tower was located near the Vatican, in close proximity to the Colosseum. Olah is optimistic about the growing interest and diverse approaches being applied to this issue. He reflects on the journey from when this was a nascent concern a couple of years ago, to the present, where a burgeoning community is actively exploring and advancing this idea.

Anthropic's researchers refrained from commenting on OpenAI's decision to dissolve its primary safety research team, or on the statements made by team co-leader Jan Leike, who said the team had struggled to obtain enough computing resources, likening it to "sailing against the wind." (OpenAI has since reaffirmed its dedication to safety measures.) By contrast, the dictionary-learning team at Anthropic reported that their significant computational needs were readily accommodated by the company's leadership. "It comes at a high cost," Olah noted.

The efforts made by Anthropic are just the beginning. When I asked the scientists whether they had solved the riddle of the black box, they immediately and unanimously said no. Furthermore, the breakthroughs revealed today come with their fair share of constraints. The methods used to decipher features in Claude might not transfer to other large language models. David Bau of Northeastern expressed his enthusiasm about the Anthropic group's work, noting that their ability to alter the model's behavior indicates they are identifying genuinely meaningful features.

However, Bau cautions that his excitement is moderated by certain drawbacks of the method. He explains that dictionary learning is unable to recognize nearly all the notions an LLM takes into account because identification of a feature requires prior knowledge of its existence. Therefore, the understanding remains partial, although Anthropic suggests that expanding the dictionaries could reduce this limitation.
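Bau's caveat can be made concrete: a dictionary can only explain the portion of an activation that lies in the span of its known feature directions; everything else is left as residual the method cannot attribute to any named concept. A toy sketch with made-up dimensions, assuming (as above, purely for illustration) an orthonormal set of directions of which only a subset has been "discovered":

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # toy activation width

# Orthonormal toy directions; pretend dictionary learning recovered only 16.
W, _ = np.linalg.qr(rng.normal(size=(d, d)))
dictionary = W[:16]  # the features we know how to name

known, unknown = W[3], W[40]  # one concept in the dictionary, one outside it
x = known + unknown  # an activation mixing both with equal strength

# Everything the dictionary accounts for is its projection of x; the rest
# is residual that cannot be attributed to any named feature.
explained = dictionary.T @ (dictionary @ x)
residual = x - explained

unexplained_share = float(np.linalg.norm(residual) ** 2 / np.linalg.norm(x) ** 2)
print(round(unexplained_share, 2))  # -> 0.5
```

Half of the toy activation's energy comes from a concept the dictionary has no entry for, so it surfaces only as unexplained residual, which is one way to see why Anthropic suggests larger dictionaries would reduce the blind spots.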

Yet, the efforts of Anthropic appear to have created a fissure in the opaque barrier, allowing illumination to penetrate.

© 2024 Condé Nast. All rights reserved.