Apple Engineers Unveil the Fragility of AI’s ‘Reasoning’: A Deep Dive into Mathematical Weaknesses

Recently, companies such as OpenAI and Google have promoted the advanced "reasoning" abilities of their newest AI models as a major step forward. However, a new study by six Apple engineers shows that the mathematical "reasoning" of these cutting-edge large language models can be surprisingly fragile and inconsistent when standard benchmark problems are changed in seemingly trivial ways.

The fragility highlighted in these results supports earlier research suggesting that large language models (LLMs) rely on probabilistic pattern matching rather than a formal understanding of the underlying concepts, which falls short of what dependable mathematical reasoning requires. Based on their results, the researchers hypothesize that current LLMs are not capable of genuine logical reasoning and instead attempt to replicate the reasoning steps observed in their training data.

Shaking Things Up

In the preprint paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," the six Apple researchers probe the limits of mathematical problem-solving in large language models. They start with GSM8K, a collection of more than 8,000 elementary math word problems that is commonly used as a benchmark for these models' reasoning abilities. They then modify part of that dataset by swapping specific names and numbers for new values. For instance, a GSM8K problem in which Sophie gives 31 building blocks to her nephew could become a problem in which Bill gives 19 building blocks to his brother in the new GSM-Symbolic evaluation.

This article was first published on Ars Technica, a reliable platform for news related to technology, analysis of tech policies, critiques, and beyond. Ars Technica operates under the ownership of Condé Nast, the same parent company as WIRED.

This approach helps avoid the "data contamination" that can occur when the unaltered GSM8K questions end up directly in a model's training data. At the same time, these incidental changes do not alter the difficulty of the underlying mathematical reasoning, so in theory a model should perform about as well on GSM-Symbolic as it does on GSM8K.
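The name-and-number substitution described above can be sketched as a small template generator. This is only an illustration: the template wording, the name lists, the value ranges, and the `make_variant` helper are all assumptions made for the example, not material from GSM8K or the paper.

```python
import random

# Hypothetical template in the spirit of the Sophie/Bill example; the wording
# and value ranges are illustrative, not taken from GSM8K or GSM-Symbolic.
TEMPLATE = (
    "{name} has {total} building blocks and gives {given} of them to "
    "{pronoun} {relative}. How many building blocks does {name} have left?"
)
PEOPLE = [("Sophie", "her"), ("Bill", "his"), ("Maya", "her"), ("Omar", "his")]
RELATIVES = ["nephew", "brother", "cousin"]


def make_variant(rng):
    """Return (question, answer) with freshly sampled names and numbers."""
    name, pronoun = rng.choice(PEOPLE)
    total = rng.randint(40, 99)
    given = rng.randint(5, 39)  # always below total, so the answer stays positive
    question = TEMPLATE.format(
        name=name,
        total=total,
        given=given,
        pronoun=pronoun,
        relative=rng.choice(RELATIVES),
    )
    # The required operation (one subtraction) is the same for every variant.
    return question, total - given


rng = random.Random(0)  # fixed seed so the generated set is reproducible
variants = [make_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

Every variant needs the same single subtraction, which is why a system that actually reasons through the problem would be expected to score the same on all of them.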

When the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found that average accuracy fell across the board compared with GSM8K, with drops of between 0.3 percent and 9.2 percent depending on the model. The results also varied widely across 50 separate runs of GSM-Symbolic that used different names and values. Gaps of up to 15 percent in accuracy between the best and worst runs were common for a single model. Changing the numbers tended to hurt accuracy more than changing the names.
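The two measurements in this paragraph, the best-to-worst gap across runs and the average drop relative to GSM8K, are simple to compute. The sketch below uses made-up accuracy figures for a single hypothetical model; none of the numbers come from the paper.

```python
from statistics import mean

# Made-up per-run accuracies (percent) for one hypothetical model. The study
# used 50 runs; five are shown here to keep the example short.
run_accuracy = [78.0, 71.5, 83.0, 69.0, 80.5]
gsm8k_accuracy = 84.0  # placeholder baseline, not a reported result

spread = max(run_accuracy) - min(run_accuracy)  # best run minus worst run
drop = gsm8k_accuracy - mean(run_accuracy)      # baseline minus average run

print(f"best-to-worst gap: {spread:.1f} points")
print(f"average drop vs. GSM8K: {drop:.1f} points")
```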

This level of variance, both across GSM-Symbolic runs and in comparison with GSM8K results, is surprising because, as the researchers point out, the reasoning steps needed to solve each problem stay the same. The fact that such small changes produce such different results suggests to the researchers that these models are not doing any formal reasoning. Instead, they appear to be matching each question and its solution steps against similar examples seen during training.

Maintain Focus

In the broader context, the overall variance in the GSM-Symbolic tests was often small. OpenAI's GPT-4o, for example, slipped from 95.2 percent accuracy on GSM8K to a still-strong 94.9 percent on GSM-Symbolic. That is a high success rate on either benchmark, regardless of whether the model is using formal reasoning behind the scenes. However, accuracy for many models dropped sharply when the researchers added just one or two additional logical steps to the problems.

The tested LLMs fared much worse when the Apple researchers modified the GSM-Symbolic benchmark by adding statements that seem relevant but have no bearing on the answer. In this "GSM-NoOp" benchmark (short for "no operation"), a question about how many kiwis someone picks over several days might be modified to include an irrelevant detail, such as "five of the kiwis were slightly smaller than usual."

Adding these red herrings led to what the researchers described as severe drops in accuracy compared with GSM8K, ranging from 17.5 percent to 65.7 percent depending on the model tested. According to the study's authors, drops of this size expose the inherent limits of relying on simple pattern matching to convert statements into operations without truly understanding their meaning.

In the example with the smaller kiwis, most models try to subtract the smaller fruits from the final total. The researchers surmise that this happens because the training data included similar examples that required conversion to subtraction operations. They describe this as a critical flaw that points to deeper problems in the models' reasoning processes, problems that cannot be fixed through fine-tuning or similar refinements.
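A short sketch makes the kiwi failure concrete. The daily counts, the name "Ana", and the `add_noop` helper are hypothetical, since the article does not give the numbers used in the original problem. Only the distractor sentence comes from the text above.

```python
def add_noop(question, distractor):
    """Insert an irrelevant sentence before the final question sentence."""
    body, _, ask = question.rpartition(". ")
    return f"{body}. {distractor} {ask}"


# Hypothetical problem: the counts and the name are invented for illustration.
picked_per_day = [40, 55, 80]
question = (
    "Ana picks 40 kiwis on Friday, 55 on Saturday, and 80 on Sunday. "
    "How many kiwis does Ana have?"
)
noop_question = add_noop(
    question, "Five of the kiwis were slightly smaller than usual."
)

smaller_than_usual = 5  # the distractor: size does not change the count
correct_total = sum(picked_per_day)
# The mistake described above: treating the distractor as a quantity to subtract.
pattern_matched_total = correct_total - smaller_than_usual

print(noop_question)
print("correct:", correct_total)
print("with spurious subtraction:", pattern_matched_total)
```

The added sentence changes the text of the question but not the arithmetic, so the correct answer is identical before and after the modification.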

The Mirage of Comprehension

The findings of this new GSM-Symbolic paper are not entirely new to AI research. Other recent studies have also suggested that LLMs do not actually perform formal reasoning. Rather, they imitate it by using probabilistic pattern matching to find the closest examples in their vast training data.

Still, the new research underscores how fragile this kind of mimicry can be when a prompt pushes a model in a direction that does not closely match its training data. It also highlights the inherent limits of attempting high-level reasoning without any underlying model of the logic or context involved. As Benj Edwards wrote in a July Ars Technica story about AI video generation:

A significant factor behind the attention GPT-4, developed by OpenAI, captured in the realm of text generation is its considerable scale. This scale allowed it to consume vast amounts of information during its training phase, leading to performances that seemingly indicate a deep comprehension and ability to replicate the complexity of the world. However, the essence of its effectiveness lies in its extensive knowledge base, which surpasses that of many humans, enabling it to astound through the innovative recombination of known concepts. As the reservoir of training data and computational power expands, it is anticipated that the field of AI, particularly in the domain of video synthesis, will advance towards what may be described as a semblance of understanding.

It seems we're experiencing a comparable phenomenon of perceived comprehension with the newest AI "reasoning" systems, and observing how this false impression can shatter when the system encounters unforeseen circumstances.

AI expert Gary Marcus, in his analysis of the GSM-Symbolic paper, argues that the next major leap in AI capability will come only when these systems can integrate genuine symbol manipulation. That means representing some knowledge abstractly, in terms of variables and operations over those variables, much as in algebra and conventional computer programming. Until then, he suggests, AI models will keep showing the kind of brittle "reasoning" that leads them to fail math tests in ways calculators never do.
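To make "variables and operations over those variables" concrete, here is a minimal sketch of an abstract representation of the building-blocks problem. It is not a description of Marcus's proposal or of any existing system. The `Var`, `Sub`, and `evaluate` names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A named placeholder whose value is supplied later."""
    name: str


@dataclass(frozen=True)
class Sub:
    """Subtraction of one expression from another."""
    left: object
    right: object


def evaluate(expr, env):
    """Evaluate an expression tree against a mapping of variable values."""
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Sub):
        return evaluate(expr.left, env) - evaluate(expr.right, env)
    raise TypeError(f"unsupported expression: {expr!r}")


# One abstract rule covers every surface variant of the problem.
remaining = Sub(Var("total"), Var("given"))

print(evaluate(remaining, {"total": 50, "given": 31}))
print(evaluate(remaining, {"total": 70, "given": 19}))
```

Because the rule is stated over variables rather than over specific names and numbers, swapping in new values cannot change how it is applied, which is the property the benchmark variants are testing for.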



