Apple engineers show how fragile AI 'reasoning' can be

For some time now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, however, a new study by six Apple engineers shows that the mathematical “reasoning” displayed by large, advanced language models can be extremely fragile and unreliable in the face of seemingly insignificant changes in common benchmark problems.

The fragility highlighted in these new findings helps support previous research suggesting that LLMs’ use of probabilistic pattern matching lacks the formal understanding of the underlying concepts necessary for truly reliable mathematical reasoning skills. “Current LLMs are not capable of true logical reasoning,” the researchers hypothesize based on these results. “Instead, they attempt to replicate the reasoning steps observed in their training data.”

Mix it up

In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models” – currently available as a preprint – the six Apple researchers start from GSM8K, a standardized set of more than 8,000 grade-school math word problems that is often used as a benchmark for complex reasoning skills in modern LLMs. They then take a new approach, modifying part of this test set to dynamically replace certain names and numbers with new values. So a GSM8K question about Sophie getting 31 building blocks for her nephew could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.

This approach helps avoid any potential “data contamination” that can result from static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes do not alter the actual difficulty of the underlying mathematical reasoning, meaning that models should theoretically perform just as well on GSM-Symbolic as on GSM8K.
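The substitution idea can be sketched as a simple templating step. This is a hypothetical illustration only – the template text, name lists, and the `sample_variant` helper below are invented for the sketch, not the paper’s actual tooling:

```python
import random

# Turn a concrete GSM8K-style question into a template, then sample fresh
# names and numbers so the surface form changes while the reasoning steps
# required to solve it stay identical.
TEMPLATE = (
    "{name} buys {n} building blocks for a {relative}. "
    "{name} gives away {k} of them. How many blocks are left?"
)

def sample_variant(seed=None):
    rng = random.Random(seed)
    n = rng.randint(10, 50)
    k = rng.randint(1, n - 1)  # keep the subtraction well-defined
    question = TEMPLATE.format(
        name=rng.choice(["Sophie", "Bill", "Maya"]),
        n=n,
        k=k,
        relative=rng.choice(["nephew", "brother", "cousin"]),
    )
    return question, n - k  # ground-truth answer travels with the variant

question, answer = sample_variant(seed=0)
```

Because the ground truth is computed alongside each variant, every sampled question can be graded automatically without reusing any fixed text that might already sit in a model’s training set.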

Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found reduced average accuracy across the board compared to GSM8K, with performance drops ranging between 0.3% and 9.2%, depending on the model. The results also showed large variance across 50 separate runs of GSM-Symbolic with different names and values. Accuracy gaps of up to 15% between the best and worst runs were common within the same model, and, for some reason, changing the numbers tended to hurt accuracy more than changing the names.

This type of variance, both across the different GSM-Symbolic runs and relative to the GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing “formal” reasoning but are instead “attempt[ing] to perform a sort of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”

Don’t get distracted

Nevertheless, the overall variance shown in the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2% accuracy on GSM8K to a still-impressive 94.9% on GSM-Symbolic. That is a fairly high success rate by either benchmark, regardless of whether the model itself uses any “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).

The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” (short for “no operation”) benchmark set, a question about how many kiwis someone picks over several days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
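The distractor idea is mechanically simple. The sketch below is a hypothetical illustration of the GSM-NoOp pattern – the base question, distractor wording, and `add_noop` helper are all invented for this example, though they mirror the kiwi scenario the paper describes:

```python
# A statement that sounds relevant but changes nothing about the arithmetic
# is inserted before the final question sentence. The correct answer to the
# modified question is unchanged.
BASE = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis does he have?"
)
DISTRACTOR = "Five of them were a bit smaller than average."

def add_noop(question, distractor):
    # Split off the final question sentence and wedge the irrelevant
    # statement in between; the answer (44 + 58 = 102) does not change.
    statements, final_question = question.rsplit(". ", 1)
    return f"{statements}. {distractor} {final_question}"

noop_question = add_noop(BASE, DISTRACTOR)
```

A human solver simply ignores the inserted clause; a model that pattern-matches “five of them” against subtraction problems seen in training may wrongly subtract it, which is exactly the failure mode the researchers report.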

Adding these red herrings led to what the researchers called “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5% to 65.7% depending on the model tested. These massive drops highlight the inherent limits of using simple “pattern matching” to “convert statements into operations without truly understanding their meaning,” the researchers write.
