June 10, 2025

New research and high-profile failures are reigniting debate over what generative AI can actually do, and what it may never achieve. Apple researchers recently published two papers arguing that artificial general intelligence (AGI) is a myth and that current AI models, including chatbots like ChatGPT, will never achieve true reasoning. But that view is being challenged by surprising new results from OpenAI's latest models, even as a decades-old Atari chess game continues to trip them up.
The Apple papers assert that AI systems rely heavily on statistical pattern matching and lack the core flexibility and understanding needed for general reasoning. “AI models do not reason — they simulate the appearance of reasoning,” the papers claim, adding that models trained on large-scale data cannot transfer knowledge in ways humans do.
Yet a closed-door event involving top mathematicians told a different story. OpenAI's o4-mini, one of the company's latest reasoning models, was tasked with solving difficult math problems created by experts for a new benchmark called FrontierMath. Researchers, including University of Virginia mathematician Ken Ono, watched as the model solved advanced problems in number theory, sometimes in minutes. "I've never seen this kind of reasoning before in models. That's what a scientist does. That's frightening," Ono said.
The AI reportedly read the relevant literature, attempted simplified versions of problems, and constructed final answers in real time, mimicking how humans approach discovery. According to the organizers, o4-mini solved most of the 50 held-out problems designed specifically to test capabilities beyond its training data.
But if that sounds like AGI, consider this: just weeks earlier, Tom's Hardware reported that ChatGPT was trounced in a beginner-level chess match by Atari's Video Chess running in an emulator of the 1977 Atari 2600. The model made illegal moves, failed to adapt to simple tactics, and seemed unaware of basic rules, sparking ridicule and doubt.
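That particular failure mode is easy to verify mechanically, because chess legality is fully specified. Below is a minimal sketch of the kind of harness that can catch an illegal move from a language model, using the open-source python-chess library; the function name apply_llm_move and the sample moves are illustrative assumptions, not details from the Tom's Hardware test.

```python
import chess

def apply_llm_move(board: chess.Board, move_san: str) -> bool:
    """Try to apply a model-suggested move in standard algebraic notation.

    Returns True if the move is legal in the current position and was
    applied; False if the move is unparseable or illegal.
    """
    try:
        # push_san raises a ValueError subclass for illegal,
        # ambiguous, or unparseable moves.
        board.push_san(move_san)
        return True
    except ValueError:
        return False

board = chess.Board()
print(apply_llm_move(board, "e4"))   # True: a legal opening move for White
print(apply_llm_move(board, "Nf7"))  # False: Black's g8 knight cannot reach f7
```

A harness like this makes the failure unambiguous: a model that tracks board state should never trigger the exception path, so any rejected move is direct evidence that the model has lost the position.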
These conflicting results fuel a growing divide in the AI community. Some, like Apple, argue that no amount of scale can compensate for the lack of conceptual grounding. Others point to hybrid systems like DeepMind’s AlphaGeometry and AlphaEvolve — which combine logic, search, and learning — as signs that AI may be inching closer to real problem solving.
As quantum computing researcher and former OpenAI advisor Scott Aaronson put it: "People say GPT is just a next-token predictor. It doesn't really learn or judge. But what are you? Just a bundle of neurons and synapses?"
For now, it appears AI’s greatest strength — mimicking intelligence — is also its greatest liability. It can impress or fail spectacularly, depending on the domain, the task, and how you ask.
