Maths test stumps AI models: which number is bigger, 9.90 or 9.11?

The wave of artificial intelligence (AI) chatbots allowed for public use in mainland China enables many users to create new content – including audio, code, images, simulations, videos and grammatically correct text – to entertain and help with everyday tasks.

That demand has led to the local development of more than 200 large language models (LLMs), the technology underpinning generative AI (GenAI) services like ChatGPT. LLMs are deep-learning AI algorithms that can recognise, summarise, translate, predict and generate content using very large data sets.

In spite of the resources behind these chatbots, AI models were shown to struggle with basic maths this past weekend, in a test prompted by the Chinese reality show Singer 2024, a singing competition produced by Hunan Television.

Mainland artist Sun Nan received 13.8 per cent of online votes to edge out US singer Chanté Moore, who received 13.11 per cent of votes. Some local netizens poked fun at the ranking, claiming that the latter number was larger. “Ask AI,” one commenter suggested. The results they got were mixed.

Both Moonshot AI’s chatbot Kimi and Baichuan’s Baixiaoying initially gave the wrong answer. They corrected themselves and apologised after the user who made the query adopted a so-called chain-of-thought approach, a reasoning method in which an AI application is guided step by step through a problem.

Alibaba Group Holding’s Qwen LLM used a Python Code Interpreter to calculate the answer, while Baidu’s Ernie Bot took six steps to get the correct answer. Alibaba owns the South China Morning Post. ByteDance’s Doubao LLM, by contrast, generated a direct response with an example: “If you have US$9.90 and US$9.11, clearly US$9.90 is more money.”
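Handing the comparison to a code interpreter sidesteps the problem, because the program compares the values numerically rather than predicting text. The sketch below is only an illustration of that idea: the code Qwen actually ran has not been published here, and the helper name `larger` is hypothetical.

```python
from decimal import Decimal


def larger(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger (hypothetical helper)."""
    # Decimal parses the digits exactly, so "9.90" and "9.9" are the same value.
    return a if Decimal(a) > Decimal(b) else b


print(larger("9.90", "9.11"))   # 9.90
print(larger("13.8", "13.11"))  # 13.8
```

The second call uses the vote shares from Singer 2024: 13.8 per cent is the same as 13.80 per cent, which is more than 13.11 per cent.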

“LLMs are bad at maths – it’s very common,” said Wu Yiquan, a computer science researcher at Zhejiang University in Hangzhou.

GenAI does not inherently possess mathematical capabilities and can only predict answers based on training data, according to Wu. He said some LLMs perform well on maths tests possibly because of “data contamination”, which means that the algorithm memorised the answers because similar questions were already in its training data.

“The world of AI is tokenised – numbers, words, punctuations and spaces are all treated the same,” Wu said. “Therefore, any change in the prompt can affect the result significantly.”
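One way to picture Wu's point is a toy example in which a number is broken into separate pieces and the pieces are compared one by one, much as software version numbers are. This is an assumption made purely for illustration: the splitter below is not the tokeniser of any real model, and the names `toy_tokenise` and `naive_piecewise_larger` are hypothetical.

```python
import re


def toy_tokenise(text: str) -> list[str]:
    """Split text into runs of digits, runs of letters, or single other symbols."""
    return re.findall(r"\d+|[A-Za-z]+|\S", text)


def naive_piecewise_larger(a: str, b: str) -> str:
    """Compare the digit runs left to right as whole integers (a flawed method)."""
    parts_a = [int(t) for t in toy_tokenise(a) if t.isdigit()]
    parts_b = [int(t) for t in toy_tokenise(b) if t.isdigit()]
    return a if parts_a > parts_b else b


print(toy_tokenise("9.11"))                     # ['9', '.', '11']
print(naive_piecewise_larger("9.9", "9.11"))    # 9.11 (wrong: 11 is treated as more than 9)
print(naive_piecewise_larger("9.90", "9.11"))   # 9.90 (right, but only by accident)
```

Writing the same value as 9.9 or as 9.90 flips the outcome of this flawed comparison, a small-scale picture of how a change in the prompt can change the result.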

The maths issue shows that AI technology continues to evolve not only on the mainland, but elsewhere around the world.

“The vast majority of experts believe the timing to craft unified national AI legislation may not yet be right since the technology is evolving so rapidly,” Zheng said.

The “number comparison testing” for AI models went viral after Allen Institute researcher Bill Yuchen Lin and Riley Goodside, a prompt engineer at tech firm Scale AI, highlighted the technology’s basic maths inadequacies on social media platform X.

When asked which number was bigger, 9.9 or 9.11, advanced LLMs such as OpenAI’s GPT-4o, Claude 3.5 Sonnet and Mistral AI answered 9.11.

In a post on X, Goodside said he does not intend to undermine LLMs, but aims to help understand and fix their failures.

“Previously well-known issues in LLMs (e.g., bad maths) are now mitigated so well the remaining errors are newly shocking to users – any reduction in frequency is also a delayed increase in severity,” he wrote. “We should be ready for this to keep happening across many task domains.”
