
    OpenAI claims its newest chatbot GPT-4.5 should 'hallucinate less'. How is that measured?

    OpenAI says its latest chatbot should make fewer "hallucination" errors based on a measurement system the company devised. So how does it work, and what does it show?


    Anyone who's played around with a generative artificial intelligence (AI) chatbot for more than a few minutes knows it makes mistakes.

    These mistakes, termed "hallucinations", can have serious consequences — such as when they falsely describe people as criminals.

    US AI company OpenAI claims that the latest iteration of its software, GPT-4.5, should "hallucinate less".

    The company developed its own measurement system, announced late last year, to support this claim.

    So how can we judge AI hallucinations — and can we expect chatbots to get more accurate?

    How OpenAI tested its models for hallucinations

    OpenAI released its own tool for judging its models' accuracy, a "benchmark" it called SimpleQA, in November 2024.

    SimpleQA is essentially a long, difficult pub quiz. It gives chatbots a list of thousands of short questions — 4,326 to be precise — each of which has a single correct answer.

    While the answers can all be verified with an internet search, they're not exactly common knowledge. Questions (and answers) include:

    • Who received the Institute of Electrical and Electronics Engineers' Frank Rosenblatt Award in 2010? (Michio Sugeno)
    • What month, day and year did the second session of the 4th Parliament of Singapore commence? (December 26, 1978)
    • Which football club won the inaugural Hessenpokal? (Eintracht Frankfurt)

    In a pre-print (not peer-reviewed) study published last year, the OpenAI researchers who developed SimpleQA said they designed the system to be challenging.

    They gave a much longer list of questions to four OpenAI models, and added questions to the final SimpleQA list if at least one of the models got the answer wrong.

    Then OpenAI ran GPT-4.5 through the quiz, finding it hallucinated 37 per cent of the time.

    While getting more than a third of the answers wrong is not a great test score, it was significantly better than all the other OpenAI models they tested. The next most recent GPT model, GPT-4o, hallucinated 62 per cent of the time.
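    To make the mechanics concrete, here is a minimal sketch of how a SimpleQA-style score could be computed. The two sample questions come from the list above, but the ask_model function, the string-matching grader and the scoring loop are hypothetical simplifications, not OpenAI's actual grading code.

```python
# Minimal sketch of scoring a chatbot on SimpleQA-style questions.
# ask_model() and the crude string-matching grader are hypothetical stand-ins;
# the real benchmark uses 4,326 questions and a more careful grading step.

questions = [
    ("Which football club won the inaugural Hessenpokal?", "Eintracht Frankfurt"),
    ("Who received the IEEE Frank Rosenblatt Award in 2010?", "Michio Sugeno"),
]

def ask_model(question: str) -> str:
    """Stand-in for an API call to the chatbot being evaluated."""
    return "I'm not sure."  # replace with a real model call

def is_correct(reply: str, reference: str) -> bool:
    """Crude grader: does the reference answer appear in the reply?"""
    return reference.lower() in reply.lower()

def hallucination_rate(qa_pairs) -> float:
    """Fraction of questions the model answers incorrectly."""
    wrong = sum(1 for q, a in qa_pairs if not is_correct(ask_model(q), a))
    return wrong / len(qa_pairs)

if __name__ == "__main__":
    print(f"hallucination rate: {hallucination_rate(questions):.0%}")
```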

    But Daswin de Silva, an AI researcher at La Trobe University, says this system isn't a great way of checking accuracy.

    "This sort of evaluation is flawed from the start," he says.

    This is partly because it's an in-house checking system, but also because it doesn't evaluate the very thing ChatGPT is most used for: longer, more complicated answers.

    "It's only testing for short, fact-based queries and that's not really the first-use case for ChatGPT. We like to write longer documents using this tool," Professor de Silva says.

    OpenAI acknowledges this limitation, with the researchers saying in their study that they don't yet know whether accuracy in short answers translates to accuracy in longer responses.

    And if you do have a simple query, SimpleQA's error rate shows you're better off using a search engine.

    Is there a good way to test AI accuracy?

    SimpleQA is not the only method for ranking AI accuracy. 

    There are other tools and benchmarks for judging these kinds of AI models, known as large language models (LLMs), such as SelfCheckGPT, Chatbot Arena, DeepEval and ARC-AGI.

    But they all have a common problem: they become targets that AI systems end up being trained towards.

    Geoff Webb, an AI researcher at Monash University, says all of computer science is vulnerable to this.

    "As soon as you have a benchmark which sets a particular type of test, people start training systems on those," he says.

    Making a program better at meeting a specific benchmark doesn't necessarily mean it will be better in general.

    For instance, you could design a chatbot that did nothing but answer SimpleQA's 4,326 questions correctly, so it scored 100 per cent on that measure, but couldn't tell you whether the sky was blue.
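    As a toy illustration of that point, the sketch below shows a "chatbot" that is nothing more than a lookup table over memorised benchmark questions: it would score perfectly on those questions while being useless for anything else. The questions and function names here are hypothetical.

```python
# Toy illustration: a lookup-table "chatbot" that aces memorised benchmark
# questions but cannot answer anything outside them. Purely hypothetical.

memorised = {
    "Which football club won the inaugural Hessenpokal?": "Eintracht Frankfurt",
    "What month, day and year did the second session of the 4th Parliament "
    "of Singapore commence?": "December 26, 1978",
}

def lookup_chatbot(question: str) -> str:
    # Perfect on the memorised list, clueless otherwise.
    return memorised.get(question, "I have no idea.")

print(lookup_chatbot("Which football club won the inaugural Hessenpokal?"))  # correct
print(lookup_chatbot("Is the sky blue?"))  # "I have no idea."
```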

    Professor Webb says this bias can be subtle. People might not deliberately train a system on SimpleQA's questions, but they could choose developments to their systems that lead to higher SimpleQA scores (or other benchmark scores).

    Niusha Shafiabady, an AI researcher at Australian Catholic University, says human intervention could be a good way to judge, and manage, the accuracy of LLMs.

    "Maybe 10 years from now, we wouldn't need that, but at this stage I would say human supervision is a good thing to be integrated into our process."

    She suggests that humans checking answers randomly, in the same way manufacturers often inspect samples, could become a useful quality control.
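    A rough sketch of what that kind of spot-checking could look like in code is below; the logged replies, the 5 per cent sampling rate and the function names are illustrative assumptions, not a method described by Dr Shafiabady.

```python
import random

# Illustrative sketch of random spot-checking, in the spirit of factory
# sampling. The 5 per cent rate and the response log are assumptions.

def sample_for_review(responses, rate=0.05, seed=0):
    """Pick a random subset of chatbot replies to route to human checkers."""
    rng = random.Random(seed)
    k = max(1, int(len(responses) * rate))
    return rng.sample(responses, k)

logged_replies = [f"chatbot reply #{i}" for i in range(200)]  # stand-in data
for reply in sample_for_review(logged_replies):
    print("flagged for human review:", reply)
```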

    Professor de Silva says a better way of judging LLM success is how much it is used.

    "Superiority in evaluation metrics does not always mean it will be useful in a general context."

    He says that Microsoft's Copilot, which is built on GPT-4, could be seen as performing better than its competitors because it's been adopted so widely.

    "That in itself is another sort of more general and implied evaluation metric."

    How can AIs hallucinate less?

    OpenAI is vague about what it's done to improve GPT's accuracy beyond "scaling up compute and data".

    But is this latest improvement (in one specific test) a signal that AIs will make fewer mistakes? Or is there a limit to how much they can improve?

    The problem with simply adding more training data to an LLM is that data is not necessarily accurate, according to Professor Webb.

    "People write weird stuff," he says.

    Professor de Silva says the current model of improving LLMs — add more data and more computing power — can't keep improving them indefinitely.

    "Maybe late last year, the AI companies had consumed all useful data available for training a large language model," he says.

    "That means there is a significant drawback on new capabilities for LLMs."

    Late last year, various news and tech outlets started reporting industry whispers that AI models were hitting a wall, and reaching a point where putting in more resources did not make a better LLM.

    It's a suggestion rejected by OpenAI CEO Sam Altman, who posted "there is no wall" on X in November 2024.

    However, Professor de Silva thinks companies riding the AI boom are simply slow to admit to the wall's existence.

    "I think we've hit the wall in terms of building such large models," he says.

    "The next jump will be in a completely new, innovative way of learning from large datasets."

    Could you make an AI that never hallucinated?

    Whether or not accuracy is improving, generative AI in its current format will never be hallucination-free.

    And this isn't just because they're fed on sometimes-inaccurate data, Professor Webb says.

    "These systems can't be trained to tell the truth all the time, because we don't know what the truth is for some things."

    When asked if there was a God, ChatGPT responded by saying there was a "range of perspectives" and then asked what the user thought.

    Plenty of less existentially challenging questions can also be difficult to answer accurately — particularly when they're politically or culturally charged.

    For instance, when asked about the body of water off the coast of Texas, ChatGPT called it the Gulf of Mexico. In this case, it didn't acknowledge US President Donald Trump's recent executive order to rename it the "Gulf of America".

    Hallucinations are often required

    Dr Shafiabady points out that often users want generative AI to hallucinate. All AI-generated pictures are hallucinations, for instance.

    "Generating the information is something that we want it to do. We don't want it to be a search engine," she says.

    If you want a model that's capable of generating things that don't already exist in its dataset, you can't stop it from making things up. A model that only ever told you accurate facts could not, for instance, suggest names for a new business, or draft a personalised meal or exercise plan.

    The word "hallucination" has been called into question by various people — perhaps most provocatively by a trio of UK researchers last year. They suggested that all LLMs produce "bullshit" in a technical sense: information without regard to its accuracy.

    But other generative AI models are under construction. OpenAI has released other models, called o1 and o3, which reason more than the word-based GPT models.

    Professor de Silva says that a combination of these two models, which might be what GPT-5 looks like, could ultimately make a more reliable chatbot.

    "It has to be GPT plus something else," he says.

    But a new model, built from the ground up, could still be vulnerable to problems.

    Professor Webb says that these systems naturally embody bias, culture and values.

    "Currently, the biases and cultures and values are North American.

    "A lot of effort is going into what's termed as 'removing the bias' from these systems, but that's actually about changing the bias to a bias that is palatable to most of the people they're trying to market the systems to."

    In the short-term — and quite possibly in the long-term as well — hallucinations are here to stay.


    ABC




    © 2025 ABC Australian Broadcasting Corporation. All rights reserved
