April 16, 2024


Chatbots have an alarming propensity to generate false information yet present it as accurate. This phenomenon, known as AI hallucinations, has various adverse effects. At best, it restricts the benefits of artificial intelligence. At worst, it can cause real-world harm to people.

As generative AI enters the mainstream, the alarm bells are ringing louder. In response, a team of European researchers has been vigorously experimenting with remedies. Last week, the team unveiled a promising solution. They say it can reduce AI hallucinations to single-figure percentages.

The system is the brainchild of Iris.ai, an Oslo-based startup. Founded in 2015, the company has built an AI engine for understanding scientific text. The software scours vast quantities of research data, which it then analyses, categorises, and summarises.  

Customers include the Finnish Food Authority. The government agency used the system to accelerate research on a potential avian flu crisis. According to Iris.ai, the platform saves 75% of a researcher’s time.

What doesn’t save their time is AI hallucinating.

Today’s large language models (LLMs) are notorious for spitting out nonsensical and false information. Endless examples of these outputs have emerged in recent months.

Sometimes the inaccuracies cause reputational damage. At the launch demo of Microsoft Bing AI, for instance, the system produced an error-strewn analysis of Gap’s earnings report.

At other times, the erroneous outputs can be more harmful. ChatGPT can spout dangerous medical recommendations. Security analysts fear the chatbot’s hallucinations could even steer software developers towards malicious code packages.

“Unfortunately, LLMs are so good in phrasing that it is hard to distinguish hallucinations from factually valid generated text,” Iris.ai CTO Victor Botev tells TNW. “If this issue is not overcome, users of models will have to dedicate more resources to validating outputs rather than generating them.”

AI hallucinations are also hampering AI’s value in research. In an Iris.ai survey of 500 corporate R&D workers, only 22% of respondents said they trust systems like ChatGPT. Nonetheless, 84% of them still use ChatGPT as their primary AI tool to support research. Eek.

These problematic practices spurred Iris.ai’s work on AI hallucinations.

Iris.ai uses several methods to measure the accuracy of AI outputs. The most crucial technique is validating factual correctness. 

“We map out the key knowledge concepts we expect to see in a correct answer,” Botev says. “Then we check if the AI’s answer contains those facts and whether they come from reliable sources.”
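Iris.ai hasn’t published this pipeline, but the idea Botev describes — checking an answer against a map of expected knowledge concepts — can be sketched with a toy keyword matcher. Everything here (the function, the concept map, the scoring) is illustrative, not the company’s actual method:

```python
# Hypothetical sketch of concept-based fact validation. Iris.ai's real
# system is proprietary; this toy version just checks alias overlap.

def validate_answer(answer: str, expected_concepts: dict[str, list[str]]) -> float:
    """Score how many expected knowledge concepts appear in the answer.

    expected_concepts maps each concept to the aliases that count as a match.
    Returns the fraction of concepts covered (0.0 to 1.0).
    """
    text = answer.lower()
    covered = sum(
        1 for aliases in expected_concepts.values()
        if any(alias.lower() in text for alias in aliases)
    )
    return covered / len(expected_concepts) if expected_concepts else 0.0

# Illustrative example: checking an answer about avian flu
concepts = {
    "pathogen": ["H5N1", "avian influenza"],
    "host": ["poultry", "wild birds"],
}
score = validate_answer("H5N1 spreads rapidly among poultry flocks.", concepts)
print(score)  # 1.0 — both expected concepts are present
```

A production system would also trace each matched fact back to its source, as Botev notes, rather than stop at presence in the text.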

A secondary technique compares the AI-generated response to a verified “ground truth.” Using a proprietary metric dubbed WISDM, the software scores the AI output’s semantic similarity to the ground truth. This covers checks on the topics, structure, and key information. 
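WISDM itself is proprietary, so the scoring below is only a generic stand-in: a bag-of-words cosine similarity between the AI output and the ground truth, using nothing beyond the standard library. The example texts are invented for illustration:

```python
# Generic lexical-similarity stand-in for a ground-truth comparison.
# This is NOT WISDM — Iris.ai's metric is proprietary and also checks
# topics, structure, and key information.
import math
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase and strip trailing punctuation from each word."""
    return [w.strip(".,;:!?") for w in text.lower().split()]


def similarity(answer: str, ground_truth: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(_tokens(answer)), Counter(_tokens(ground_truth))
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


ground_truth = "H5N1 spreads primarily through contact with infected poultry."
answer = "Contact with infected poultry is the main way H5N1 spreads."
print(round(similarity(answer, ground_truth), 2))  # 0.67
```

A real semantic comparison would use sentence embeddings rather than raw word counts, so that paraphrases score highly even without shared vocabulary.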

Another method examines the coherence of the answer. To do this, Iris.ai ensures the output incorporates relevant subjects, data, and sources for the question at hand — rather than unrelated inputs.
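As a rough illustration of such a coherence check — again, not Iris.ai’s actual method — one could flag answer sentences that share no content words with the question, a crude signal that unrelated material has drifted in:

```python
# Hypothetical coherence check: flag sentences with zero content-word
# overlap with the question. The stopword list and splitting are toy
# simplifications, not Iris.ai's approach.

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for"}


def content_words(text: str) -> set[str]:
    """Lowercased words minus stopwords and trailing punctuation."""
    return {w.strip(".,;:!?") for w in text.lower().split()} - STOPWORDS


def incoherent_sentences(question: str, answer: str) -> list[str]:
    """Return answer sentences sharing no content words with the question."""
    q_words = content_words(question)
    return [
        s.strip() for s in answer.split(".")
        if s.strip() and not content_words(s) & q_words
    ]


q = "How does avian flu spread between farms?"
a = "Avian flu spreads via contaminated equipment. The Eiffel Tower is in Paris."
print(incoherent_sentences(q, a))  # ['The Eiffel Tower is in Paris']
```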

The combination of techniques creates a benchmark for factual accuracy.

“The key for us is not just returning any response, but returning responses that closely match what a human expert would say,” Botev says.

Iris.ai founders (left to right) Maria Ritola, Jacobo Elosua, Anita Schjøll Abildgaard, and Victor Botev