In this Q&A, we speak with Nikil Patel about his work as Research Assistant on an Innovate UK-sponsored project focused on reducing hallucinations in Generative AI for conversational AI (cAI) applications. This collaborative project involved Algomo, YuLife, NatWest and the Universities of Edinburgh and Essex.
Nikil's role in the project involved researching hallucination detection techniques from academic literature, validating their performance with internal data, integrating them into the product’s workflow and overseeing the project's planning and execution.
Generative AI technology can sometimes generate incorrect or fabricated answers, known as 'hallucinations.' These errors can undermine customer trust and harm a company's reputation.
Our project approached this problem in two phases, covering both the detection and the mitigation of hallucinations.
This project was a perfect fit for Algomo, the lead participant, as it enhanced their core product, a cAI platform, and addressed the market demand for trustworthy and reliable generative cAI.
This project provided a dual-faceted experience, integrating perspectives from both industry and academia. In the industry setting, I engaged with real-world data and practical challenges, while academia offered exposure to advanced research and emerging innovations. This combination ensured that our approach was both research-driven and applicable in real-world business settings.
To detect hallucinations in responses generated by Large Language Models (LLMs, which are AI systems trained on vast amounts of text data to understand and generate human language), I used two primary methodologies: Natural Language Inference (NLI) models and LLMs as evaluators.
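For illustration, a minimal NLI-based check might look like the sketch below. It scores whether a generated answer is entailed by the retrieved context; the checkpoint name and the decision threshold are assumptions for the example, not the project's actual configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed public MNLI-style checkpoint; any NLI model with an "entailment"
# label can be dropped in the same way.
MODEL_NAME = "microsoft/deberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_probability(context: str, answer: str) -> float:
    """Probability that the generated answer is entailed by the retrieved context."""
    inputs = tokenizer(context, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    # Look up the "entailment" index from the model config instead of hard-coding it.
    entail_idx = next(i for i, label in model.config.id2label.items()
                      if label.lower() == "entailment")
    return probs[entail_idx].item()

context = "Our premium plan includes 24/7 phone support."
answer = "The premium plan comes with weekend-only phone support."
if entailment_probability(context, answer) < 0.5:  # threshold is an assumption
    print("Possible hallucination: the answer is not supported by the context.")
```

The premise/hypothesis framing is what makes NLI a natural fit here: the retrieved context plays the premise and the generated answer the hypothesis, so low entailment signals that the answer introduces unsupported claims.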
My work involved leveraging a range of AI models across both approaches, from dedicated NLI classifiers to general-purpose LLMs prompted to act as evaluators.
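The evaluator approach (often called LLM-as-a-judge) can be sketched roughly as follows; the provider, model name and prompt wording here are illustrative assumptions rather than the project's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment; the provider is an assumption

def judge(context: str, answer: str) -> str:
    """Ask a general-purpose LLM whether the answer is grounded in the context."""
    prompt = (
        "You are checking a customer-support answer for hallucinations.\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with SUPPORTED if every claim in the answer is backed by the context, "
        "otherwise reply with HALLUCINATION followed by a one-sentence reason."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge(
    "Claims are usually processed within 5 working days.",
    "Your claim will be refunded in full within 24 hours.",
))
```

In practice the judge's verdict is treated as one signal among several, since the evaluating model can itself be wrong.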
I also explored integrating these methods with traditional Natural Language Processing (NLP) techniques, including similarity matching and exact regex matching, to enhance detection accuracy. For detailed performance monitoring, I incorporated LangWatch, an open-source tool, and I utilised Label Studio, another open-source platform, to systematically annotate and evaluate the model outputs.
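Both of those traditional signals are cheap to compute. A minimal sketch, assuming an off-the-shelf sentence-embedding model (the model name and regex pattern are illustrative):

```python
import re
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any sentence-embedding checkpoint would work similarly.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(context: str, answer: str) -> float:
    """Cosine similarity between the context and answer embeddings."""
    vectors = embedder.encode([context, answer], convert_to_tensor=True)
    return util.cos_sim(vectors[0], vectors[1]).item()

def unsupported_numbers(context: str, answer: str) -> list[str]:
    """Exact regex check: numbers quoted in the answer that never appear in the context."""
    answer_numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    return [n for n in answer_numbers if n not in context]

context = "Claims are usually processed within 5 working days."
answer = "Your claim will be processed within 3 working days."
print(similarity(context, answer))           # high similarity despite the factual error
print(unsupported_numbers(context, answer))  # ['3'], a cheap hallucination signal
```

The example also shows why similarity alone is not enough: the two sentences sit close together in embedding space even though the answer invents a different number, which is exactly the kind of subtle error the exact-match check is meant to catch.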
The key finding from this project is that the existing hallucination detection methods, even when combined, do not perform as effectively in real-world applications as they do on established benchmarks. This highlights a critical gap between research-driven evaluations and real-world business needs.
To bridge this gap, the research community needs more complex, robust and industry-compatible datasets. This would help to accurately assess the effectiveness of hallucination detection methods and ensure their real-world applicability.
One of the most intriguing findings from my research is the growing consensus that LLM hallucinations may never be fully eliminated. As LLMs continue to advance in their capabilities, the nature of their hallucinations has also evolved—shifting from obvious, easily detectable errors to more subtle and elusive ones.
This presents an ongoing challenge for hallucination detection, requiring increasingly sophisticated methods to identify and mitigate these nuanced inaccuracies.
One of the key challenges was handling Algomo’s multilingual real-world data to create a suitable experimental dataset. This process was time-intensive, requiring meticulous curation and preprocessing to ensure data quality and relevance.
Despite the effort involved, this challenge was addressed by prioritising a well-structured, case-specific dataset, even if limited in size, to maintain high quality and ensure meaningful evaluation outcomes.
Our findings suggest two key areas for improvement: better datasets and improved AI reasoning.
The research community needs more robust, industry-relevant datasets to accurately benchmark the performance of existing hallucination detection methods. For effective hallucination mitigation, having reliable confidence scores and supplementary hallucination-related information is crucial.
Integrating advanced reasoning capabilities within LLMs may offer a promising direction for reducing and mitigating hallucinations, potentially leading to more reliable and context-aware outputs.
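To make the point about confidence scores concrete, here is a minimal sketch of how a per-response score from detectors like those above could gate mitigation in a support workflow; the threshold, field names and fallback wording are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DetectionResult:
    # Illustrative fields; a real detector would also carry evidence spans,
    # per-claim scores and other supplementary hallucination-related information.
    confidence: float                                    # probability the answer is supported
    flagged_claims: list[str] = field(default_factory=list)

def mitigate(answer: str, result: DetectionResult, threshold: float = 0.7) -> str:
    """Send the answer only if detection confidence clears the (assumed) threshold."""
    if result.confidence >= threshold:
        return answer
    if result.flagged_claims:
        return ("I'm not fully certain about: " + "; ".join(result.flagged_claims)
                + ". Let me connect you with a human agent to confirm.")
    return "I'm not confident in this answer, so I'll hand you over to a human agent."

# Example: a low-confidence answer gets routed to a safer response.
result = DetectionResult(confidence=0.42, flagged_claims=["refunds are issued within 3 days"])
print(mitigate("Refunds are issued within 3 days.", result))
```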
My work in this project provides valuable insights into hallucinations in real-world enterprise applications, offering a detailed analysis of their occurrence and impact.
By assessing the feasibility and limitations of existing hallucination detection methods, our research lays the foundation for developing more reliable AI-powered customer support systems.
These findings can guide future advancements in integrating more effective hallucination detection and mitigation strategies in enterprise applications, making AI-driven customer service smarter and more trustworthy.
For more information about the Institute for Analytics and Data Science (IADS) and its work, visit our web pages.