Exploring Prompting Methods' and External Tools' Impact on LLM Hallucinations
This paper explores how different prompting methods and the use of external tools affect the "hallucination" rate (generation of inaccurate or fabricated information) of Large Language Models (LLMs). The authors empirically evaluate various prompting strategies and agent frameworks on benchmark datasets to understand how to minimize these inaccuracies. (Barkley and van der Merwe, 2024)
Key Points
- Several prompting techniques, including Chain-of-Thought (CoT), Self-Consistency (SC), Tree-of-Thoughts (ToT), Multiagent Debate (MAD), Reflection, Chain-of-Verification (CoVe), Knowledge Graph-based Retrofitting (KGR), and DuckDuckGo Augmentation (DDGA), were implemented and tested using the Meta-Llama 3 8B model.
- These techniques were evaluated on benchmark datasets like Grade School Math 8K (GSM8K), TriviaQA, and Massive Multitask Language Understanding (MMLU) to assess their effectiveness in reducing hallucinations across different NLP tasks.
- The study also investigated the impact of tool-calling agents (LLMs augmented with external tools like Wikipedia, DuckDuckGo, and a Python interpreter) on hallucination rates, finding that while tools can be beneficial, they can also increase hallucinations if the model isn't sufficiently robust.
- The research indicates that the optimal prompting strategy is context-dependent, with simpler methods like Self-Consistency sometimes outperforming more complex ones.
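Self-Consistency, the simple method the study found competitive, amounts to sampling several reasoning paths and taking a majority vote over the final answers. A minimal sketch is below; the `sample_fn` callable stands in for a sampled LLM completion (here a scripted stub, since the paper's actual model calls are not shown):

```python
from collections import Counter

def self_consistency(prompt, sample_fn, n_samples=5):
    """Sample several answers for one prompt and return the majority vote.

    sample_fn: callable taking the prompt and returning one answer string,
    standing in for a temperature-sampled LLM completion.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # winning answer and its agreement rate

# Scripted stub simulating five sampled completions (one is inconsistent).
_samples = iter(["42", "42", "41", "42", "42"])
result, agreement = self_consistency("What is 6 * 7?", lambda p: next(_samples))
print(result, agreement)  # -> 42 0.8
```

The intuition is that hallucinated answers tend to vary across samples, while correct reasoning paths converge on the same answer, so majority voting filters out low-agreement outputs.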
The authors conclude that the effectiveness of different prompting strategies for mitigating LLM hallucinations varies depending on the specific task. While augmenting LLMs with external tools can extend their capabilities, it can also exacerbate hallucinations if the model's capacity is limited. Further research is suggested to explore the combination of different prompting strategies and to evaluate the hallucination rates of more advanced LLMs when using external tools.
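The tool-calling agents the paper evaluates follow the familiar loop of letting the model either invoke a tool (e.g. a search engine or Python interpreter) or emit a final answer. The sketch below is an illustrative, heavily simplified version of that loop, not the authors' implementation; the LLM and the `search` tool are scripted stubs:

```python
def run_tool_agent(question, llm, tools, max_steps=3):
    """Minimal tool-calling loop: the LLM either calls a tool or answers.

    llm: callable taking the running context and returning either
    ("tool", name, argument) or ("answer", text).
    tools: dict mapping tool names to callables.
    """
    context = question
    for _ in range(max_steps):
        action = llm(context)
        if action[0] == "answer":
            return action[1]
        _, name, arg = action
        observation = tools[name](arg)
        # Feed the tool output back so the next step can use it.
        context += f"\nObservation from {name}: {observation}"
    return None  # no answer within the step budget

# Scripted stub: one search call, then a final answer.
scripted = iter([("tool", "search", "capital of France"),
                 ("answer", "Paris")])
tools = {"search": lambda q: "Paris is the capital of France."}
print(run_tool_agent("What is the capital of France?",
                     lambda ctx: next(scripted), tools))  # -> Paris
```

The paper's caveat maps directly onto this loop: each extra observation enlarges the context the model must interpret, so a model with limited capacity can misread tool output and hallucinate more, not less.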