Comparing Mistral LLM Models: Which One Excels in RAG Systems?

A comparative study of Retrieval-Augmented Generation (RAG) systems using various Mistral AI models, with scoring conducted by a Mistral large model.

8 min read

Created: Oct 13 2024 · Last Update: Oct 13 2024
#Mistral · #Mistral AI · #Mistral API · #Mistral Large · #Mistral Open · #Mistral Small · #RAG


Introduction

In this post, I explore the comparative performance of Retrieval-Augmented Generation (RAG) systems by testing them with different Mistral AI models. My objective is to analyze the quality of responses generated by these models when integrated with a naive RAG setup.

RAG systems are an increasingly popular approach to leveraging large language models (LLMs) by combining them with external knowledge sources. This setup allows models to retrieve relevant information from a database and generate informed responses based on the retrieved documents.

RAG Setup

For this experiment, I used the ParentDocumentRetriever from Langchain, which pairs a FAISS vector store (holding the embedded child chunks) with an in-memory document store (holding the parent documents). The retriever is a crucial component of the RAG pipeline: it fetches the relevant documents that the LLM then uses to generate responses.

As domain knowledge, I used a set of articles on RAG from arXiv, provided in PDF format. These articles were processed and added to the FAISS vector store to form the knowledge base for retrieval during the experiment.

The retrieval pipeline was kept intentionally simple with a naive RAG setup to focus on the effectiveness of different Mistral AI models in processing the retrieved content and answering questions accurately.
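Below is a minimal ingestion sketch of this setup. It assumes the papers live in a local papers/ directory and that Mistral's mistral-embed model produces the embeddings; the post does not specify either, so treat both as placeholders.

```python
# Minimal ingestion sketch (assumptions: PDFs under ./papers, mistral-embed
# as the embedding model; neither is specified in the study).
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import FAISS
from langchain_mistralai import MistralAIEmbeddings

embeddings = MistralAIEmbeddings(model="mistral-embed")

# An empty FAISS index sized to the embedding dimension; the retriever
# populates it with child chunks later.
index = faiss.IndexFlatL2(len(embeddings.embed_query("probe")))
vectorstore = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Load the arXiv RAG papers from disk.
docs = PyPDFDirectoryLoader("papers/").load()
```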

ParentDocumentRetriever: Balancing Specificity and Context

The ParentDocumentRetriever balances two goals when splitting documents: creating small chunks whose embeddings are precise, and keeping enough surrounding text for meaningful retrieval. It works by first retrieving the smaller chunks, which carry precise meaning, and then looking up their parent documents to return the larger context. This keeps retrieval both specific and contextually rich, without losing important information.
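Continuing the sketch above, the retriever wires a small child splitter (for embeddings) to a larger parent splitter (for the context handed to the model). The chunk sizes below are illustrative assumptions, not values from the study.

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks are embedded for precise similarity search; their larger
# parents are what the LLM ultimately sees.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,      # FAISS store from the ingestion sketch
    docstore=InMemoryStore(),     # holds the parent documents
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
```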

Mistral AI Models Compared

The following Mistral models were queried through the Mistral AI API endpoint and compared on their ability to generate coherent, accurate, and contextually relevant answers (a minimal client sketch follows the list):

  • mistral-large-2407
  • mistral-small-2409
  • open-mixtral-8x22b
  • open-mixtral-8x7b
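A minimal client sketch for the four candidates, assuming the langchain-mistralai integration and a MISTRAL_API_KEY in the environment; temperature 0 is my assumption, not a documented setting of the study.

```python
from langchain_mistralai import ChatMistralAI

MODEL_IDS = [
    "mistral-large-2407",
    "mistral-small-2409",
    "open-mixtral-8x22b",
    "open-mixtral-8x7b",
]
# One chat client per candidate model; temperature=0 is an assumption.
llms = {name: ChatMistralAI(model=name, temperature=0) for name in MODEL_IDS}
```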

Judging Criteria

To keep the evaluation of output quality consistent, I used the Mistral Large model as the judge. It was tasked with scoring the responses generated by the different models against a set of criteria (an illustrative judge prompt appears after the list):

  • Relevance: How well the response aligns with the retrieved documents.
  • Coherence: The clarity and logical flow of the response.
  • Accuracy: Whether the facts presented in the response match the context provided by the retrieved documents.
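The exact judge prompt is not published, so the sketch below is only an illustration of the LLM-as-judge pattern: the three criteria are embedded in a template and mistral-large-2407 returns a single numeric score.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_mistralai import ChatMistralAI

judge = ChatMistralAI(model="mistral-large-2407", temperature=0)

# Hypothetical grading template; the study's actual wording may differ.
judge_prompt = ChatPromptTemplate.from_template(
    "Grade the answer on a 0-10 scale against these criteria:\n"
    "relevance to the retrieved context, coherence, and factual accuracy\n"
    "with respect to that context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}\n\n"
    "Reply with a single number."
)

def judge_answer(context: str, question: str, answer: str) -> str:
    """Return the judge's raw score string for one model answer."""
    return (judge_prompt | judge).invoke(
        {"context": context, "question": question, "answer": answer}
    ).content
```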

RAG Chain

The following workflow diagram illustrates the naive RAG chain used in this study.

Workflow chart

Diagram generated using Excalidraw integration with Mermaid.
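One plausible LCEL wiring of this chain, reusing the retriever and the llms dictionary from the sketches above; the prompt text is my assumption, not the one used in the study.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate the retrieved parent documents into one context string.
    return "\n\n".join(d.page_content for d in docs)

def build_chain(llm):
    # retrieve -> stuff context into the prompt -> model -> plain string
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

answer = build_chain(llms["mistral-large-2407"]).invoke(
    "Describe a taxonomy of RAG systems"
)
```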

Conclusion

The mistral-large-2407 model consistently outperformed the others in relevance, coherence, and accuracy, making it the best choice for high-quality responses, although it was also significantly slower than the smaller models. The mistral-small-2409 model offered good quality with faster response times, while the open-mixtral models suited scenarios that prioritize speed over accuracy.

This study evaluates only a small number of questions, which limits how far its conclusions generalize.

In future posts, I plan to provide more details on the configuration used for this study and also explore other advanced RAG approaches.

Results

The following results provide an overview of how different Mistral AI models performed on a series of questions related to Retrieval-Augmented Generation (RAG) systems. Each model was evaluated based on its ability to generate relevant, coherent, and accurate responses.

Question: Describe a taxonomy of RAG systems
- Model: mistral-large-2407, Score: 9.5, Time: 29.47 s
- Model: mistral-small-2409, Score: 9.5, Time: 13.81 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 7.98 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 4.26 s

Question: Explain in depth with all details what FlashRAG offers
- Model: mistral-large-2407, Score: 9.5, Time: 28.22 s
- Model: mistral-small-2409, Score: 9.5, Time: 15.98 s
- Model: open-mixtral-8x22b, Score: 9.5, Time: 9.73 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 7.79 s

Question: Give me pros and cons of RAG in comparison with model fine-tuning
- Model: mistral-large-2407, Score: 9.5, Time: 36.34 s
- Model: mistral-small-2409, Score: 9.5, Time: 13.14 s
- Model: open-mixtral-8x22b, Score: 9.5, Time: 11.6 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 8.75 s

Question: How does the context length limitation impact the retrieval process in RAG systems, and what strategies are used to mitigate it?
- Model: mistral-large-2407, Score: 9.5, Time: 18.55 s
- Model: mistral-small-2409, Score: 8.5, Time: 12.77 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 8.93 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 4.99 s

Question: What are the advantages and disadvantages of different document chunking strategies in RAG systems?
- Model: mistral-large-2407, Score: 9.5, Time: 24.51 s
- Model: mistral-small-2409, Score: 7.5, Time: 2.46 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 10.3 s
- Model: open-mixtral-8x7b, Score: 6.5, Time: 5.03 s

Question: How can RAG systems incorporate long-term memory to improve performance across multiple interactions?
- Model: mistral-large-2407, Score: 8.5, Time: 12.05 s
- Model: mistral-small-2409, Score: 2.5, Time: 2.34 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 6.09 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 5.71 s
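To summarize the listing, the short sketch below averages the scores and times transcribed from the output above (no new measurements).

```python
from statistics import mean

# Scores and times per model, transcribed from the six questions above.
results = {
    "mistral-large-2407": ([9.5, 9.5, 9.5, 9.5, 9.5, 8.5],
                           [29.47, 28.22, 36.34, 18.55, 24.51, 12.05]),
    "mistral-small-2409": ([9.5, 9.5, 9.5, 8.5, 7.5, 2.5],
                           [13.81, 15.98, 13.14, 12.77, 2.46, 2.34]),
    "open-mixtral-8x22b": ([8.5, 9.5, 9.5, 8.5, 8.5, 8.5],
                           [7.98, 9.73, 11.6, 8.93, 10.3, 6.09]),
    "open-mixtral-8x7b":  ([8.5, 8.5, 8.5, 8.5, 6.5, 8.5],
                           [4.26, 7.79, 8.75, 4.99, 5.03, 5.71]),
}
for model, (scores, times) in results.items():
    print(f"{model}: mean score {mean(scores):.2f}, mean time {mean(times):.2f} s")
```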
