Comparing Mistral LLM Models: Which One Excels in RAG Systems?
A comparative study of Retrieval-Augmented Generation (RAG) systems using various Mistral AI models, with scoring conducted by a Mistral large model.
Introduction
In this post, I explore the comparative performance of Retrieval-Augmented Generation (RAG) systems by testing them with different Mistral AI models. My objective is to analyze the quality of responses generated by these models when integrated with a naive RAG setup.
RAG systems are an increasingly popular approach to leveraging large language models (LLMs) by combining them with external knowledge sources. This setup allows models to retrieve relevant information from a database and generate informed responses based on the retrieved documents.
RAG Setup
For this experiment, I used the ParentDocumentRetriever from LangChain, paired with a FAISS vector store as the document store. The retriever is a crucial component of the RAG pipeline: it fetches the relevant documents that the LLM then uses to generate responses.
As domain knowledge, I utilized a set of articles on RAG from arXiv, provided in PDF format. These articles were processed and incorporated into the FAISS vector store to create a comprehensive knowledge base for retrieval during the experiment.
The retrieval pipeline was kept intentionally simple with a naive RAG setup to focus on the effectiveness of different Mistral AI models in processing the retrieved content and answering questions accurately.
ParentDocumentRetriever: Balancing Specificity and Context
The ParentDocumentRetriever balances two goals when splitting documents: creating small chunks for accurate embeddings and keeping enough context for meaningful retrieval. It first retrieves the smaller chunks, whose embeddings capture precise meanings, and then looks up their parent documents to return the larger context. This keeps retrieval both specific and contextually rich, without losing important information.
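The small-chunk, large-context idea can be illustrated with a toy sketch. This is not LangChain's implementation and not the configuration used in this study: the class `ToyParentRetriever`, the word-count "embedding", and the `chunk_words` parameter are all hypothetical stand-ins for a real embedding model and a FAISS index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a dense embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_into_chunks(text: str, chunk_words: int) -> list[str]:
    """Split a document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


class ToyParentRetriever:
    """Hypothetical illustration: search over small chunks, return whole parent documents."""

    def __init__(self, chunk_words: int = 8):
        self.chunk_words = chunk_words
        self.parents: dict[int, str] = {}                # parent id -> full document
        self.children: list[tuple[Counter, int]] = []    # (chunk embedding, parent id)

    def add_documents(self, docs: list[str]) -> None:
        for doc in docs:
            parent_id = len(self.parents)
            self.parents[parent_id] = doc
            for chunk in split_into_chunks(doc, self.chunk_words):
                self.children.append((embed(chunk), parent_id))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        query_vec = embed(query)
        # Rank the small chunks by similarity to the query.
        ranked = sorted(self.children, key=lambda c: cosine(query_vec, c[0]), reverse=True)
        # Map the best chunks back to their parents, skipping duplicates.
        parent_ids: list[int] = []
        for _, parent_id in ranked:
            if parent_id not in parent_ids:
                parent_ids.append(parent_id)
            if len(parent_ids) == k:
                break
        return [self.parents[pid] for pid in parent_ids]
```

Matching happens at the chunk level, but the caller always receives the full parent text, which is the trade-off described above.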
Mistral AI Models Compared
The following Mistral models were executed through the Mistral AI LLM API endpoint and compared based on their ability to generate coherent, accurate, and contextually relevant answers:
- mistral-large-2407
- mistral-small-2409
- open-mixtral-8x22b
- open-mixtral-8x7b
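The results report a response time for each model alongside its score. A minimal sketch of how such timings could be collected is shown here; the helper names `timed_call` and `run_models` are hypothetical, and each model is represented by a plain callable rather than a real API client, so this does not reflect the exact measurement code used in the study.

```python
import time
from typing import Callable


def timed_call(generate: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call one model and return its answer with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    answer = generate(prompt)
    return answer, time.perf_counter() - start


def run_models(models: dict[str, Callable[[str], str]], prompt: str) -> dict[str, dict]:
    """Send the same prompt to every model and record answer and latency."""
    results = {}
    for name, generate in models.items():
        answer, seconds = timed_call(generate, prompt)
        results[name] = {"answer": answer, "seconds": round(seconds, 2)}
    return results


if __name__ == "__main__":
    # Placeholder callables; a real run would wrap calls to the hosted models.
    stub_models = {
        "mistral-large-2407": lambda p: "stub answer from large",
        "open-mixtral-8x7b": lambda p: "stub answer from 8x7b",
    }
    for name, result in run_models(stub_models, "Describe a taxonomy of RAG systems").items():
        print(name, result["seconds"], "s")
```

Wall-clock time measured this way includes network latency, so it reflects the end-to-end experience rather than pure model speed.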
Judging Criteria
To keep the evaluation of output quality consistent across models, I used a Mistral large model as the judge. It was tasked with scoring the responses generated by the different models against the following criteria:
- Relevance: How well the response aligns with the retrieved documents.
- Coherence: The clarity and logical flow of the response.
- Accuracy: Whether the facts presented in the response match the context provided by the retrieved documents.
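The three criteria above could be passed to a judge model along these lines. This is a sketch under assumptions: the prompt wording, the 0 to 10 scale, and the `Score: <number>` reply format are hypothetical and are not taken from the study's actual configuration.

```python
import re
from typing import Optional

# Criteria as listed in this post.
CRITERIA = {
    "Relevance": "How well the response aligns with the retrieved documents.",
    "Coherence": "The clarity and logical flow of the response.",
    "Accuracy": "Whether the facts in the response match the retrieved context.",
}


def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Assemble a grading prompt (hypothetical wording) for the judge model."""
    criteria_text = "\n".join(f"- {name}: {desc}" for name, desc in CRITERIA.items())
    return (
        "You are grading an answer produced by a RAG system.\n"
        f"Criteria:\n{criteria_text}\n\n"
        f"Question: {question}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Answer to grade:\n{answer}\n\n"
        "Reply with a line of the form 'Score: <number from 0 to 10>' "
        "followed by a short justification."
    )


def parse_score(judge_reply: str) -> Optional[float]:
    """Extract the numeric score from the judge's reply, or None if absent or out of range."""
    match = re.search(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", judge_reply, re.IGNORECASE)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0.0 <= score <= 10.0 else None
```

Returning `None` for a missing or out-of-range score lets the caller retry or flag the reply instead of silently recording a bad value.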
RAG Chain
The following workflow diagram illustrates the naive RAG chain used in this study.
Diagram generated using Excalidraw integration with Mermaid.
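In code, a naive chain of this kind reduces to three steps: retrieve documents, place them in a prompt, and ask the model. The sketch below assumes hypothetical `retrieve` and `generate` callables and an illustrative prompt template; it is not the exact chain or prompt used in this study.

```python
from typing import Callable

# Illustrative template (assumption); the study's real prompt is not shown in this post.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def naive_rag_chain(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve documents, build a single prompt from them, and generate an answer."""
    docs = retrieve(question)                     # step 1: fetch relevant documents
    context = "\n\n".join(docs)                   # step 2: concatenate them as context
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate(prompt)                       # step 3: let the LLM answer


if __name__ == "__main__":
    # Stub components so the sketch runs without any external service.
    fake_retrieve = lambda q: ["RAG combines retrieval with generation."]
    fake_generate = lambda prompt: f"(model output for a prompt of {len(prompt)} characters)"
    print(naive_rag_chain("What is RAG?", fake_retrieve, fake_generate))
```

Because the retriever and the model are passed in as plain callables, swapping one Mistral model for another only changes the `generate` argument.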
Conclusion
The mistral-large-2407 model scored at or above every other model on all six questions, making it the best choice for high-quality responses, although it was also the slowest, taking between 12 and 36 seconds per answer. The mistral-small-2409 model matched it on three of the six questions in less than half the time on average, but its scores dropped sharply on the last two questions (7.5 and 2.5). The open-mixtral models were faster than both Mistral models on four of the six questions and scored between 6.5 and 9.5, which makes them suitable for scenarios prioritizing speed over accuracy.
With only six questions evaluated, this study is too small to support general conclusions.
In future posts, I plan to provide more details on the configuration used for this study and also explore other advanced RAG approaches.
Results
The following results provide an overview of how different Mistral AI models performed on a series of questions related to Retrieval-Augmented Generation (RAG) systems. Each model was evaluated based on its ability to generate relevant, coherent, and accurate responses.
Question: Describe a taxonomy of RAG systems
- Model: mistral-large-2407, Score: 9.5, Time: 29.47 s
- Model: mistral-small-2409, Score: 9.5, Time: 13.81 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 7.98 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 4.26 s

Question: Explain in depth with all details what FlashRAG offers
- Model: mistral-large-2407, Score: 9.5, Time: 28.22 s
- Model: mistral-small-2409, Score: 9.5, Time: 15.98 s
- Model: open-mixtral-8x22b, Score: 9.5, Time: 9.73 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 7.79 s

Question: Give me pros and cons of RAG in comparison with model fine-tuning
- Model: mistral-large-2407, Score: 9.5, Time: 36.34 s
- Model: mistral-small-2409, Score: 9.5, Time: 13.14 s
- Model: open-mixtral-8x22b, Score: 9.5, Time: 11.6 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 8.75 s

Question: How does the context length limitation impact the retrieval process in RAG systems, and what strategies are used to mitigate it?
- Model: mistral-large-2407, Score: 9.5, Time: 18.55 s
- Model: mistral-small-2409, Score: 8.5, Time: 12.77 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 8.93 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 4.99 s

Question: What are the advantages and disadvantages of different document chunking strategies in RAG systems?
- Model: mistral-large-2407, Score: 9.5, Time: 24.51 s
- Model: mistral-small-2409, Score: 7.5, Time: 2.46 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 10.3 s
- Model: open-mixtral-8x7b, Score: 6.5, Time: 5.03 s

Question: How can RAG systems incorporate long-term memory to improve performance across multiple interactions?
- Model: mistral-large-2407, Score: 8.5, Time: 12.05 s
- Model: mistral-small-2409, Score: 2.5, Time: 2.34 s
- Model: open-mixtral-8x22b, Score: 8.5, Time: 6.09 s
- Model: open-mixtral-8x7b, Score: 8.5, Time: 5.71 s
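To compare the models at a glance, the per-question numbers above can be averaged. The short script below is an addition for illustration, not part of the original study code; the `RESULTS` dictionary simply transcribes the scores and times listed above, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# (score, seconds) per question, in the order the questions are listed above.
RESULTS = {
    "mistral-large-2407": [(9.5, 29.47), (9.5, 28.22), (9.5, 36.34), (9.5, 18.55), (9.5, 24.51), (8.5, 12.05)],
    "mistral-small-2409": [(9.5, 13.81), (9.5, 15.98), (9.5, 13.14), (8.5, 12.77), (7.5, 2.46), (2.5, 2.34)],
    "open-mixtral-8x22b": [(8.5, 7.98), (9.5, 9.73), (9.5, 11.6), (8.5, 8.93), (8.5, 10.3), (8.5, 6.09)],
    "open-mixtral-8x7b": [(8.5, 4.26), (8.5, 7.79), (8.5, 8.75), (8.5, 4.99), (6.5, 5.03), (8.5, 5.71)],
}


def summarize(results: dict) -> dict:
    """Compute mean score, minimum score, and mean response time per model."""
    summary = {}
    for model, runs in results.items():
        scores = [score for score, _ in runs]
        times = [seconds for _, seconds in runs]
        summary[model] = {
            "mean_score": mean(scores),
            "min_score": min(scores),
            "mean_seconds": mean(times),
        }
    return summary


if __name__ == "__main__":
    for model, stats in summarize(RESULTS).items():
        print(
            f"{model:20s} mean score {stats['mean_score']:.2f} "
            f"(min {stats['min_score']:.1f}), mean time {stats['mean_seconds']:.2f} s"
        )
```

On these six questions the mean scores come out to about 9.33 for mistral-large-2407, 7.83 for mistral-small-2409, 8.83 for open-mixtral-8x22b, and 8.17 for open-mixtral-8x7b, with mean response times of roughly 24.9 s, 10.1 s, 9.1 s, and 6.1 s respectively. With so few questions, a single low score (such as the 2.5 for mistral-small-2409) shifts the average considerably.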