Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Authors: aithemes.net

Introduction

Large Language Models (LLMs) have made remarkable strides in recent years, achieving state-of-the-art performance across a wide range of benchmarks. However, the performance of individual models is often constrained by their training data and architectural limitations. To address this, ensemble methods like Mixture-of-Agents (MoA) have been proposed, which combine outputs from multiple models to leverage their collective strengths. The underlying assumption is that diversity in model outputs can lead to better overall performance.

This paper challenges that assumption by asking: is mixing different LLMs truly beneficial? We introduce Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM, selected per task according to its performance on benchmarks such as AlpacaEval 2.0, MMLU, CRUX, and MATH. Through extensive experimentation, we demonstrate that Self-MoA often outperforms standard MoA, achieving significant improvements across multiple benchmarks. We also analyze the trade-off between diversity and quality in MoA settings and identify the rare scenarios where mixing different LLMs is advantageous.

Key Findings

Our research yields several critical insights:

  1. Self-MoA Outperforms Standard MoA: Self-MoA achieves a 6.6% improvement over standard MoA on the AlpacaEval 2.0 benchmark and an average improvement of 3.8% across benchmarks like MMLU, CRUX, and MATH.

  2. Sensitivity to Model Quality: The performance of MoA is highly sensitive to the quality of the models being mixed. Mixing different LLMs often results in a lower average quality of outputs.

  3. State-of-the-Art Performance: When applied to one of the top-ranking models in AlpacaEval 2.0, Self-MoA achieves state-of-the-art performance on the leaderboard.

  4. Sequential Self-MoA: We introduce a sequential version of Self-MoA that aggregates outputs over multiple rounds. This approach is as effective as aggregating all outputs at once, offering flexibility in real-time applications.

  5. Rare Benefits of Mixing LLMs: While mixing different LLMs can be beneficial in scenarios where models have complementary strengths, such cases are rare.

Understanding Self-MoA

Self-MoA is a novel ensemble method that focuses on aggregating outputs from only the single top-performing LLM. This approach contrasts with standard MoA, which combines outputs from multiple diverse models. The key idea behind Self-MoA is to prioritize quality over diversity, as our experiments reveal that the quality of outputs is a more critical factor in achieving superior performance.

In the Self-MoA approach, multiple outputs are generated from a single LLM by leveraging in-model diversity through repeated sampling: the same prompt is sampled several times, and the stochasticity of decoding, controlled by parameters such as temperature, yields varied responses. By aggregating these diverse outputs from the same model, Self-MoA improves performance without requiring responses from different models.
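As a rough illustration of this procedure, the sketch below samples one model several times and then synthesizes the candidates in a single aggregation pass. The callables `generate_fn` and `aggregate_fn`, as well as the sample count and temperature, are placeholders introduced for this post, not the paper's implementation.

```python
from typing import Callable, List

def self_moa(
    prompt: str,
    generate_fn: Callable[[str, float], str],       # hypothetical: (prompt, temperature) -> one sampled response
    aggregate_fn: Callable[[str, List[str]], str],  # hypothetical: (prompt, candidates) -> synthesized answer
    num_samples: int = 6,
    temperature: float = 0.7,
) -> str:
    """Aggregate repeated samples from a single top-performing model."""
    # In-model diversity: sampling the same model repeatedly at temperature > 0
    # yields varied responses to the same prompt.
    candidates = [generate_fn(prompt, temperature) for _ in range(num_samples)]
    # One aggregation pass synthesizes the candidates into a final answer.
    return aggregate_fn(prompt, candidates)
```

In practice, `aggregate_fn` would typically be another call to the same model with an instruction to synthesize the candidate responses into a single, higher-quality answer.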

Why Does Self-MoA Work?

  1. Quality Over Diversity: By focusing on the top-performing model, Self-MoA ensures that the aggregated outputs maintain a high level of quality. This approach avoids the dilution of performance that can occur when mixing outputs from lower-quality models.

  2. Reduced Complexity: Self-MoA simplifies the ensemble pipeline by eliminating the need to balance the strengths and weaknesses of multiple models; only a single model has to be served and prompted, which makes the approach easier to deploy and tune.

  3. Scalability: The sequential version of Self-MoA allows for on-the-fly aggregation of outputs over multiple rounds, making it highly scalable and adaptable to real-world applications.

Trade-Offs Between Diversity and Quality

One of the central themes of this paper is the trade-off between diversity and quality in ensemble methods. While diversity can theoretically enhance performance by combining the strengths of different models, our findings suggest that quality is often the more critical factor.

Key Observations:

  • Diversity Can Lower Quality: Mixing outputs from different LLMs often results in a lower average quality, as the strengths of individual models may not align effectively.

  • Complementary Strengths Are Rare: Scenarios where mixing different LLMs leads to significant performance improvements are uncommon. In most cases, the benefits of diversity are outweighed by the drawbacks of reduced quality.

Sequential Self-MoA: A Flexible Approach

To make aggregation practical when a large number of outputs must be combined, we introduce a sequential version of Self-MoA. Rather than passing every response to the aggregator in a single step, this variant aggregates outputs over multiple rounds, making it highly adaptable to real-time applications.
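A minimal sketch of how such round-by-round aggregation could look is given below, reusing the placeholder `generate_fn` and `aggregate_fn` from the earlier sketch; the round and batch sizes are illustrative assumptions, not the paper's settings.

```python
from typing import Callable, List, Optional

def sequential_self_moa(
    prompt: str,
    generate_fn: Callable[[str, float], str],       # hypothetical single-model sampler
    aggregate_fn: Callable[[str, List[str]], str],  # hypothetical synthesizer over candidates
    rounds: int = 3,
    samples_per_round: int = 2,
    temperature: float = 0.7,
) -> Optional[str]:
    """Aggregate single-model outputs round by round instead of all at once."""
    synthesis: Optional[str] = None
    for _ in range(rounds):
        # Draw a small batch of fresh samples from the same model each round.
        batch = [generate_fn(prompt, temperature) for _ in range(samples_per_round)]
        # Carry the previous round's synthesis forward as one more candidate, so the
        # aggregator only ever sees a bounded number of responses at a time.
        candidates = batch if synthesis is None else [synthesis] + batch
        synthesis = aggregate_fn(prompt, candidates)
    return synthesis
```

Because each round only adds a small batch on top of the running synthesis, new outputs can be folded in as they become available instead of waiting for all samples up front.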

Advantages of Sequential Self-MoA:

  1. Real-Time Aggregation: Sequential Self-MoA can aggregate outputs on-the-fly, making it suitable for dynamic environments where immediate responses are required.

  2. Scalability: This approach can handle a large number of LLM outputs without compromising performance, offering a scalable solution for complex tasks.

  3. Consistency: Sequential Self-MoA achieves performance comparable to aggregating all outputs at once, ensuring consistent results across different scenarios.

Conclusion

This paper challenges the conventional wisdom that mixing different LLMs is always beneficial. Through the introduction of Self-MoA, we demonstrate that aggregating outputs from only the top-performing LLM can lead to superior performance in many scenarios. Our findings highlight the importance of prioritizing quality over diversity in ensemble methods and provide valuable insights into the trade-offs involved.

Mixing different LLMs can still be advantageous when the models have genuinely complementary strengths, but such cases are infrequent in practice. The sequential version of Self-MoA offers a flexible and scalable way to aggregate outputs, making it a promising option for real-world applications.

Source(s)

Original Research Paper: Rethinking Mixture-of-Agents