Alibaba Proposes Novel "Expert Divergence" Strategy for MoE Models to Combat Homogenization

Deep News · 02-28 18:11

Mixture of Experts (MoE) has become the standard architecture for today's large language models, from GPT-5 to DeepSeek-V3, but a new challenge has been identified in these systems. During pre-training, a phenomenon known as "Expert Homogenization" occurs: the dozens of experts within a model end up performing similar tasks instead of specializing as intended. This wastes model parameters and limits the scaling potential of MoE systems.

A research team has identified that the root cause is a lack of information guidance during the MoE pre-training process. To address this issue, a novel training strategy called Expert Divergence Learning has been introduced. This approach leverages the natural "domain labels" present in pre-training data to design a new auxiliary loss function. The function encourages tokens from different domains to exhibit distinct routing patterns, thereby guiding the experts to develop genuine specialized capabilities. This research, titled "Expert Divergence Learning for MoE-based Language Models," has been accepted at ICLR 2026.

The core insight challenges the assumption that overall diversity equals effective specialization. The team revealed a mathematical oversight in traditional MoE training. Existing load-balancing losses aim to increase total routing diversity but do so indiscriminately. They ensure all experts are utilized but ignore whether they are being used by the appropriate tokens, analogous to a company rewarding busywork without ensuring meaningful, distinct tasks are being performed.
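To make the criticism concrete, here is a minimal sketch of a standard load-balancing auxiliary loss in the style popularized by Switch Transformer (the paper's baseline is not specified, so this is an illustrative assumption, and `load_balancing_loss` is a hypothetical name):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """Switch-Transformer-style auxiliary loss. It rewards uniform
    expert utilization but is blind to *which* tokens use which expert.

    router_probs: (num_tokens, num_experts) softmax outputs of the router
    expert_assignment: (num_tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # P_i: mean router probability mass assigned to expert i
    P = router_probs.mean(axis=0)
    # Minimized (value 1.0) whenever load is uniform -- regardless of
    # whether the routing correlates with token domain at all.
    return num_experts * float(np.dot(f, P))
```

Note that a router that scatters tokens from the same domain uniformly at random achieves the same minimal loss as one that cleanly separates domains, which is exactly the indiscriminateness the article describes.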

The proposed method argues that true expert specialization should be built upon "domain differences." The goal is to mathematically steer the total routing diversity towards "inter-domain divergence." This led to the creation of the Expert Divergence Loss (EDL), a plug-and-play training objective. The design is based on a mathematical intuition: the total routing diversity can be deconstructed.

A key formula underpins the theory: Total Diversity (D_total) = Inter-Domain Diversity (D_inter) + Intra-Domain Diversity (D_intra). Traditional methods blindly maximize D_total. Without proper guidance, models tend to increase D_intra by having tokens from the same domain routed randomly, rather than increasing D_inter by separating tokens from different domains. The new EDL method precisely targets and maximizes D_inter. It creates a "repulsive force" between different domains, allocating the diversity budget to inter-domain differences and forcing functional divergence among experts.
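The decomposition above can be illustrated with a variance-based version of the identity (the law of total variance); the paper's exact definitions of D_total, D_inter, and D_intra are not given in the article, so this is an assumed, illustrative formulation:

```python
import numpy as np

def diversity_decomposition(routing, domains):
    """Split total routing diversity into inter- and intra-domain parts,
    mirroring D_total = D_inter + D_intra (illustrative variance form).

    routing: (num_tokens, num_experts) routing distributions
    domains: (num_tokens,) integer domain label per token
    """
    grand_mean = routing.mean(axis=0)
    # D_total: mean squared deviation of each token's routing vector
    d_total = ((routing - grand_mean) ** 2).sum(axis=1).mean()
    d_inter, d_intra, n = 0.0, 0.0, len(routing)
    for d in np.unique(domains):
        group = routing[domains == d]
        w = len(group) / n                      # domain weight
        mu = group.mean(axis=0)                 # domain-average routing
        # D_inter: how far each domain's average routing sits from the grand mean
        d_inter += w * ((mu - grand_mean) ** 2).sum()
        # D_intra: routing spread *within* the domain
        d_intra += w * ((group - mu) ** 2).sum(axis=1).mean()
    return d_total, d_inter, d_intra
```

Under this formulation the identity holds exactly: the same D_total can be reached either by noisy within-domain routing (high D_intra) or by cleanly separated domain-level routing (high D_inter), which is the budget-allocation argument the article makes.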

The loss calculation involves three steps. First, aggregation: the algorithm computes the average routing distribution for tokens belonging to different domains within a training batch. Second, divergence computation: the Jensen-Shannon (JS) Divergence is used to measure the difference between the average routing distributions of different domains. A low JS divergence indicates expert overlap, while a high value indicates distinct expert groups. Third, optimization: the EDL's final objective is to maximize the JS divergence between all domain pairs. This introduces a strong repulsive signal into the gradient descent process, forcing the model to learn a routing strategy highly aligned with semantics.
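The three steps above can be sketched as follows. This is a minimal reconstruction from the article's description, not the paper's implementation; the function name `expert_divergence_loss` and the uniform pairwise averaging are assumptions:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def expert_divergence_loss(router_probs, domains):
    """Hypothetical EDL sketch.
    Step 1 (aggregation): mean routing distribution per domain.
    Step 2 (divergence): pairwise JS divergence between domain means.
    Step 3 (optimization): negate, so minimizing the loss maximizes
    inter-domain divergence -- the 'repulsive force' between domains.
    """
    labels = np.unique(domains)
    means = [router_probs[domains == d].mean(axis=0) for d in labels]
    total, pairs = 0.0, 0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            total += js_divergence(means[i], means[j])
            pairs += 1
    return -total / max(pairs, 1)
```

Because the loss only touches the low-dimensional domain-average router outputs, its cost is negligible next to the model's forward pass, consistent with the efficiency claims later in the article.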

An experiment was conducted to determine whether finer-grained specialization yields better results. Two domain label systems were tested: a coarse-grained system with 3 classes and a fine-grained system with 49 specific topics. Results demonstrated a clear "granularity scaling law": models trained with the 49-class labels performed significantly better. This indicates that providing more specific guidance on the division of labor among experts leads to stronger emergent specialized capabilities in the MoE model.

The research team conducted extensive pre-training from scratch on models of 3B, 8B, and 15B parameters, using 100 billion tokens. In pre-training loss comparisons, the Expert Divergence Learning method showed stable and significant improvements in language modeling loss. The models equipped with this new strategy comprehensively outperformed standard MoE baselines across seven major benchmarks. On the 15B model, the fine-grained strategy led to an average score improvement of over one percentage point, a substantial gain in pre-training that typically equates to a difference of tens of billions of tokens.

To visually demonstrate the effectiveness, the team used ternary simplex plots. In baseline models, points representing expert activation were clustered in the center, indicating that similar experts were activated regardless of the input domain. In contrast, models using the new method showed points diverging towards the vertices of the triangle, proving that experts for different domains had become distinct specialized groups.

Notably, the EDL calculation is computationally lightweight, involving only low-dimensional vector operations on router outputs. Experimental data showed that compared to standard MoE, the new method caused almost no drop in training throughput and introduced zero additional inference cost.

In summary, this work on Expert Divergence Learning does not rely on increasing computational power or altering model architecture. Instead, it rethinks the definition of an "expert" in MoE models by addressing the mathematical essence of the loss function. It demonstrates that leveraging the inherent "domain structure" within data as a supervisory signal is an efficient way to unlock the potential of MoE systems. Furthermore, this training paradigm, which fully utilizes the "multi-dimensional structural information" of corpora, may help overcome pre-training bottlenecks in an era of increasingly scarce high-quality data, pointing towards a new dimension for scaling.

