Mixture of Experts (MoEs) in Transformers
By Jakub Antkiewicz
2026-02-27
The transformers library, a foundational tool in the AI development ecosystem, has undergone a significant re-engineering to natively support Mixture-of-Experts (MoE) architectures. The move directly addresses the industry's growing reliance on sparse models to scale language model capacity beyond the practical and financial limits of traditional dense designs. The updates respond to the recent proliferation of open MoE models such as Mixtral, DeepSeek, and Qwen, which posed operational challenges for tooling originally built around monolithic networks.
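The core idea behind these sparse models is that a small router selects a few experts per token, so only a fraction of the parameters run for any given input. The sketch below is a toy illustration of that top-k routing pattern in plain NumPy; the class and all its names are invented for this example and do not reflect the actual transformers implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TinyMoE:
    """Toy top-k MoE layer (illustrative only): a router scores experts
    per token, the top-k experts run, and their outputs are mixed by the
    renormalized router probabilities."""

    def __init__(self, d_model, n_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router: projects each token to one logit per expert.
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each expert here is just a linear map d_model -> d_model.
        self.experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02

    def __call__(self, x):                 # x: (tokens, d_model)
        probs = softmax(x @ self.router)   # (tokens, n_experts)
        # Indices of the k highest-probability experts per token.
        topk = np.argsort(-probs, axis=-1)[:, : self.top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            w = probs[t, topk[t]]
            w = w / w.sum()                # renormalize over chosen experts
            for weight, e in zip(w, topk[t]):
                out[t] += weight * (x[t] @ self.experts[e])
        return out

moe = TinyMoE(d_model=8, n_experts=4, top_k=2)
y = moe(np.ones((3, 8)))
print(y.shape)  # (3, 8)
```

With 4 experts and top-2 routing, each token touches only half the expert parameters, which is the source of the capacity-versus-compute trade-off discussed above.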
The technical overhaul focuses on resolving key bottlenecks MoEs introduce. A new weight loading pipeline uses a `WeightConverter` to efficiently merge separately stored expert weights into a single, packed tensor required for optimized GPU execution. Benchmarks on a model like Qwen-110B show this refactor reduces loading times from over 60 seconds to just 21 seconds on a single A100 GPU. The library now also features a pluggable backend system for routing and computation, alongside built-in support for expert parallelism, which simplifies the process of distributing a model’s hundreds of billions of parameters across multiple devices.
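The gist of the weight-packing step is to take per-expert tensors stored under separate checkpoint keys and stack them into one contiguous tensor that grouped GPU kernels can consume in a single pass. The following is a minimal sketch of that idea; the key names, shapes, and `pack_experts` helper are hypothetical and are not the real `WeightConverter` API or the actual transformers state-dict layout.

```python
import numpy as np

# Hypothetical checkpoint: each expert's weight lives under its own key,
# as many older MoE checkpoints store them. Key names are illustrative.
d_model, d_ff, n_experts = 8, 16, 4
state_dict = {
    f"experts.{i}.weight": np.random.default_rng(i).standard_normal((d_ff, d_model))
    for i in range(n_experts)
}

def pack_experts(sd, n_experts, prefix="experts"):
    """Stack per-expert weights into one (n_experts, d_ff, d_model) tensor,
    the packed layout that batched expert kernels expect."""
    return np.stack([sd[f"{prefix}.{i}.weight"] for i in range(n_experts)])

packed = pack_experts(state_dict, n_experts)
print(packed.shape)  # (4, 16, 8)
```

Doing this stacking once at load time, rather than indexing hundreds of small tensors during the forward pass, is what makes the optimized execution path (and the reported loading-time improvements) possible.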
By standardizing the handling of complex loading, quantization, and distributed execution, this foundational work lowers the operational friction of building and deploying large-scale sparse models. The changes make the primary benefit of MoEs, fast inference relative to their massive parameter counts, more accessible to the entire open-source community. This evolution in the core infrastructure signals a maturing ecosystem, solidifying MoE as a mainstream approach for building high-capacity language models and enabling developers to adopt the architectural strategies used by major industry labs.
The systematic re-architecting of core libraries like `transformers` to accommodate sparse models confirms that Mixture of Experts is no longer an experimental technique but a foundational pillar for the next generation of large-scale AI systems.