I’m going on a staycation this weekend, but I wanted to get this out so I’m not distracted: llama-3-MOE.
This is a departure from previous MoEs I’ve done. This one uses @deepseek_ai’s MoE architecture, not Mixtral’s. There is no semantic routing, and there is no gate. All 4