"Exo's use of Llama 405B and consumer-grade devices to run inference at scale on the edge shows that the future of AI is open source and decentralized." -
@mo_baioumy
This figure from the o1 research report makes the case for a setup like this much stronger. We can scale LLM capability with more inference-time compute. Search is sync-lite: you don't need an ultra-low-latency / high-bandwidth interconnect between your @__tinygrad__ tinyboxes.
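To see why search is sync-lite, here is a minimal best-of-N sketch: each node decodes its candidates fully independently, and only the short final answers are gathered for one cheap reduction. `generate_candidate` and `score` are hypothetical stand-ins (not exo or tinygrad APIs), and a thread pool stands in for separate boxes.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate_candidate(prompt: str, seed: int) -> str:
    # Placeholder for a full decode on one box; needs no cross-node traffic.
    rng = random.Random(seed)
    return f"{prompt} -> candidate-{rng.randint(0, 999)}"

def score(candidate: str) -> float:
    # Placeholder verifier / reward model; also runs locally.
    return random.Random(candidate).random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Fan out N independent generations (one per node in a real cluster),
    # then a single reduction picks the winner: bytes on the wire scale
    # with N short answers, not with model activations.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s), range(n)))
    return max(candidates, key=score)

print(best_of_n("Prove that 17 is prime."))
```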
On the power efficiency of Apple M chips: @exolabs_ with the MLX backend can get ~100% GPU utilisation out of each MacBook Pro M3 Max.

MacBook Pro M3 Max GPU:
- 128GB unified memory (~400GB/s bandwidth)
- ~30 TFLOPS of fp16 compute
- ~40 Watts of power draw
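A quick back-of-envelope from those numbers makes the efficiency point concrete. The decode-ceiling line assumes 4-bit weights (~202.5GB for 405B parameters), which is an assumption, not a measured figure.

```python
flops_fp16 = 30e12   # ~30 TFLOPS fp16
power_w    = 40.0    # ~40 W GPU power draw
mem_bw     = 400e9   # ~400 GB/s unified-memory bandwidth

print(f"compute efficiency: {flops_fp16 / power_w / 1e12:.2f} TFLOPS/W")  # ~0.75

# Single-stream decoding is memory-bound: each token reads all resident
# weights once, so bandwidth / weight-bytes upper-bounds tokens per second.
weights_bytes = 405e9 * 0.5  # assumption: 405B params at 4-bit quantization
print(f"decode ceiling, whole model on one box: {mem_bw / weights_bytes:.1f} tok/s")
```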
More progress in the @exolabs_ pytorch interface:
- Tested with Qwen2 Instruct, TinyLlama and Llama3 8b and was able to produce tokens
- Currently producing gibberish rather than coherent output. Reviewing my top_p, top_k and temp handling along with my logit_sample method (a reference sampler follows this list)
- Testing on
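For reference while debugging the sampler, here is a minimal temperature / top-k / top-p sampling routine in PyTorch. Two classic gibberish sources it guards against: applying temperature after filtering instead of before, and forgetting to renormalize the distribution after masking. `sample_token` is an illustrative name, not exo's logit_sample.

```python
import torch

def sample_token(logits: torch.Tensor, temp: float = 0.8,
                 top_k: int = 50, top_p: float = 0.95) -> int:
    # 1) Temperature first: scale logits before any filtering.
    logits = logits / max(temp, 1e-5)

    # 2) Top-k: drop everything below the k-th largest logit.
    if top_k > 0:
        kth = torch.topk(logits, min(top_k, logits.size(-1))).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    # 3) Top-p (nucleus): keep the smallest prefix of sorted tokens whose
    #    cumulative probability reaches top_p, zero out the rest.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = cumulative - sorted_probs > top_p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize after masking

    # 4) Sample from the filtered, renormalized distribution.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])

# Tiny smoke test with a fake 10-token vocabulary.
print(sample_token(torch.randn(10)))
```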
Managed to run Llama 3.1 405b on my flight. Uses @exolabs_ to distribute the AI model across 2 MacBooks.
GPT-4o-level model running offline, so I can play games, ask questions and use a coding assistant in the air.
Got 2 MacBooks with me running distributed Llama-3.1-405b on @exolabs_, ready for my flight.
Portable intelligence in the air, works fully offline with Thunderbolt 4 interconnect.
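Since exo exposes a ChatGPT-compatible HTTP API, the in-flight coding assistant can be any OpenAI-style client pointed at the local cluster. A minimal sketch; the port (52415) and the model id are assumptions that may differ on a given install.

```python
import json
import urllib.request

# Assumptions: exo's ChatGPT-compatible endpoint on localhost:52415 and the
# model id "llama-3.1-405b" - check your exo install for the exact values.
payload = {
    "model": "llama-3.1-405b",
    "messages": [{"role": "user", "content": "Write a haiku about flying."}],
}
req = urllib.request.Request(
    "http://localhost:52415/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```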