![Carter Profile](https://pbs.twimg.com/profile_images/1483924488244838405/Y6gDulyM_x96.jpg)
Carter
@carterwsmith
Followers
59
Following
3K
Statuses
133
@T3Bracketology from a UCI watcher for over 10 years, do you really think they are a bubble in? maybe if they beat Duquesne and lost in the final to UCSD, but no other way
1
0
0
@BrigitMurtaugh The inline product is useful, but every time I try to use chat (which is very hard to access or add context to using keyboard shortcuts), 50% of the time it tries to use some random function call that truncates the output and results in a meaningless change
1
0
0
@PeterLakeSounds @Citrini7 It's not about the difference of the outputs, it's about what you think intelligence is
0
0
0
RT @alec_lewis: Sam Darnold has 13 games this season with a passer rating above 100. That's the second most in NFL history. Like, ever.…
0
98
0
@coldhealing Few experience the pleasures of the big city and central illinois within 24 hours
1
0
26
A week before Thanksgiving, Cerebras announced the lowest time-to-first-token latency running Llama 3.1 405B on their chips. Here's what they intentionally DIDN'T say:

At first glance, 128K context length and prices of $6/million input tokens and $12/million output tokens sound pretty convincing. But what's not reported is *their* dollar cost per token, which renders their inference essentially impractical for any real use case. Here's the math breakdown:

With 44GB of SRAM per chip, running a 405B model at FP16 (2 bytes/parameter) needs ~1TB of memory. This includes all the weights (810GB) plus additional memory for activations, working memory for computations, etc.

The KV cache for 128K tokens requires 2 (FP16) * 2 (K and V) * 16384 (405B hidden size) * 128,000 / 1e9 = ~8GB of memory per user.

That means to run single-user decoding on a 405B, you need ceil((1000 + 8) / 44) = 23 racks. 23 racks * $2.5M/rack = $57 million upfront cost to support 1 user. Each additional user adds ~$450k. Compared to inference, training is even worse since you need to store the gradients on chip as well.

All of the above doesn't even factor in power usage. Each chip/system has a 750W TDP; assuming continuous operation (24/7) and an average US electricity cost of ~$0.12 per kWh, 750W × 23 racks = 17.25 kW, which translates to roughly $1,490/month. That's insignificant compared to the hardware investment.

Now, the context length limit also depends on the pipeline depth, but for Llama-70B running with 4 racks (the bare minimum to run on WSE) it's ~8K tokens due to the memory architecture. While you can increase context length by adding more servers to your pipeline, 8xH100s cost ~$240K to buy and can run Llama 70B no problem, vs. a Cerebras setup for the same model that costs $10M+ to buy and comes with the context length limitation.

This is all to say benchmarks aren't everything when choosing how to deploy to prod. They're often an idealized version that doesn't translate once real-world constraints are factored in.
0
0
0
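A rough Python sketch of the arithmetic in the post above. All constants (44GB of SRAM per rack, $2.5M/rack, 750W TDP per system, $0.12/kWh, and a KV cache sized from the full hidden dimension rather than per-KV-head layout) are taken from the post itself and are assumptions, not verified Cerebras or Llama specs.

```python
import math

# Assumed figures, copied from the post above (not vendor-confirmed):
PARAMS_B = 405              # Llama 3.1 405B
BYTES_PER_PARAM = 2         # FP16
HIDDEN_SIZE = 16384         # hidden size used in the post's KV-cache estimate
CONTEXT_TOKENS = 128_000
SRAM_PER_RACK_GB = 44       # on-wafer SRAM per system/rack
COST_PER_RACK = 2.5e6       # $2.5M per rack
TDP_PER_RACK_W = 750        # per-system TDP assumed in the post
ELECTRICITY_PER_KWH = 0.12  # average US electricity price

# Weights plus headroom for activations/working memory (~1TB total in the post)
weights_gb = PARAMS_B * BYTES_PER_PARAM   # 810 GB of raw weights
total_model_gb = 1000                     # rounded up, as in the post

# KV cache per user: 2 bytes (FP16) * 2 (K and V) * hidden size * tokens
kv_cache_gb = 2 * 2 * HIDDEN_SIZE * CONTEXT_TOKENS / 1e9   # ~8.4 GB

# Racks needed for a single user, upfront cost, and marginal cost per extra user
racks = math.ceil((total_model_gb + kv_cache_gb) / SRAM_PER_RACK_GB)  # 23
upfront_cost = racks * COST_PER_RACK                                  # ~$57.5M
cost_per_extra_user = kv_cache_gb / SRAM_PER_RACK_GB * COST_PER_RACK  # ~$450-480k

# Continuous-operation power cost (24/7, ~30-day month)
power_kw = TDP_PER_RACK_W * racks / 1000                        # 17.25 kW
monthly_power_cost = power_kw * 24 * 30 * ELECTRICITY_PER_KWH   # ~$1,490/month

print(f"weights: {weights_gb} GB, KV cache per user: {kv_cache_gb:.1f} GB")
print(f"racks: {racks}, upfront: ${upfront_cost / 1e6:.1f}M, "
      f"per extra user: ~${cost_per_extra_user / 1e3:.0f}k")
print(f"power draw: {power_kw:.2f} kW, ~${monthly_power_cost:,.0f}/month")
```

Running it reproduces the post's numbers to within rounding: 23 racks, roughly $57M upfront, and a power bill around $1,490/month that is indeed negligible next to the hardware.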