Anton McGonnell Profile
Anton McGonnell (@aton2006)
Followers: 677 · Following: 422 · Media: 14 · Statuses: 617
@aton2006
Anton McGonnell
19 days
@corbtt This was obvious to us from the beginning. Mapping models to SRAM is incredibly inefficient. You can get lots of speed, but you need far too many systems to run one instance of one model. You also can't batch very much because KV cache size grows quadratically, which means more
11
13
146
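A back-of-the-envelope sketch of the SRAM-pressure argument in the tweet above. All model dimensions here are illustrative (roughly Llama-3-8B-shaped) and the SRAM figure is hypothetical, not any vendor's spec.

```python
# KV-cache sizing sketch (illustrative numbers, not vendor specs).
# Each generated token stores one K and one V vector per layer per KV head.

def kv_cache_bytes(seq_len, batch=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Total KV-cache size for `batch` sequences of length `seq_len`."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # K and V
    return batch * seq_len * per_token

print(kv_cache_bytes(8192) / 2**30, "GiB")  # one 8k sequence: ~1 GiB at 16-bit

# If weights and cache must both fit in on-chip SRAM (a few hundred MB per
# chip), even a modest batch needs many chips for the cache alone:
sram_per_chip = 230 * 2**20   # hypothetical ~230 MB of SRAM per chip
print(kv_cache_bytes(8192, batch=16) / sram_per_chip, "chips of SRAM")
```

This is why batching is limited on SRAM-resident systems: every extra concurrent sequence adds its own cache on top of the weights.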
@aton2006
Anton McGonnell
3 months
We did not have this model until yesterday, but in a day, our team deployed it to a live inference service at . We are the only provider running this model at 16-bit precision (storage and activations). And we are doing it on a single node, at 30 t/s. In
@SambaNovaAI
SambaNova Systems
3 months
📣 Hey, #Developers! In 24 hours, we have the new Llama 3.1 405B running on our SN40L RDUs. Armed with higher memory capacity on our state-of-the-art architecture, we can run the model with: 🎯 The highest precision 🖥️ Fewer chips 💡 Less energy Sign up to our API Program now
Tweet media one
9
33
64
0
8
25
@aton2006
Anton McGonnell
5 months
Really happy to see this published. We have kept the magic of dataflow and our memory architecture a secret for too long. This is an important step for SambaNova and only the beginning as we keep sharing more and pushing the boundaries of AI systems.
0
7
23
@aton2006
Anton McGonnell
5 months
Officially the fastest inference system in the world.
@ArtificialAnlys
Artificial Analysis
5 months
Artificial Analysis has independently benchmarked @SambaNovaAI's custom AI chips at 1,084 tokens/s on Llama 3 Instruct (8B)! 🏁 This is the fastest output speed we have benchmarked to date and >8 times faster than the median output speed across API providers of @Meta's Llama 3
Tweet media one
7
48
127
0
5
22
@aton2006
Anton McGonnell
3 months
Running Llama3 405B at 114 t/s or running it at 16-bit precision would both be huge, industry-first announcements. We are doing both. 114 t/s at 16-bit precision. No other company in the world can do this.
@ArtificialAnlys
Artificial Analysis
3 months
SambaNova is serving Llama 3.1 405B at 114 output tokens/s with their custom chips! This is the fastest we have benchmarked and 4X faster than the median of providers on Artificial Analysis. Larger models with higher quality come at the cost of speed. New AI-focused custom
Tweet media one
10
26
161
0
4
20
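A rough sanity check on why this claim is notable: during dense decode, every weight must be read at least once per generated token, so serving 405B at 16-bit at 114 t/s implies enormous weight-streaming bandwidth. The sketch below is a single-request upper bound (batching amortizes weight reads), not a hardware spec.

```python
# Naive decode-bandwidth arithmetic for a dense 405B model at 16-bit.
params = 405e9
bytes_per_param = 2            # BF16 weights
tokens_per_s = 114

bytes_per_token = params * bytes_per_param        # ~810 GB of weights per token
required_bw = bytes_per_token * tokens_per_s
print(f"~{required_bw / 1e12:.0f} TB/s effective weight bandwidth")  # ~92 TB/s
```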
@aton2006
Anton McGonnell
5 months
We already have the fastest system for generative inference in real-world use cases that involve large inputs (>1k tokens). And we are doing this at full precision and with a single SN40L node. Faster than GPUs and faster than Groq. We will continue reducing TTFT and increasing
@SambaNovaAI
SambaNova Systems
5 months
We keep getting faster. The SambaNova Platform is now running Llama3 8B at 510 tokens/second and has reduced our time to first token by 33%. 🚀🚀 Still running at full precision, and still running on only 8 chips! SambaNova has the fastest end-to-end inference performance in the
10
30
140
1
3
17
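Time to first token (TTFT) on long prompts is dominated by prefill compute, which is what the tweet above is about. A crude estimate, with all hardware numbers hypothetical:

```python
# Crude TTFT estimate: prefill costs roughly 2 * params FLOPs per prompt
# token (matmul-dominated; attention terms ignored). Numbers are hypothetical.
def ttft_seconds(prompt_tokens, params=8e9, peak_flops=1e15, efficiency=0.5):
    return 2 * params * prompt_tokens / (peak_flops * efficiency)

# A 4k-token prompt into an 8B model on a 1 PFLOP/s system at 50% efficiency:
print(f"{ttft_seconds(4096):.2f} s")   # ~0.13 s
```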
@aton2006
Anton McGonnell
8 years
A very worrying look at NI's investment potential going forward. Something has to be done. #Brexit #NI
0
1
16
@aton2006
Anton McGonnell
2 months
Thanks @AndrewYNg! We are the only provider in the world running Llama 405B at full precision and at over 100 t/s. Get access here:
@AndrewYNg
Andrew Ng
2 months
I've been playing with @SambaNovaAI's API serving fast Llama 3.1 405B tokens. Really cool to see a leading model running at speed. Congrats to SambaNova for hitting a 114 tokens/sec speed record (and also thanks @KunleOlukotun for getting me an API key!)
19
99
398
0
0
17
@aton2006
Anton McGonnell
5 months
Careless quantization kills model quality. @GroqInc have sacrificed quality to chase speed. And even then they need to run on hundreds of chips. @SambaNovaAI does not compromise on the model quality, whilst running on a single system.
@changran_hu
Changran Hu
5 months
Quantization hurts! Was playing with llama3 8B from @SambaNovaAI and @GroqInc on some financial questions. The @SambaNovaAI API (bf16) got me the answer correct at $9.5 million, whereas the @GroqInc (quantized) answer missed a “million” (🫨!) and also got the number wrong at $1.8 #llm #ai
Tweet media one
1
12
33
0
3
15
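A toy numpy demonstration of the kind of rounding error quantization introduces. This is a generic symmetric int8 round-trip, not the scheme any particular provider actually uses:

```python
import numpy as np

# Generic symmetric per-tensor int8 round-trip (illustrative only).
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # stand-in weight tensor

scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale          # dequantized copy

err = np.abs(w - w_deq)
print(f"max abs error:  {err.max():.5f}")
print(f"mean abs error: {err.mean():.5f}")
# Per-weight errors are tiny, but they compound across dozens of layers and
# thousands of decode steps, which is how numeric answers can drift.
```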
@aton2006
Anton McGonnell
5 months
Groq throwing more chips at their quantized version of Llama3 8B to beat our number. Whilst we run at full precision on a single server. This is fun!
0
1
13
@aton2006
Anton McGonnell
5 months
@ArtificialAnlys @GroqInc Plenty more room for @SambaNovaAI to optimize too, without running on hundreds of chips! Looking forward to seeing what you come up with next @GroqInc ! GPUs can't play this game like we can!
2
1
12
@aton2006
Anton McGonnell
5 months
@ArtificialAnlys @GroqInc Very important to note: @GroqInc quantize the model, which kills its quality. They sacrifice quality for speed. @SambaNovaAI runs over 1000 t/s at full precision.
Tweet media one
1
3
12
@aton2006
Anton McGonnell
7 months
@SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok In case anyone is wondering what a Composition of Experts is - it is a collection of independent expert models sitting behind a single endpoint, with a router that provides a single-model experience. Only possible on SambaNova's systems.
1
1
11
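A minimal sketch of the idea as described in the tweet above: independent experts behind one endpoint with a router. All names and the keyword router are hypothetical stand-ins, not SambaNova's API:

```python
# Composition of Experts sketch: independent expert models behind a single
# endpoint, with a router choosing one per request. Hypothetical names only.
from typing import Callable, Dict

Expert = Callable[[str], str]

class CompositionOfExperts:
    def __init__(self, experts: Dict[str, Expert], route: Callable[[str], str]):
        self.experts = experts
        self.route = route          # maps a prompt to an expert name

    def __call__(self, prompt: str) -> str:
        return self.experts[self.route(prompt)](prompt)

# Toy keyword router standing in for a learned router model.
coe = CompositionOfExperts(
    experts={
        "code": lambda p: f"[code expert] {p}",
        "general": lambda p: f"[general expert] {p}",
    },
    route=lambda p: "code" if "def " in p or "bug" in p else "general",
)
print(coe("Why does this bug happen?"))   # routed to the code expert
```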
@aton2006
Anton McGonnell
6 months
We already have the fastest AI inference system in the world, and we will keep getting faster.
@SambaNovaAI
SambaNova Systems
6 months
3
15
59
1
2
10
@aton2006
Anton McGonnell
3 months
@_xjdr SambaNova runs 405B at BF16. Can get you an API key if you wanna try it out.
1
4
8
@aton2006
Anton McGonnell
8 months
We believe this to be an industry-defining paradigm shift. Samba-1, a 1T-parameter Composition of Experts, will prove itself to be the most accurate, secure, scalable, and cost-efficient approach to making enterprise AI ubiquitous. I am so grateful to be part of this team.
@SambaNovaAI
SambaNova Systems
8 months
Introducing Samba-1, the first one trillion (1T) parameter model for the regulated enterprise that can be fine-tuned, is private, secure, and 10X more efficient than any other model of its size. Samba-1 models have been trained across a variety of different use cases, tasks,
Tweet media one
0
4
30
0
4
9
@aton2006
Anton McGonnell
4 months
Quantization in search of speed is not free. Unless done very carefully, it will hurt the model's quality. Meta spent millions of dollars to overtrain Llama3 8B way past the Chinchilla scaling law, because it meant there was a highly performant model in a form factor that makes it
@SambaNovaAI
SambaNova Systems
4 months
How does our inference system compare to Groq’s on downstream tasks? To what extent does the difference in their precision vs. ours affect model performance? 🤔 We performed holistic and fair comparisons between the two systems on over 15 general knowledge tasks and found that,
Tweet media one
1
14
53
0
0
9
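The rough arithmetic behind "way past the Chinchilla scaling law": Chinchilla-optimal training is roughly 20 tokens per parameter, and Meta reported training Llama 3 on about 15T tokens.

```python
# Rough Chinchilla arithmetic for Llama 3 8B. "20 tokens per parameter" is
# the usual Chinchilla rule of thumb; the 15T figure is as publicly reported.
params = 8e9
chinchilla_tokens = 20 * params     # ~160B tokens would be compute-optimal
llama3_tokens = 15e12               # Meta reported ~15T training tokens

print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Overtraining factor: ~{llama3_tokens / chinchilla_tokens:.0f}x")  # ~94x
```

The point of the tweet: that extra training compute buys quality in a small form factor, and careless quantization gives some of it back.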
@aton2006
Anton McGonnell
7 months
@SakanaAILabs This is very exciting - are you thinking about only static merging i.e. build a single model from the different blocks, or are you also looking into dynamically merging blocks for sparse activation at inference time?
0
0
7
@aton2006
Anton McGonnell
6 months
@lifebypixels @SambaNovaAI @AIatMeta What do you mean? We have over 200B parameters running on this node. How many chips do you need to run 200B params?
1
0
8
@aton2006
Anton McGonnell
3 months
Agentic AI, Compound Model Systems, Compositions of Experts, or whatever you call lots of models and calls orchestrated together, is the next phase of AI. This next phase needs ultra-fast token generation, lots of model variety, and instantaneous model switching. Sign up!
@SambaNovaAI
SambaNova Systems
3 months
Are you looking to unlock lightning-fast inferencing speed at 1000+ tokens/sec on your own custom Llama3? Introducing SambaNova Fast API, available today with free token-based credits to make it easier to build AI apps like chatbots and more. Bring your own custom checkpoint for
0
8
25
1
1
8
@aton2006
Anton McGonnell
7 months
@jiayq @SambaNovaAI Thanks @jiayq, very excited about our partnership.
0
0
6
@aton2006
Anton McGonnell
6 months
@dylan522p I don’t get this logic exactly. The higher you batch, the less you save with MoE. In all likelihood, nearly all experts are loaded into memory when your batch size is higher than your number of experts, assuming a normal distribution of request-to-expert routing. E.g. in the
2
0
5
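A quick check of the routing claim above, under the simpler assumption of uniform routing (expert counts are hypothetical):

```python
# With uniform routing, once the batch exceeds the expert count, nearly
# every expert gets activated, so little memory is saved by MoE sparsity.
def expected_active_experts(batch, n_experts, top_k=2):
    """Expected number of distinct experts hit per layer, uniform routing."""
    p_idle = (1 - top_k / n_experts) ** batch
    return n_experts * (1 - p_idle)

for batch in (1, 8, 64, 256):
    print(batch, f"{expected_active_experts(batch, n_experts=8):.2f} / 8")
# batch=1 -> 2.00, batch=8 -> 7.20, batch=64 -> 8.00: all experts resident.
```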
@aton2006
Anton McGonnell
6 years
Check out my latest article: The Missing Link Between Machine Learning & Enterprise via @LinkedIn
0
1
5
@aton2006
Anton McGonnell
10 years
#YoungInfluencers discussed on @queensradio this evening. The movement is gaining ground. Well done @zutch
0
1
5
@aton2006
Anton McGonnell
5 months
@altryne @SambaNovaAI @GroqInc @VentureBeat Groq quantizes the model though! Quality should not be sacrificed for speed!
1
0
6
@aton2006
Anton McGonnell
3 months
@chris_j_paxton There would be no distilled data to train the new 70B and 8B versions without the 405B model.
0
0
5
@aton2006
Anton McGonnell
28 days
@tsarnick This simply isn’t true though.
1
0
5
@aton2006
Anton McGonnell
28 days
Quantization hurts
@ArtificialAnlys
Artificial Analysis
28 days
There has been a lot of discussion whether there is a measurable difference between FP8 & BF16. Our independent quality evaluations of Llama 3.1 405B providers show that there is indeed a difference. @SambaNovaAI and @hyperbolic_labs have both declared they are serving Llama 3.1
Tweet media one
5
23
128
0
0
4
@aton2006
Anton McGonnell
5 months
@MingranW @mingranw is the RDU whisperer
0
0
4
@aton2006
Anton McGonnell
1 month
@kimmonismus @SambaNovaAI is faster. Up to 570 t/s on 70B and up to 140 t/s on 405B.
1
0
4
@aton2006
Anton McGonnell
7 months
@IntuitMachine @bindureddy All of the models are open source; we are just showing the power of merging them together behind a single endpoint. The story here is composing expert models together on a single system with crazy speed - that is the future!!!
1
0
3
@aton2006
Anton McGonnell
9 years
@datasentiment @PEdgar15 @Y_Influencers @theshawe Pot only gets bigger when we begin thinking commercially - need to start somewhere #YIVOTE
1
1
4
@aton2006
Anton McGonnell
1 month
@yar_vol @ArtificialAnlys @SambaNovaAI @AIatMeta This isn't true. We are not subsidizing this, whilst our competitors are. Anyone can post price lists and rate limits, but can they service demand? 'Very Low' versus 'High' rate limits are entirely relative.
1
0
4
@aton2006
Anton McGonnell
4 months
@JonathanRoss321 You talk about precision at activation, but not about what precision weights are stored at. Why don't you let everyone know the precision you store weights at? This affects accuracy.
0
0
4
@aton2006
Anton McGonnell
28 days
@ArtificialAnlys @hyperbolic_labs As you can see from the charts, @SambaNovaAI also runs all Llama 3.1 models at BF16 precision.
0
0
4
@aton2006
Anton McGonnell
3 months
@gblazex @SambaNovaAI @ArtificialAnlys Thank you @gblazex - FYI we (SambaNova) have not quantized this model at all and are running it at 16-bit precision.
0
0
4
@aton2006
Anton McGonnell
19 days
@migtissera @casper_hansen_ @SambaNovaAI can actually make the economics work whilst providing crazy fast inference.
0
0
4
@aton2006
Anton McGonnell
7 months
@sv_techie We aren’t trying to compete with the open LLM providers; we are trying to show that when you put lots of the open-source expert models together, you unlock step-function-better capabilities. This runs better and at bigger scale on our chips than anywhere else.
1
0
3
@aton2006
Anton McGonnell
5 months
@rohanpaul_ai @GroqInc wow, maybe you shouldn’t quantize all of the models you run to blindly chase speed? @SambaNovaAI manages to offer speed and accuracy, why can’t you?
1
1
3
@aton2006
Anton McGonnell
29 days
@KyleLiang5 never ceases to amaze
@VentureBeat
VentureBeat
30 days
SambaNova challenges OpenAI's o1 model with Llama 3.1-powered demo on HuggingFace
4
15
51
0
0
3
@aton2006
Anton McGonnell
6 months
We are very fast, and the best part is that we do this with a single node: you can run hundreds of variants of Llama3 on this single node whilst still maintaining this speed. And we are going to get much faster.
0
1
3
@aton2006
Anton McGonnell
5 months
@conanbr @SambaNovaAI @GroqInc @nvidia Thanks Thyago! We think we are more than decent - we don’t quantize the models like Groq does, and we run at this speed on 10x fewer chips (maybe more)!!!
0
0
3
@aton2006
Anton McGonnell
6 months
@dylan522p @jiayq ...and you can't just choose any random 8 vGPUs; you need the jobs to actually run on an 8-socket physical server due to interconnect. So fungibility becomes much much less granular and the economics for a company owning the infra themselves suddenly makes sense again.
0
0
1
@aton2006
Anton McGonnell
10 years
At the #etigrowthandjobs event in Stormont representing @Y_Influencers
1
1
3
@aton2006
Anton McGonnell
1 month
@vithursant19 @CerebrasSystems Weird claim to make when you aren't running 405B, which is the most accurate open source model in the world.
0
0
3
@aton2006
Anton McGonnell
28 days
@vokaysh @aidan_mclau Coming very soon
0
0
3
@aton2006
Anton McGonnell
5 months
@Sentdex These aren’t real though
0
0
3
@aton2006
Anton McGonnell
6 months
@SquashBionic What do you mean? ☺️
0
0
3
@aton2006
Anton McGonnell
5 months
@unclecode @GroqInc @JonathanRoss321 It is quantized though, so model quality is heavily compromised. If you want speed at full precision, try @SambaNovaAI
0
0
3
@aton2006
Anton McGonnell
7 months
@francoisfleuret Compositions of Experts, i.e. multi-agent systems, ensembles, dynamic model merging, etc. This is the next wave of innovation: lots of expert models orchestrated together as one.
1
0
3
@aton2006
Anton McGonnell
1 month
@_xjdr Have you tried ? 405B BF16 at 130t/s
0
0
3
@aton2006
Anton McGonnell
3 months
@drivelinekyle @simonw @hyperbolic_labs SambaNova provides the 16-bit version running at over 100 t/s!
@ArtificialAnlys
Artificial Analysis
3 months
SambaNova is serving Llama 3.1 405B at 114 output tokens/s with their custom chips! This is the fastest we have benchmarked and 4X faster than the median of providers on Artificial Analysis. Larger models with higher quality come at the cost of speed. New AI-focused custom
Tweet media one
10
26
161
1
0
3
@aton2006
Anton McGonnell
2 months
@zjasper666 @Teknium1 @lmsysorg @hyperbolic_labs @togethercompute @OpenRouterAI @SambaNovaAI runs 405B BF16 at 120t/s. Happy to give you an endpoint if you wanna add to the eval list.
0
0
2
@aton2006
Anton McGonnell
7 months
@gazorp5 @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok It isn't closed source; it is leveraging open source entirely, we just haven't shared how yet, but that is coming. Also, we do compare to the closed-source models on Alpaca - we are very confident that we will climb much higher in upcoming releases.
1
0
2
@aton2006
Anton McGonnell
9 years
@corrineheaney @Y_Influencers Need to do more at 3rd level + sell ourselves better; success there and apprenticeships etc will follow #YIVOTE
0
2
2
@aton2006
Anton McGonnell
5 months
@bindureddy @IntuitMachine Why don’t specialized models work? In what context?
1
0
2
@aton2006
Anton McGonnell
7 months
@erhartford @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok Wow, you seem like fun. I have explained above: we make chips with a large, fast, and programmable memory architecture. GPUs cannot do this at scale. If you think it can be done on other systems, please explain how. Until then, my stance remains.
0
0
1
@aton2006
Anton McGonnell
6 months
@cis_female @mplappert @dylan522p Which requires probably 12 GPUs minimum. So you have 11 idle GPUs in this scenario that are unusable until the job finishes. So you save on power but not on utilization. Solved problem on RDUs though :D
1
0
2
@aton2006
Anton McGonnell
1 month
@CerebrasSystems Where is 405B?
0
0
2
@aton2006
Anton McGonnell
2 months
@appenz @semiDL @MLPerf This is exactly the problem with MLPerf. These are impractical benchmarks, designed for Nvidia, that do not represent the real world. SambaNova, Groq, and Cerebras are all proving that real-time inference is something GPUs cannot do well, and now the question is which of the
2
0
2
@aton2006
Anton McGonnell
9 years
@corrineheaney Like to see more of the 75% do apprenticeships which may suit them more. 30k in debt, degree in Sport, no prospects #YIVOTE
1
0
2
@aton2006
Anton McGonnell
10 years
Tweet media one
0
1
2
@aton2006
Anton McGonnell
7 months
@AndrewYNg @RichardAGetz @SelfInfinity @groq Running lots of models together in a single system with the ability to generate tokens very quickly and dynamically pipeline them together is what is needed. This is what SambaNova does uniquely.
0
0
2
@aton2006
Anton McGonnell
6 months
@dylan522p @cis_female @mplappert Yeah, I was replying that doing that with one GPU is not feasible, as you need to load everything into HBM; so you need the GPUs for memory, and thus may as well also use them for compute.
0
0
2
@aton2006
Anton McGonnell
7 months
@appenz @swayambhoo But anything that involves harnessing the collective power of lots of expert models like multi-agents, ensembles, model merging etc is where our systems have a huge advantage and why we want to push to community in this direction.
0
0
2
@aton2006
Anton McGonnell
5 months
@sundeep @GroqInc Hey @sundeep - what precision are you guys storing and activating weights at?
1
0
2
@aton2006
Anton McGonnell
11 years
Quarter final of All-Ireland is a nothing game. Mark Sidebottom never ceases to amaze. #stoptalking
1
4
2
@aton2006
Anton McGonnell
7 months
@AlexanderDerve @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok This is built with open source; we will explain how soon. We very much want to get this onto Chatbot Arena. CC @lmsysorg
0
0
2
@aton2006
Anton McGonnell
7 months
@IntuitMachine @SambaNovaAI Expert models on a single node and with crazy fast speed, and we will get much faster. We make our own chips!
0
0
1
@aton2006
Anton McGonnell
5 months
@mag_pl @altryne @SambaNovaAI @GroqInc @VentureBeat @ArtificialAnlys Artificial Analysis benchmarks are awesome and are helping everyone discover the ground truth, but they don’t capture everything yet. Groq absolutely quantize - it is publicly stated in multiple places, and demonstrable by how their Llama3 8B performs worse than it does on SambaNova, Together, etc.
0
0
2