Anton McGonnell Profile
Anton McGonnell (@aton2006)
Followers: 677 · Following: 422 · Media: 14 · Statuses: 617
@aton2006
Anton McGonnell
19 days
@corbtt This was obvious to us from the beginning. Mapping models to SRAM is incredibly inefficient. You can get lots of speed, but you need far too many systems to run one instance of one model. You also can't batch very much because KV cache size grows quadratically, which means more
11
13
146
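A back-of-the-envelope sketch of the SRAM-pressure argument in the tweet above. All model dimensions here are illustrative (roughly Llama-3-8B-shaped) and the SRAM figure is hypothetical, not any vendor's spec.

```python
# KV-cache sizing sketch (illustrative numbers, not vendor specs).
# Each generated token stores one K and one V vector per layer per KV head.

def kv_cache_bytes(seq_len, batch=1, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Total KV-cache size for `batch` sequences of length `seq_len`."""
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # K and V
    return batch * seq_len * per_token

print(kv_cache_bytes(8192) / 2**30, "GiB")  # one 8k sequence: ~1 GiB at 16-bit

# If weights and cache must both fit in on-chip SRAM (a few hundred MB per
# chip), even a modest batch needs many chips for the cache alone:
sram_per_chip = 230 * 2**20   # hypothetical ~230 MB of SRAM per chip
print(kv_cache_bytes(8192, batch=16) / sram_per_chip, "chips of SRAM")
```

This is why batching is limited on SRAM-resident systems: every extra concurrent sequence adds its own cache on top of the weights.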
@aton2006
Anton McGonnell
3 months
We did not have this model until yesterday, but in a day, our team deployed it to a live inference service at . We are the only provider running this model at 16-bit precision (storage and activations). And we are doing it on a single node, at 30 t/s. In
@SambaNovaAI
SambaNova Systems
3 months
📣 Hey, #Developers! In 24 hours, we have the new Llama 3.1 405B running on our SN40L RDUs. Armed with higher memory capacity on our state-of-the-art architecture, we can run the model with: 🎯 The highest precision 🖥️ Fewer chips 💡 Less energy Sign up to our API Program now
Tweet media one
9
33
64
0
8
25
@aton2006
Anton McGonnell
5 months
Really happy to see this published. We have kept the magic of dataflow and our memory architecture a secret for too long. This is an important step for SambaNova and only the beginning as we keep sharing more and pushing the boundaries of AI systems.
0
7
23
@aton2006
Anton McGonnell
5 months
Officially the fastest inference system in the world.
@ArtificialAnlys
Artificial Analysis
5 months
Artificial Analysis has independently benchmarked @SambaNovaAI's custom AI chips at 1,084 tokens/s on Llama 3 Instruct (8B)! 🏁 This is the fastest output speed we have benchmarked to date and >8 times faster than the median output speed across API providers of @Meta's Llama 3
Tweet media one
7
48
127
0
5
22
@aton2006
Anton McGonnell
3 months
Running Llama3 405B at 114 t/s or running it at 16-bit precision would both be huge, industry-first announcements. We are doing both. 114 t/s at 16-bit precision. No other company in the world can do this.
@ArtificialAnlys
Artificial Analysis
3 months
SambaNova is serving Llama 3.1 405B at 114 output tokens/s with their custom chips! This is the fastest we have benchmarked and 4X faster than the median of providers on Artificial Analysis. Larger models with higher quality come at the cost of speed. New AI-focused custom
Tweet media one
10
26
161
0
4
20
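A rough sanity check on why this claim is notable: during dense decode, every weight must be read at least once per generated token, so serving 405B at 16-bit at 114 t/s implies enormous weight-streaming bandwidth. The sketch below is a single-request upper bound (batching amortizes weight reads), not a hardware spec.

```python
# Naive decode-bandwidth arithmetic for a dense 405B model at 16-bit.
params = 405e9
bytes_per_param = 2            # BF16 weights
tokens_per_s = 114

bytes_per_token = params * bytes_per_param        # ~810 GB of weights per token
required_bw = bytes_per_token * tokens_per_s
print(f"~{required_bw / 1e12:.0f} TB/s effective weight bandwidth")  # ~92 TB/s
```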
@aton2006
Anton McGonnell
5 months
We already have the fastest system for generative inference in real-world use cases that involve large inputs (>1k tokens). And we are doing this at full precision and with a single SN40L node. Faster than GPUs and faster than Groq. We will continue reducing TTFT and increasing
@SambaNovaAI
SambaNova Systems
5 months
We keep getting faster. The SambaNova Platform is now running Llama3 8B at 510 tokens/second and has reduced our time to first token by 33%. 🚀🚀 Still running at full precision, and still running on only 8 chips! SambaNova has the fastest end-to-end inference performance in the
10
30
140
1
3
17
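Time to first token (TTFT) on long prompts is dominated by prefill compute, which is what the tweet above is about. A crude estimate, with all hardware numbers hypothetical:

```python
# Crude TTFT estimate: prefill costs roughly 2 * params FLOPs per prompt
# token (matmul-dominated; attention terms ignored). Numbers are hypothetical.
def ttft_seconds(prompt_tokens, params=8e9, peak_flops=1e15, efficiency=0.5):
    return 2 * params * prompt_tokens / (peak_flops * efficiency)

# A 4k-token prompt into an 8B model on a 1 PFLOP/s system at 50% efficiency:
print(f"{ttft_seconds(4096):.2f} s")   # ~0.13 s
```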
@aton2006
Anton McGonnell
8 years
A very worrying look at NI's investment potential going forward. Something has to be done. #Brexit #NI
0
1
16
@aton2006
Anton McGonnell
2 months
Thanks @AndrewYNg! We are the only provider in the world running Llama 405B at full precision and at over 100 t/s. Get access here:
@AndrewYNg
Andrew Ng
2 months
I've been playing with @SambaNovaAI's API serving fast Llama 3.1 405B tokens. Really cool to see a leading model running at speed. Congrats to SambaNova for hitting a 114 tokens/sec speed record (and also thanks @KunleOlukotun for getting me an API key!)
19
99
398
0
0
17
@aton2006
Anton McGonnell
5 months
Careless quantization kills model quality. @GroqInc have sacrificed quality to chase speed. And even then they need to run on hundreds of chips. @SambaNovaAI does not compromise on the model quality, whilst running on a single system.
@changran_hu
Changran Hu
5 months
Quantization hurts! Was playing with llama3 8B from @SambaNovaAI and @GroqInc on some financial questions. The @SambaNovaAI API (bf16) got me the answer correct at $9.5 million, whereas the @GroqInc (quantized) answer missed a “million” (🫨!) and also got the number wrong at $1.8 #llm #ai
Tweet media one
1
12
33
0
3
15
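A toy numpy demonstration of the kind of rounding error quantization introduces. This is a generic symmetric int8 round-trip, not the scheme any particular provider actually uses:

```python
import numpy as np

# Generic symmetric per-tensor int8 round-trip (illustrative only).
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)   # stand-in weight tensor

scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale          # dequantized copy

err = np.abs(w - w_deq)
print(f"max abs error:  {err.max():.5f}")
print(f"mean abs error: {err.mean():.5f}")
# Per-weight errors are tiny, but they compound across dozens of layers and
# thousands of decode steps, which is how numeric answers can drift.
```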
@aton2006
Anton McGonnell
5 months
Groq throwing more chips at their quantized version of Llama3 8B to beat our number. Whilst we run at full precision on a single server. This is fun!
0
1
13
@aton2006
Anton McGonnell
5 months
@ArtificialAnlys @GroqInc Plenty more room for @SambaNovaAI to optimize too, without running on hundreds of chips! Looking forward to seeing what you come up with next @GroqInc ! GPUs can't play this game like we can!
2
1
12
@aton2006
Anton McGonnell
5 months
@ArtificialAnlys @GroqInc Very important to note: @GroqInc quantize the model, which kills its quality. They sacrifice quality for speed. @SambaNovaAI runs over 1000 t/s at full precision.
Tweet media one
1
3
12
@aton2006
Anton McGonnell
7 months
@SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok In case anyone is wondering what a Composition of Experts is - it is a collection of independent expert models sitting behind a single endpoint, with a router that provides a single-model experience. Only possible on SambaNova's systems.
1
1
11
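A minimal sketch of the idea as described in the tweet above: independent experts behind one endpoint with a router. All names and the keyword router are hypothetical stand-ins, not SambaNova's API:

```python
# Composition of Experts sketch: independent expert models behind a single
# endpoint, with a router choosing one per request. Hypothetical names only.
from typing import Callable, Dict

Expert = Callable[[str], str]

class CompositionOfExperts:
    def __init__(self, experts: Dict[str, Expert], route: Callable[[str], str]):
        self.experts = experts
        self.route = route          # maps a prompt to an expert name

    def __call__(self, prompt: str) -> str:
        return self.experts[self.route(prompt)](prompt)

# Toy keyword router standing in for a learned router model.
coe = CompositionOfExperts(
    experts={
        "code": lambda p: f"[code expert] {p}",
        "general": lambda p: f"[general expert] {p}",
    },
    route=lambda p: "code" if "def " in p or "bug" in p else "general",
)
print(coe("Why does this bug happen?"))   # routed to the code expert
```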
@aton2006
Anton McGonnell
6 months
We already have the fastest AI inference system in the world, and we will keep getting faster.
@SambaNovaAI
SambaNova Systems
6 months
3
15
59
1
2
10
@aton2006
Anton McGonnell
3 months
@_xjdr SambaNova runs 405B at BF16. Can get you an API key if you wanna try it out.
1
4
8
@aton2006
Anton McGonnell
8 months
We believe this to be an industry-defining paradigm shift. Samba-1, a 1T-parameter Composition of Experts, will prove itself to be the most accurate, secure, scalable, and cost-efficient approach to making enterprise AI ubiquitous. I am so grateful to be part of this team.
@SambaNovaAI
SambaNova Systems
8 months
Introducing Samba-1, the first one trillion (1T) parameter model for the regulated enterprise that can be fine-tuned, is private, secure, and 10X more efficient than any other model of its size. Samba-1 models have been trained across a variety of different use cases, tasks,
Tweet media one
0
4
30
0
4
9
@aton2006
Anton McGonnell
4 months
Quantization in search of speed is not free. Unless done very carefully, it will hurt the model's quality. Meta spent millions of dollars to overtrain Llama3 8B way past the Chinchilla scaling law, because it meant there was a highly performant model in a form factor that makes it
@SambaNovaAI
SambaNova Systems
4 months
How does our inference system compare to Groq’s on downstream tasks? To what extent does the difference in their precision vs. ours affect model performance? 🤔 We performed holistic and fair comparisons between the two systems on over 15 general knowledge tasks and found that,
Tweet media one
1
14
53
0
0
9
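The rough arithmetic behind "way past the Chinchilla scaling law": Chinchilla-optimal training is roughly 20 tokens per parameter, and Meta reported training Llama 3 on about 15T tokens.

```python
# Rough Chinchilla arithmetic for Llama 3 8B. "20 tokens per parameter" is
# the usual Chinchilla rule of thumb; the 15T figure is as publicly reported.
params = 8e9
chinchilla_tokens = 20 * params     # ~160B tokens would be compute-optimal
llama3_tokens = 15e12               # Meta reported ~15T training tokens

print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Overtraining factor: ~{llama3_tokens / chinchilla_tokens:.0f}x")  # ~94x
```

The point of the tweet: that extra training compute buys quality in a small form factor, and careless quantization gives some of it back.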
@aton2006
Anton McGonnell
7 months
@SakanaAILabs This is very exciting - are you thinking about only static merging i.e. build a single model from the different blocks, or are you also looking into dynamically merging blocks for sparse activation at inference time?
0
0
7
@aton2006
Anton McGonnell
6 months
@lifebypixels @SambaNovaAI @AIatMeta What do you mean? We have over 200B parameters running on this node. How many chips do you need to run 200B params?
1
0
8
@aton2006
Anton McGonnell
3 months
Agentic AI, Compound Model Systems, Compositions of Experts, or whatever you call lots of models and calls orchestrated together, is the next phase of AI. This next phase needs ultra-fast token generation, lots of model variety, and instantaneous model switching. Sign up!
@SambaNovaAI
SambaNova Systems
3 months
Are you looking to unlock lightning-fast inferencing speed at 1000+ tokens/sec on your own custom Llama3? Introducing SambaNova Fast API, available today with free token-based credits to make it easier to build AI apps like chatbots and more. Bring your own custom checkpoint for
0
8
25
1
1
8
@aton2006
Anton McGonnell
7 months
@jiayq @SambaNovaAI Thanks @jiayq, very excited about our partnership.
0
0
6
@aton2006
Anton McGonnell
6 months
@dylan522p I don’t get this logic exactly. The higher you batch, the less you save with MoE. In all likelihood, nearly all experts are loaded into memory when your batch size is higher than your number of experts, assuming a normal distribution of request-to-expert routing. E.g. in the
2
0
5
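A quick check of the routing claim above, under the simpler assumption of uniform routing (expert counts are hypothetical):

```python
# With uniform routing, once the batch exceeds the expert count, nearly
# every expert gets activated, so little memory is saved by MoE sparsity.
def expected_active_experts(batch, n_experts, top_k=2):
    """Expected number of distinct experts hit per layer, uniform routing."""
    p_idle = (1 - top_k / n_experts) ** batch
    return n_experts * (1 - p_idle)

for batch in (1, 8, 64, 256):
    print(batch, f"{expected_active_experts(batch, n_experts=8):.2f} / 8")
# batch=1 -> 2.00, batch=8 -> 7.20, batch=64 -> 8.00: all experts resident.
```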
@aton2006
Anton McGonnell
6 years
Check out my latest article: The Missing Link Between Machine Learning & Enterprise via @LinkedIn
0
1
5
@aton2006
Anton McGonnell
10 years
#YoungInfluencers discussed on @queensradio this evening. The movement is gaining ground. Well done @zutch
0
1
5
@aton2006
Anton McGonnell
5 months
@altryne @SambaNovaAI @GroqInc @VentureBeat Groq quantizes the model though! Quality should not be sacrificed for speed!
1
0
6
@aton2006
Anton McGonnell
3 months
@chris_j_paxton There would be no distilled data to train the new 70B and 8B versions without the 405B model.
0
0
5
@aton2006
Anton McGonnell
28 days
@tsarnick This simply isn’t true though.
1
0
5
@aton2006
Anton McGonnell
28 days
Quantization hurts
@ArtificialAnlys
Artificial Analysis
28 days
There has been a lot of discussion whether there is a measurable difference between FP8 & BF16. Our independent quality evaluations of Llama 3.1 405B providers show that there is indeed a difference. @SambaNovaAI and @hyperbolic_labs have both declared they are serving Llama 3.1
Tweet media one
5
23
128
0
0
4
@aton2006
Anton McGonnell
5 months
@MingranW @mingranw is the RDU whisperer
0
0
4
@aton2006
Anton McGonnell
1 month
@kimmonismus @SambaNovaAI is faster. Up to 570 t/s on 70B and up to 140 t/s on 405B.
1
0
4
@aton2006
Anton McGonnell
7 months
@IntuitMachine @bindureddy All of the models are open source; we are just showing the power of merging them together behind a single endpoint. The story here is composing expert models together on a single system with crazy speed - that is the future!!!
1
0
3
@aton2006
Anton McGonnell
9 years
@datasentiment @PEdgar15 @Y_Influencers @theshawe Pot only gets bigger when we begin thinking commercially - need to start somewhere #YIVOTE
1
1
4
@aton2006
Anton McGonnell
1 month
@yar_vol @ArtificialAnlys @SambaNovaAI @AIatMeta This isn't true. We are not subsidizing this, whilst our competitors are. Anyone can post price lists and rate limits, but can they service demand? 'Very Low' versus 'High' rate limits are entirely relative.
1
0
4
@aton2006
Anton McGonnell
4 months
@JonathanRoss321 You talk about precision at activation, but not about what precision weights are stored at. Why don't you let everyone know the precision you store weights at? This affects accuracy.
0
0
4
@aton2006
Anton McGonnell
28 days
@ArtificialAnlys @hyperbolic_labs As you can see from the charts, @SambaNovaAI also runs all Llama 3.1 models at BF16 precision.
0
0
4
@aton2006
Anton McGonnell
3 months
@gblazex @SambaNovaAI @ArtificialAnlys Thank you @gblazex - FYI we (SambaNova) have not quantized this model at all and are running it at 16-bit precision.
0
0
4
@aton2006
Anton McGonnell
19 days
@migtissera @casper_hansen_ @SambaNovaAI can actually make the economics work whilst providing crazy fast inference.
0
0
4
@aton2006
Anton McGonnell
7 months
@sv_techie We aren’t trying to compete with the open LLM providers; we are trying to show that when you put lots of the open-source expert models together, you unlock step-function-better capabilities. This runs better and at bigger scale on our chips than anywhere else.
1
0
3
@aton2006
Anton McGonnell
5 months
@rohanpaul_ai @GroqInc wow, maybe you shouldn’t quantize all of the models you run to blindly chase speed? @SambaNovaAI manages to offer speed and accuracy, why can’t you?
1
1
3
@aton2006
Anton McGonnell
29 days
@KyleLiang5 never ceases to amaze
@VentureBeat
VentureBeat
30 days
SambaNova challenges OpenAI's o1 model with Llama 3.1-powered demo on HuggingFace
4
15
51
0
0
3
@aton2006
Anton McGonnell
6 months
We are very fast, and the best part is that we do this with a single node: you can run hundreds of variants of Llama3 on this single node whilst still maintaining this speed. And we are going to get much faster.
0
1
3
@aton2006
Anton McGonnell
5 months
@conanbr @SambaNovaAI @GroqInc @nvidia Thanks Thyago! We think we are more than decent - we don’t quantize the models like Groq does, and we run at this speed on 10x fewer chips (maybe more)!!!
0
0
3
@aton2006
Anton McGonnell
6 months
@dylan522p @jiayq ...and you can't just choose any random 8 vGPUs; you need the jobs to actually run on an 8-socket physical server due to interconnect. So fungibility becomes much much less granular and the economics for a company owning the infra themselves suddenly makes sense again.
0
0
1
@aton2006
Anton McGonnell
10 years
At the #etigrowthandjobs event in Stormont representing @Y_Influencers
1
1
3
@aton2006
Anton McGonnell
1 month
@vithursant19 @CerebrasSystems Weird claim to make when you aren't running 405B, which is the most accurate open source model in the world.
0
0
3
@aton2006
Anton McGonnell
28 days
@vokaysh @aidan_mclau Coming very soon
0
0
3
@aton2006
Anton McGonnell
5 months
@Sentdex These aren’t real though
0
0
3
@aton2006
Anton McGonnell
6 months
@SquashBionic What do you mean? ☺️
0
0
3
@aton2006
Anton McGonnell
5 months
@unclecode @GroqInc @JonathanRoss321 It is quantized though, so model quality is heavily compromised. If you want speed at full precision, try @SambaNovaAI
0
0
3
@aton2006
Anton McGonnell
7 months
@francoisfleuret Compositions of Experts, i.e. multi-agent systems, ensembles, dynamic model merging, etc. This is the next wave of innovation: lots of expert models orchestrated together as one.
1
0
3
@aton2006
Anton McGonnell
1 month
@_xjdr Have you tried ? 405B BF16 at 130t/s
0
0
3
@aton2006
Anton McGonnell
3 months
@drivelinekyle @simonw @hyperbolic_labs SambaNova provides the 16-bit version running at over 100 t/s!
@ArtificialAnlys
Artificial Analysis
3 months
SambaNova is serving Llama 3.1 405B at 114 output tokens/s with their custom chips! This is the fastest we have benchmarked and 4X faster than the median of providers on Artificial Analysis. Larger models with higher quality come at the cost of speed. New AI-focused custom
Tweet media one
10
26
161
1
0
3
@aton2006
Anton McGonnell
2 months
@zjasper666 @Teknium1 @lmsysorg @hyperbolic_labs @togethercompute @OpenRouterAI @SambaNovaAI runs 405B BF16 at 120t/s. Happy to give you an endpoint if you wanna add to the eval list.
0
0
2
@aton2006
Anton McGonnell
7 months
@gazorp5 @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok It isn't closed source; it is leveraging open source entirely, we just haven't shared how yet, but that is coming. Also, we do compare to the closed-source models on Alpaca - we are very confident that we will climb much higher in upcoming releases.
1
0
2
@aton2006
Anton McGonnell
9 years
@corrineheaney @Y_Influencers Need to do more at 3rd level + sell ourselves better; success there and apprenticeships etc will follow #YIVOTE
0
2
2
@aton2006
Anton McGonnell
5 months
@bindureddy @IntuitMachine Why don’t specialized models work? In what context?
1
0
2
@aton2006
Anton McGonnell
7 months
@erhartford @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok Wow, you seem like fun. I have explained above: we make chips with a large, fast, and programmable memory architecture. GPUs cannot do this at scale. If you think it can be done on other systems, please explain how. Until then, my stance remains.
0
0
1
@aton2006
Anton McGonnell
6 months
@cis_female @mplappert @dylan522p Which requires probably 12 GPUs minimum. So you have 11 idle GPUs in this scenario that are unusable until the job finishes. So you save on power but not on utilization. Solved problem on RDUs though :D
1
0
2
@aton2006
Anton McGonnell
1 month
@CerebrasSystems Where is 405B?
0
0
2
@aton2006
Anton McGonnell
2 months
@appenz @semiDL @MLPerf This is exactly the problem with MLPerf. These are impractical benchmarks, designed for Nvidia, that do not represent the real world. SambaNova, Groq, and Cerebras are all proving that real-time inference is something GPUs cannot do well, and now the question is which of the
2
0
2
@aton2006
Anton McGonnell
9 years
@corrineheaney Like to see more of the 75% do apprenticeships which may suit them more. 30k in debt, degree in Sport, no prospects #YIVOTE
1
0
2
@aton2006
Anton McGonnell
10 years
Tweet media one
0
1
2
@aton2006
Anton McGonnell
7 months
@AndrewYNg @RichardAGetz @SelfInfinity @groq Running lots of models together in a single system with the ability to generate tokens very quickly and dynamically pipeline them together is what is needed. This is what SambaNova does uniquely.
0
0
2
@aton2006
Anton McGonnell
6 months
@dylan522p @cis_female @mplappert Yeah, I was replying that doing that with one GPU is not feasible, as you need to load everything into HBM; so you need the GPUs for memory, and thus may as well also use them for compute.
0
0
2
@aton2006
Anton McGonnell
7 months
@appenz @swayambhoo But anything that involves harnessing the collective power of lots of expert models like multi-agents, ensembles, model merging etc is where our systems have a huge advantage and why we want to push to community in this direction.
0
0
2
@aton2006
Anton McGonnell
5 months
@sundeep @GroqInc Hey @sundeep - what precision are you guys storing and activating weights at?
1
0
2
@aton2006
Anton McGonnell
11 years
Quarter final of All-Ireland is a nothing game. Mark Sidebottom never ceases to amaze. #stoptalking
1
4
2
@aton2006
Anton McGonnell
7 months
@AlexanderDerve @SambaNovaAI @DbrxMosaicAI @databricks @MistralAI @grok This is built with open source; we will explain how soon. We very much want to get this onto Chatbot Arena. CC @lmsysorg
0
0
2
@aton2006
Anton McGonnell
7 months
@IntuitMachine @SambaNovaAI Expert models on a single node and with crazy fast speed, and we will get much faster. We make our own chips!
0
0
1
@aton2006
Anton McGonnell
5 months
@mag_pl @altryne @SambaNovaAI @GroqInc @VentureBeat @ArtificialAnlys Artificial Analysis benchmarks are awesome and are helping everyone discover the ground truth, but they don’t capture everything yet. Groq absolutely quantize - it is publicly stated in multiple places, and demonstrable by how their Llama3 8B performs worse than it does on SambaNova, Together, etc.
0
0
2