aliastasis

@aliastasis

Followers
21
Following
133
Statuses
68

Working on LLMs/VLMs in my spare time. Making money with climate tech. Currently getting a Master's in CS @ LMU

München, Bayern
Joined January 2018
@aliastasis
aliastasis
2 days
@jonasgeiping @Teknium1 @flowersslop @tomgoldsteincs But it's not the same memory cost, or is it? 32 recurrence steps don't increase the memory requirement, if I understood it correctly, but 32 CoT tokens do increase the memory requirement
1
0
0
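The memory question above can be made concrete with a rough back-of-the-envelope sketch. The model dimensions below are made-up assumptions, not numbers from the paper: each generated CoT token appends keys and values to the KV cache in every layer, whereas recurrence steps that only refine a fixed-size latent state reuse the same buffer, at least under the reading of the paper discussed in this thread.

```python
# Back-of-the-envelope comparison: extra memory for 32 generated CoT tokens
# vs. 32 latent recurrence steps over a fixed-size hidden state.
# All dimensions are assumed/hypothetical, not taken from the paper.

n_layers, n_heads, head_dim = 32, 32, 128   # assumed model shape
hidden_dim = n_heads * head_dim
bytes_per_elem = 2                          # fp16/bf16

def kv_cache_growth_bytes(extra_tokens: int) -> int:
    # each new token stores one key and one value vector per layer
    return extra_tokens * n_layers * n_heads * head_dim * 2 * bytes_per_elem

def latent_state_bytes(seq_len: int) -> int:
    # the latent state is one hidden vector per position, overwritten
    # in place across all recurrence steps, so it does not grow with depth
    return seq_len * hidden_dim * bytes_per_elem

print(f"32 CoT tokens:       +{kv_cache_growth_bytes(32) / 2**20:.1f} MiB KV cache")
print(f"32 recurrence steps: ~{latent_state_bytes(50) / 2**20:.1f} MiB latent state, constant in the step count")
```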
@aliastasis
aliastasis
2 days
I don't think so (but I only skimmed the paper). It looks like they pass the “latent state” multiple times through the recurrent part. So when the current sequence has 50 tokens, they feed the latent state back as input and the sequence length stays the same, so to speak (but correct me if I’m wrong), until they generate the next token after the recurrence has finished. To me this seems more like improving the Transformer/GPT architecture than “test-time scaling”, as I don’t think it will scale the same way as R1/Ox test-time scaling (in the figure the accuracy gains flatten out). But one could combine them, I guess
0
0
4
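A minimal sketch of the decoding loop as described above, assuming that reading is correct (hypothetical module names; a GRU cell stands in for the paper's transformer-style recurrent block): the latent state is refined for a fixed number of recurrence steps while the sequence length stays put, and only then is the next token appended.

```python
# Toy sketch of "recurrence, then emit one token" (not the paper's code;
# the GRU cell is a stand-in for the recurrent transformer block).
import torch
import torch.nn as nn

hidden_dim, vocab_size, n_recurrence_steps = 256, 1000, 32

prelude = nn.Embedding(vocab_size, hidden_dim)  # tokens -> hidden states
core    = nn.GRUCell(hidden_dim, hidden_dim)    # recurrent block stand-in
coda    = nn.Linear(hidden_dim, vocab_size)     # latent state -> next-token logits

tokens = torch.randint(0, vocab_size, (50,))    # current 50-token sequence

with torch.no_grad():
    for _ in range(10):                                # generate 10 more tokens
        h = prelude(tokens).mean(dim=0, keepdim=True)  # crude sequence summary, (1, hidden_dim)
        z = torch.zeros(1, hidden_dim)                 # fresh latent state
        for _ in range(n_recurrence_steps):            # recurrence: sequence length unchanged
            z = core(h, z)                             # refine the latent state in place
        next_token = coda(z).argmax(dim=-1)            # only now does the sequence grow by one
        tokens = torch.cat([tokens, next_token])
```

The point of the sketch is only the shape behaviour: the inner loop never appends to `tokens`, so the stored context would not grow with the number of recurrence steps.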
@aliastasis
aliastasis
4 days
Probably because it’s heavily trained on problems to solve, so the “mode” is that it always assumes you will provide an actual task/question when your prefix is one which indicates that you’re going to state the task afterwards (so “I have a question for you” indicates/implies that a question will follow, etc.)
0
0
2
@aliastasis
aliastasis
4 days
@kimmonismus So it’s basically not important whether these models really reach the generalisation power of humans, as we can just generate enough densely sampled tasks to cover the “software engineering” distribution
0
0
2
@aliastasis
aliastasis
4 days
I could imagine that the performance per parameter could also increase dramatically with this (but also for current pre-trained LLMs fine-tuned this way), as the parameters can now be used mainly for reasoning and not for storing lots of (unnecessary) knowledge
0
0
0
@aliastasis
aliastasis
4 days
@basedjensen A 5080? Cringe, I expect a 5090 from my gf
0
0
2
@aliastasis
aliastasis
5 days
@kimmonismus For example:
0
0
0
@aliastasis
aliastasis
5 days
@kimmonismus Also here, as Yoshua Bengio stated recently: when a company reaches AGI, they probably won’t release it or talk about it. They will use it to build companies which compete with the rest of the world…
0
0
0
@aliastasis
aliastasis
5 days
@kimmonismus I mean, the thing with infrastructure is that it costs enormous amounts of money. And the ones who get the money are the ones who have proved that they can build SOTA models with it. I think we need the same here, just to prove that it’s worth the investment here in Germany/the EU
0
0
1
@aliastasis
aliastasis
5 days
And yes, I know 150 H100/A100 (it’s actually 120 H100s and 20 A100s, so 140 in total, not 150; I looked it up again just now) are not that much, but I guess one first needs to “prove” (to get large amounts of funding) that a team is able to eventually compete with OAI, and for such things it could help (by first training smaller models as a POC).
0
0
0
@aliastasis
aliastasis
5 days
@rasbt But yeah, it would be interesting to see how alternatives to these “dominant” tokens would perform
0
0
0
@aliastasis
aliastasis
6 days
@kimmonismus I think the idea of somehow coming together as a community is great. I miss that here too and would be in!
1
0
4
@aliastasis
aliastasis
6 days
Data protection aspects are, I think, also a big point (if not already mentioned), or rather the legal uncertainty around training data and also synthetic data from pretrained LLMs (what if these were trained, among other things, on data that would not have been so easy to use under German law -> are you then also not allowed to generate synthetic data with these models and use it commercially?). But this probably falls under the bureaucracy category anyway.
0
0
3