by popular demand,
here's a diagram and primer of the Homoiconic LLM approach [1] (figures adapted from the Transfusion paper [2])
The "Data" portion of an LLM are vector inputs/outputs
The "Computation" portion of an LLM are the Linear layers' matrix weights, frozen after
@wildbarestepf
classical liberal values:
- property rights
- freedom of speech/expression
- individual liberty
- equality under the law
- a law evenly and justly applied
- consent of the governed
- religious tolerance
when did these evaporate?
@VictorTaelin
Claude Opus just got it on the first try (pasted GH gist verbatim)
But I agree with the spirit of this, and I'm adding string/graph rewriting to
@pmddomingos
1. do RNNs for a while
2. add attention
3. Attention is all you need
4. Drop RNNs, do Transformers
5. add serial reasoning
6. serial reasoning is all you need
7. do RNNs again
(secret 8th step, add homoiconicity)
If you plot a histogram of the hidden activations of an LLM (qwen 1.5B in this case), they're indistinguishable from a normal distribution
not sure what to make of that; feels like an AGI would have weirder distributions tho (at least multimodal?)
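if you want to reproduce it, a minimal sketch of the measurement (the exact checkpoint id and layer choice here are illustrative):
```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-1.5B"  # illustrative qwen 1.5B checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("Once upon a time", return_tensors="pt").input_ids
with torch.no_grad():
    hs = model(ids, output_hidden_states=True).hidden_states  # per-layer (1, seq, dim)

acts = hs[len(hs) // 2].flatten().float()  # pick one mid-network layer
plt.hist(acts.numpy(), bins=200, density=True)
plt.show()
```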
@ESYudkowsky
not so dire, we'll have a web of trust, and digital signatures.
You may not know if a video is real, but you can know if other people with reputation vouch for it.
reputation there is key, the future is human!
followup:
you can add information to a NN without training by applying a LoRA-style update directly, but you have to be careful to scale the new low-rank weights appropriately
the previous version essentially scaled it at 1.0, but empirically, scaling around 1/sqrt(dim) looks more
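a minimal sketch of the scaled update (shapes, init, and rank are illustrative):
```python
import torch

dim, rank = 1536, 4
W = torch.randn(dim, dim) / dim**0.5   # existing linear weight
B = torch.randn(dim, rank)             # new low-rank "knowledge"
A = torch.randn(rank, dim)

# a scale of ~1.0 lets B@A swamp the base weights;
# ~1/sqrt(dim) keeps the update on the same footing as W
alpha = 1.0 / dim**0.5
W_patched = W + alpha * (B @ A)
```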
Here's an interesting, simple experiment in methodically updating the weights of an MLP/FFNN to contain new information, directly, without training.
The MLP is a simple `y=down(up(x).relu())`
So say we want to store k and v, so that any time the network has an input like k, it
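a minimal sketch of the trick, assuming a unit-norm k and a dedicated hidden "slot" (the simplest version of the idea, not necessarily the exact method in the pic):
```python
import torch

dim, hidden = 64, 256
up = torch.randn(hidden, dim) / dim**0.5
down = torch.randn(dim, hidden) / hidden**0.5

def mlp(x):  # y = down(up(x).relu())
    return down @ torch.relu(up @ x)

k = torch.randn(dim); k = k / k.norm()  # key (unit norm)
v = torch.randn(dim)                    # value to store

slot, beta = 0, 10.0
up[slot].zero_(); down[:, slot].zero_()
baseline = mlp(k)                       # output before the edit

up[slot] = beta * k                     # slot fires ~beta when x ≈ k
down[:, slot] = v / beta                # and emits v when it fires

print(torch.allclose(mlp(k) - baseline, v, atol=1e-4))  # True
```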
@tsarnick
meta: i don't think we need to solve this. we just need local agi in everyone's control, and via the marketplace of decisions, people will "vote" on the important values.
@AISafetyMemes
both sides are speaking past each other
LLMs can "reason" by interpolating their training (which has human reasoning baked into it)
"Reasoning" though typically means principle-based processing that allow you to extrapolate
homoiconic ai update:
i've finally got everything wired together and training!
inp: "Once upon a time, ^Q||^K||^V||, cool!"
QKV are metatokens, and the next 2 embeddings (|) get interpreted as low-rank matrices that modify the linear weights of the underlying transformer itself
progress report on "Homoiconic AI":
we use a hypernet to generate the weights of an autoencoder, and then do in-context learning (masked reconstruction loss) to improve those weights
val loss is 0.05 vs 0.1, so, the "homoiconic" version is doing interesting things
LLMs next,
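for the curious, the skeleton of the setup (tiny illustrative sizes; the real nets are bigger):
```python
import torch
import torch.nn as nn

d, h, z_dim = 32, 8, 16
n_w = d * h + h * d                  # flat weight count for a tiny AE
hyper = nn.Linear(z_dim, n_w)        # the hypernet

def autoencode(z, x):
    w = hyper(z)                     # generate the AE's weights from z
    W_enc = w[: d * h].view(h, d)
    W_dec = w[d * h:].view(d, h)
    return torch.relu(x @ W_enc.T) @ W_dec.T

x = torch.randn(4, d)
z = torch.randn(z_dim)
mask = (torch.rand_like(x) > 0.25).float()
recon = autoencode(z, x * mask)                  # reconstruct from masked input
loss = (((recon - x) ** 2) * (1 - mask)).mean()  # masked reconstruction loss
```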
my dear chat, let me show you the neural magic called "test time training"
you can do SGD within SGD!
it lets your model do gradient descent on the *current context window*, like an instantaneous finetune, that the outer loop can harness
tinydemo:
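the gist isn't inlined here, but the skeleton looks like this (toy net, illustrative names/shapes): the inner loop takes a few differentiable SGD steps on the context, and the outer loop backprops through them:
```python
import torch
from torch.func import functional_call

net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
params = dict(net.named_parameters())

def loss_fn(p, x, y):
    return ((functional_call(net, p, (x,)) - y) ** 2).mean()

def inner_sgd(p, x, y, steps=3, lr=0.1):
    # create_graph=True keeps the inner updates in the autograd graph,
    # so the outer loop can backprop *through* the inner optimization
    for _ in range(steps):
        g = torch.autograd.grad(loss_fn(p, x, y), list(p.values()), create_graph=True)
        p = {k: v - lr * gi for (k, v), gi in zip(p.items(), g)}
    return p

x_ctx, y_ctx = torch.randn(16, 8), torch.randn(16, 1)   # the "current context window"
x_qry, y_qry = torch.randn(4, 8), torch.randn(4, 1)

adapted = inner_sgd(params, x_ctx, y_ctx)               # instantaneous finetune
loss_fn(adapted, x_qry, y_qry).backward()               # outer grads reach net's params
```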
@BernardJBaars
Attention schema theory explains it well
- attention is an unconscious process for filtering stimuli
- recursive modeling of that process is awareness. AST sets awareness and consciousness as equivalent
I suspect working mem doesn't get enough credit in this formulation
@tsarnick
says the guy who backdoored twitter for the last 5-10 years, so we could be force-fed fake content, and not know what's real anymore?
turns out he actually *is* an authority on this topic
Does the Llama 3 paper say why the architecture is sooo vanilla? (merely basic transformers!?)
It's amazing how many architecture innovations Meta has published, but then they chose to go the super simple route. Whyyy??
@sporadicalia
Think about that golden capstone. It wasn't repurposed or melted down, it couldn't have been, too important, and if it were stone, not a useful geometry.
It still exists somewhere, perhaps in someone's private mansion museum.
Follow up: LLM hidden activations look to be gaussian noise, but when I project my dataset's 10k*1536-dim vectors to 2D, there's interesting structure within and across layers.
layer 4 looks like cell apoptosis
i'll do LLE projections tomorrow
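for reference, the projection step is just this (stand-in data; the real matrix is the per-layer hidden states collected as in the earlier sketch):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

H = np.random.randn(10_000, 1536).astype(np.float32)  # stand-in for one layer's activations

xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(H)
plt.scatter(xy[:, 0], xy[:, 1], s=1)
plt.show()
```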
some high profile ppl have said recently that all finite Turing Machines are encodable as Finite State Machines, therefore they're equivalent
obviously
but this misses the enormous point that they scale differently with time and problem size
this really matters for AI vs AGI🧵
do ppl like lame progress updates?
I have a novel neuralstack architecture that extrapolates incredibly
and extended the method to a neuralqueue...
it learns to always only dequeue, and to be super uncertain about what token to output :(
can an AI learn to reason?
here's an incremental improvement in the world of neuralstacks. the NN learns to solve a problem, and simultaneously use a neuralstack.
the kicker is the test set is 2x longer and uses toks the NN has *never* seen
i'll continue to prove this is reasoning
@VictorTaelin
alignment research should not be about how to control the AI,
but how to find win-wins in a world where everyone's empowered by AI, and AI will help us navigate that game theoretic landscape.
The alt, top-down tyrannical control, is the only threat you should be modelling
@vikhyatk
"It appears this photo has a man or a woman, with background details that don't actually exist, which really evokes a mood that's irrelevant."
@AISafetyMemes
humans process in 2 modes: habit, and conscious
the conscious mode allows you to process things in a more principle-based way, so I agree, AGI will be doing what humans do
but AI currently is more like habit-mode, miming things that are statistically correlated
both have a place btw
@VictorTaelin
I won't develop a GPT prompt for this bc non-infinite GPTs will never solve the general case of this, but I will develop an AI for this (it's what we're working on)
Homoiconic AI update
this project is about allowing a network to generate/execute *its own weights*
early experiments were promising so now I'm threading this ability through an LLM
to give a taste, here's an MLP block where I'm allowing generated low-rank Ws to affect the fwd pass
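the pic isn't reproduced here; a minimal sketch of the shape of it (the (A, B) interface is illustrative):
```python
import torch
import torch.nn as nn

class ModulatedMLP(nn.Module):
    # an MLP block whose up-projection can be nudged by externally
    # generated low-rank matrices A (rank, dim) and B (hidden, rank)
    def __init__(self, dim, hidden):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x, A=None, B=None):
        h = self.up(x)
        if A is not None:
            # generated low-rank detour, scaled ~1/sqrt(dim) per the LoRA followup above
            h = h + (x @ A.T) @ B.T / A.size(1) ** 0.5
        return self.down(torch.relu(h))
```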
@francoisfleuret
my oss agi project's definition:
"R. is the ability to derive true (or self-consistent) statements that you never learned. This is accomplished by processing principles instead of evidence."
Test this in the small by holding out known truths, and seeing if they can be derived
high precision mechanisms from low precision parts via elastic averaging, the Chinese remainder theorem, and the linearity of springs under small displacements
ScienceGPT is getting trained on 30T science tokens!?
This will be huge.
I know there have been smaller attempts so far, on like the Arxiv stack. Does anyone know of a Llama or Mistral that's been "science tuned"?
What is ScienceGPT?
Formerly called AuroraGPT, ScienceGPT is a planned one-trillion-parameter AI model that aims to revolutionize scientific research in fields like biology and climate science.
Developed at Argonne National Laboratory, it's training on 30T tokens of data using the
@deliprao
It's also illegal to distribute weights that can be *fine-tuned* into causing harm.
Don't get me in trouble for this:
torch.randn(1024, 1024)
homoiconic ai progress report:
- architecture is wired up (train+inference)
- metatokens representing model weights are generated, parsed, and applied as low rank Ws to qwen's linear layers
- squashed endless sneaky bugs
training is unstable, tryna solve it; latest:
why am I working on opensource AGI?
one reason is that if enough ppl are hyperproductive, war and fighting cost *that much more* and incentives will naturally shift from zero-sum-taking, to positive-sum-building
I want 1 trillion humans living throughout the solar system
Are any opensource LLMs trained with pause tokens?
"pause/thinking tokens" allow an LLM to "think" about a problem without getting penalized in the loss. LLMs have to predict the next token, but if trained w pause toks, they can think for a bit before emitting the next tok.
@tsarnick
but also, a lot of humans who have ever lived are alive today,
and because of exponential technological growth, a lot of humans who came before were also just in time for some huge innovation
so in another sense, not sooo crazy that we should be here for this one
I never needed "Projects" before in Sonnet, but, super grateful for it today
I was studying the `torch.fx` api, for which there is only sparse data and code in the wild, and it's pretty under-documented
Sonnet started out giving crap answers
"Projects" is like prompt
Hopfield nets were the past, but I suspect they could also be the future
They can hold tremendous amounts of data, and be updated online. This would be a pretty cool architecture to simulate hippocampal short-term learning (and may be how the hippocampus actually works).
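the classic binary version is tiny, and the online update is one outer product (a sketch; modern continuous Hopfield nets differ):
```python
import torch

def store(P):                      # P: (n, d) patterns with ±1 float entries
    W = (P.T @ P) / P.size(1)      # Hebbian weights
    W.fill_diagonal_(0)
    return W

def add_pattern(W, p):             # online: learn one more pattern in O(d^2)
    W = W + torch.outer(p, p) / p.numel()
    W.fill_diagonal_(0)
    return W

def recall(W, x, steps=10):        # iterate to the nearest stored attractor
    for _ in range(steps):
        x = torch.sign(W @ x)
    return x
```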
"Turing Machines are equivalent to Finite State Machines in non infinite variants"
Bc this matters to Reasoning AI, I'd like to correct this
1. A program to calculate the next prime is the same-sized program no matter the input. An FSM grows super-exponentially for larger inputs
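concretely, the program below doesn't grow when the input does, while the equivalent FSM must encode every reachable input in its state set:
```python
from math import isqrt

def next_prime(n):
    # the same fixed-size program whether n is 10 or 10**100
    def is_prime(m):
        return m > 1 and all(m % i for i in range(2, isqrt(m) + 1))
    n += 1
    while not is_prime(n):
        n += 1
    return n

print(next_prime(10))  # 11
```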
followup: I was trying to eat the whole elephant
I used a dataset with very large samples, including ~200 metatokens per sample, and I masked out all loss except for the final 2 tokens, hoping for that signal to backprop through... everything
with smaller samples, and
Can you help me stabilize and speed up training?
The following piece of my architecture is pretty sensitive to:
- initialization
- learning rate
- batch size
- data quantity
- position/usage of LayerNorm in the module
All I need is a stable way of projecting a vector to a
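for concreteness, the generic shape of what I'm fighting with, with the usual stabilizers (pre-LayerNorm, small output init); illustrative, not the actual module:
```python
import torch.nn as nn

class Proj(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.norm = nn.LayerNorm(d_in)                        # pre-norm tames input scale
        self.proj = nn.Linear(d_in, d_out)
        nn.init.normal_(self.proj.weight, std=d_in ** -0.5)   # small init
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return self.proj(self.norm(x))
```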
"AI can't plan"
I'm hot on metalearning rn, and it seems to me like "planning" could just be iterating until the inner-loop training reaches a satisfactory loss.
That's planning, or similarly, search. yeah?
In a transformer, just spill tokens until the loss is small enuf
amazing! I wrote about this exact example in
glad to see it's solved! Now, make it a multi-hop prompt:
"Take the uniform one needs to traverse the vacuum of space, and have someone wearing that. That person is conveying, on the opposite side of their
progress report: I've got near perfect accuracy on a held-out ARC-like (1D) puzzle.
It's simple stuff, just pixel translation, where the translation distance is what was held out. Smooth loss, validation acc hugs training acc. Feels good
ARC-like synth data, but 1D
inspired by:
pic shows 1D input-output pairs interleaved
if u want to collab on adding tasks i'll add u,
and if ppl like this, I can opensource
@nearcyan
i've been solely focused on creating AGI for >a decade, and it's starting to get interesting
does that count?
i've opensourced it to get more people involved, and will have a v significant update in abt a week, where i think ppl will start to want to use it
what does "reasoning" mean wrt AI?
imo "reasoning" is the ability to know true things without having learned them.
it is building knowledge/predictions/retrodictions/actions atop *principles*, instead of *evidence*.
Mind upload technique, GAN of 2 networks:
- Generative "you" network (agi) acts like you
- Discriminator network (agi) tries to distinguish bio vs silicon "you"
- loss function reduces discrepancy between 2 yous
If there's no distinction between silicon/bio you, you're uploaded!
@Allsdolllapp
@Liu_eroteme
what is recursive prompting? you mean like a human-in-the-loop? Or letting an LLM recursively prompt itself?
"homoiconic" is a term of art from the world of lisp that observes that the code of a program is an AST, and you can write programs to manipulate ASTs, and it's the same
bad idea of the day:
when developing ai archs, I like using the interpreter a lot, but then if some state is wrapped in a fn, I have to sprinkle `breakpoint()` everywhere to check in on the state of vars
not anymore, just jam everything in the global namespace
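the whole trick is one line (toy forward pass for illustration):
```python
import torch

def forward_pass(x):
    h = torch.relu(x @ torch.randn(8, 16))
    y = h @ torch.randn(16, 4)
    globals().update(locals())  # the bad idea: dump every local into the module namespace
    return y

forward_pass(torch.randn(2, 8))
print(h.shape, y.shape)  # h and y now inspectable at the top level, no breakpoint()
```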
There are so many models of computation:
And so many kinds of Turing machines:
Are we really giving up this line of interrogation after Neural Turing Machines/Diffable Neural Computers failed to scale well?
@Yampeleg
I like pointing out that the brain runs on 20W (~an LED bulb) to run 100 Hz computations on incredibly redundant wetware.
Vs silicon chips run at 5 GHz in a completely noise-free fashion, so less need for redundancy.
We'll be inferencing, possibly training AGI on a phone
this "homoiconic ai" thing feels like "neural tool use", except the "tool" is an AI architecture; in this case, over the model's own weights
"neural tool use" is i think my own term, and refers to e2e diff'able architectures that use tools, ie a neural net
I'm doing metalearning/test-time-training/SGD-within-SGD, and made a weird observation
(gist: )
If you do SGD-within-SGD, you presumably repeat the weights so they can be trained on a per-batch-item basis. This costs memory + time ofc.
I'm finding that
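for context, the repeat-the-weights setup looks roughly like this (torch.func names; shapes illustrative):
```python
import torch
from torch.func import functional_call, vmap

net = torch.nn.Linear(8, 1)
params = dict(net.named_parameters())
B = 4  # batch items, each gets its own trainable copy of the weights

batched = {k: v.detach().unsqueeze(0).repeat(B, *[1] * v.dim()).requires_grad_()
           for k, v in params.items()}

def f(p, x):
    return functional_call(net, p, (x,))

x = torch.randn(B, 8)
y = vmap(f)(batched, x)  # each batch item runs under its own weight copy
loss = (y ** 2).mean()
loss.backward()          # per-item grads land on `batched`, ready for per-item SGD
```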
Can AI reason?
SD3 can't follow the simplest logical `NOT` operator
LLMs fake it better when they're interpolating well-covered domains that were prevalent during training
My mission attempts to solve this by making the steps of reasoning differentiable. plz continue to wish me luck
hypothesis:
AGI could be merely a multimodal llm,
with a new mode that can READ/WRITE WEIGHTS
merely as input/output embeddings
and in the sense of hypernetworks, those weights can be applied to the inputs
and also introspected ofc - they can be read or written
🧵1/n
A theoretical free market needs:
- infinite supply
- infinite demand
- perfect information
Since we don't have infinite anything, government aids the free market by breaking up monopolies - ie product supply isn't allowed to have a monopoly.
Unions make sense as a tool to aid
@fchollet
The proper definition of GI considers *principled* reasoning instead of *evidence-based*.
Ex: F=ma represents principles that allow you to reason far outside the dataset
Ex2: "faith can heal" is also a principled view
neither has to be true, but you need symbolic reasoning
instead of banning AI, why not just ban crime?
"it can be used for fraud". ok, let's ban fraud, not AI.
"it can be used to make WMDs." ok, let's ban making WMDs. (but also, that info is available online, so...)
"it can be used to manipulate elections." ok, let's ban *that*.
Here's the toy dataset I'm starting off my Homoiconic AI on
AI struggles with variable indirection/"multi-hop reasoning", so this should be a fun test
If anyone wants to play along, plz train your own arch against it!
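if the pic doesn't load, here's the flavor of sample I mean (this format is a stand-in, not the actual dataset's):
```python
import random, string

def sample():
    # bind a value, chain references through variables, then query the chain
    a, b, c = random.sample(string.ascii_lowercase, 3)
    v = random.randint(0, 9)
    prompt = f"{a}={v}. {b}={a}. {c}={b}. {c}=?"
    return prompt, str(v)

print(sample())  # e.g. ('q=7. m=q. d=m. d=?', '7')
```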
@cloneofsimo
metalearning/test time training is brilliant and clever. Underneath it all, it's a very simple proposition:
do SGD within SGD
that is, do SGD at test time within SGD at train time
posted a short gist earlier actually
my dear chat, let me show you the neural magic called "test time training"
you can do SGD within SGD!
it lets your model do gradient descent on the *current context window*, like an instantaneous finetune, that the outer loop can harness
tinydemo:
Transformer vs Neurallambda on a difficult toy problem; how much data is needed?
NLam: learns on extremely few examples, eg 20(!), and generalizes phenomenally. Train (dotted) and Test (solid) lines hug each other
T: memorizes small training set ok. Never generalizes.
Follow up: while LLM latent activations seem to follow a gaussian distribution, there is apparently more structure when you project to 2D (t-SNE/LLE)
this, the 1st layer, seems considerably more complex than downstream layers, i'll follow up l8r
orig:
If you plot a histogram of the hidden activations of an LLM (qwen 1.5B in this case), they're indistinguishable from a normal distribution
not sure what to make of that; feels like an AGI would have weirder distributions tho (at least multimodal?)
@SydSteyerhart
i went through the same phase,
now i think AI is the new UBI.
opensource, local AI levels the playing field of intelligence and productivity, and so accomplishes what UBI might be (charitably) intended to, but relationships remain consensual instead of compulsory
Hey editor geeks, what are your favorite tools?
Mine (in emacs, but im curious regardless of editor):
- avy-jump (jump to any char)
- multiple-cursors
- uniteai (llm in editor)
- in-editor python interpreter (run scripts, state is preserved/inspectable/rerunnable/mutateable)
@svpino
no one's defined reasoning, so please allow me:
"Reasoning is the ability to know true things that you have never learned. It is done by processing principles instead of evidence."
so no, LLMs have no principles, they're more like freestyle rappers saying what feels right
At its core, Neurallambda is about making these differentiable:
* datastructures (eg queues, trees, lambda calc)
* operations on them (eg push, swap, beta reduce)
This yields highly "interpretable" AI; it learns "programs" that we parse back out into human-readable form
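to make "differentiable push/pop" concrete, a minimal soft stack (in the spirit of the neuralstack papers; the real Neurallambda machinery is richer):
```python
import torch

def push(stack, ptr, v):
    # move the soft pointer up one slot, then write v there
    ptr = torch.roll(ptr, 1)
    stack = stack * (1 - ptr)[:, None] + torch.outer(ptr, v)
    return stack, ptr

def pop(stack, ptr):
    # read at the pointer, then move it back down
    v = ptr @ stack
    return stack, torch.roll(ptr, -1), v

depth, d = 8, 4
stack = torch.zeros(depth, d)
ptr = torch.zeros(depth); ptr[0] = 1.0

stack, ptr = push(stack, ptr, torch.tensor([1., 0., 0., 0.]))
stack, ptr = push(stack, ptr, torch.tensor([0., 1., 0., 0.]))
stack, ptr, top = pop(stack, ptr)  # top == [0, 1, 0, 0]; all ops are differentiable,
                                   # so a controller can learn *when* to push/pop
```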
How are people's weekends kicking off?
I'm still crunching on the "Homoiconic AI" thing (an architecture for reasoning) and working on:
- do SGD within SGD (metalearning)
- generate a network's weights from another network (hypernetworks)
- to stitch that together, I'm using