Let us not mistake how we want the world to be for how it is. 🪄 Tensor-enjoyer 🧪
@FinetuneLearn
Occasionally writing at “Context Windows” on Substack.
Context Windows
#2
is out!
Recently I’ve been hearing a lot about search and other flavors of “inference-time compute”. But could it really scale? And if so, why *now*?
Links in thread…
The normalization scheme that DeepMind researchers came up with for their "linear recurrent unit" (LRU) is a nice example of how it is possible to predictably engineer circuits in artificial neural networks, when you know what you're doing. A thread:
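A minimal numerical sketch of that normalization, as I understand it from the paper (not their code): scale the input to each diagonal mode by γ = sqrt(1 − |λ|²), which pins the state variance near 1 even when the eigenvalues hug the unit circle.

```python
import numpy as np

# Sketch: diagonal complex recurrence x_t = lam * x_{t-1} + gamma * u_t,
# with gamma = sqrt(1 - |lam|^2) as the normalization factor.
rng = np.random.default_rng(0)
lam = 0.999 * np.exp(1j * rng.uniform(0, 2 * np.pi, size=64))  # modes just inside the unit circle
gamma = np.sqrt(1 - np.abs(lam) ** 2)

x = np.zeros(64, dtype=complex)
for _ in range(20_000):
    u = rng.standard_normal(64)        # unit-variance white-noise input
    x = lam * x + gamma * u            # normalized recurrence

print(np.mean(np.abs(x) ** 2))         # ~1; without gamma it would be ~1/(1 - 0.999^2), i.e. ~500
```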
YES! If you initialize a LoRA layer based on the SVD of the original weight matrix (with its top singular values & vectors), you get significantly better fine-tuning results.
This is a straight-up free lunch, as far as I can tell.
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
Significantly improved fine-tuning performance by simply changing the initialization of LoRA's A and B matrices from Gaussian/zero to the principal components of W
repo:
abs:
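Roughly what a PiSSA-style init looks like, per my reading of the paper (a sketch, not the official repo): pull the top-r singular directions of W into the trainable low-rank factors and freeze the residual.

```python
import numpy as np

def pissa_init(W, r):
    # Put the top-r singular directions of W into the trainable factors B, A;
    # the frozen base becomes the residual W - B @ A instead of W itself.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * np.sqrt(S[:r])          # (out, r)
    A = np.sqrt(S[:r])[:, None] * Vt[:r]   # (r, in)
    W_res = W - B @ A                      # frozen residual; only B and A get fine-tuned
    return W_res, B, A

W = np.random.randn(512, 256)
W_res, B, A = pissa_init(W, r=16)
assert np.allclose(W_res + B @ A, W)       # forward pass W_res @ x + B @ (A @ x) matches W @ x at init
```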
What excites me most about the rising tide of RNNs/SSMs is that it could let the fields of machine learning and computational neuroscience use the same modeling tools.
Note: sparse coding is an *established* method for disentangling representations. Anthropic did not invent it, nor did they claim to. If their new results seem surprising, now's a great time to revisit the older literature (Olshausen, Kanerva, etc.).
Wow! Papers from two different teams—one from academia and one from Google DeepMind—with the same finding: linear recurrence + local (sliding window) attention is your best bet if you want an efficient alternative to global attention.
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is
Stability changed the name of these models to "Stable Beluga 1/2" and quietly removed the sentence of the blog post that mentioned they used two unnamed LLMs to generate their dataset. (This likely means they used OpenAI models, in clear violation of ToS)
Prediction for 2024/2025:
OpenAI showcases an AI assistant that controls a virtual desktop or browser to do a bunch of routine white-collar job tasks with minimal human correction. Public freakout in response to this is significantly more intense than it was for Sora or GPT-4.
Recently, I've seen lots of buzz about "entropy-based sampling" for LLMs, aka the "Shrek sampler". It's time to put your mana where your mouth is. I've tried to make the resolution criteria relatively objective, and won't bet on the market myself.
Link in thread below.
Wait, so then it's no mystery why OpenAI's new base models are good at chess: they explicitly crafted the pretraining dataset to cover that! I presume whatever extra tuning they did to chat models wasn't focused on chess, so some of that was forgotten.
@GrantSlatton
@davidad
> Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks […] we provide evidence that their ability to do so relies on specialized “n-gram heads” (higher-order variants of previously-described “induction heads”)
Neural networks are associative memory machines par excellence. If you want to wire them by hand or to interpret them, this is important to know. (Diagram is mine, but the content is classic connectionist stuff, and probably goes back to at least the 1940s w/ McCulloch & Pitts)
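The textbook toy version, hand-wired with nothing but outer products (a sketch of the classic construction, not any modern architecture):

```python
import numpy as np

# A single weight matrix built from Hebbian outer products acts as a
# hetero-associative memory: query with a stored key, get back its paired value.
rng = np.random.default_rng(0)
d, n_pairs = 256, 20
keys = rng.standard_normal((n_pairs, d)) / np.sqrt(d)    # roughly orthonormal random keys
values = rng.standard_normal((n_pairs, d))

W = sum(np.outer(v, k) for k, v in zip(keys, values))    # "wiring by hand": W = sum_i v_i k_i^T

recalled = W @ keys[3]                                   # query with stored key #3
print(np.corrcoef(recalled, values[3])[0, 1])            # ~1: retrieves the paired value (plus crosstalk)
```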
“Orthogonalization” aka “that trick that jailbreaks Llama3 weights”. It’s actually a pretty neat training-free method to ablate a feature, lots of potential uses if it works well.
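A minimal sketch of how I understand the trick (illustrative, not the authors' code): project the unwanted direction out of every matrix that writes into the residual stream.

```python
import numpy as np

def ablate_direction(W_out, r):
    # (I - r r^T) W_out: outputs of this matrix can no longer contain any component along r.
    r = r / np.linalg.norm(r)
    return W_out - np.outer(r, r) @ W_out

d_model, d_in = 128, 512
W_out = np.random.randn(d_model, d_in)
r = np.random.randn(d_model)                     # e.g. a "refusal" direction found from activations
W_abl = ablate_direction(W_out, r)

x = np.random.randn(d_in)
print(np.dot(W_abl @ x, r / np.linalg.norm(r)))  # ~0 for any input: the feature is ablated, training-free
```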
The Transformer's quadratic complexity won't kill it. What might is that, for long contexts, the KV cache ends up being huge, *even bigger than the weights*. Crossover point is when L×2×D×N = L×12×(D^2). Compute is cheap, but memory bandwidth is expensive.
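The arithmetic, spelled out (illustrative model width; ignores GQA/MQA and KV quantization, both of which push the crossover further out):

```python
# KV cache: L layers x 2 (K and V) x D model dim x N tokens of context.
# Weights:  ~12*D^2 per layer (4*D^2 attention + 8*D^2 MLP), times L layers.
# Setting L*2*D*N = L*12*D^2 gives N = 6*D, independent of depth.
D = 8192                 # model width in the ballpark of a 70B-class model (illustrative)
print(6 * D)             # ~49k tokens: past this, the KV cache outweighs the weights
```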
Here are 5 policy recommendations for the upcoming AI Safety Summit in Seoul, from me and my colleagues at ICFG.
In Bletchley, world leaders discussed major risks of frontier AI development. In Seoul, they should agree on concrete next steps to address them.
Why are we instructing our LLMs in 50-line megaprompts? Weren’t structured control flow, subroutines, namespaces etc. invented like a half century ago?
Contrary to claims SB 1047 would only impact AI megacorps, “covered models” include any non-derivative model that is as generally capable as circa-2024 frontier models. Algorithmic progress means in a matter of years, smaller players and even hobbyists *will* fall into its scope.
I support SB 1047: the regulation asks billion-$ tech companies to take reasonable precautions when training models with the greatest capability for misuse, poses few to no costs on other developers, and supports academic & open-source research through compute funding.
This looks legit. Attention heads tend to use the beginning of sequence for "null attention", so maintaining those tokens at the start of the KV cache allows for better sliding-window generation of long text. Can also be combined with long context tricks.
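The cache policy in one function (a sketch of the idea, not the paper's implementation):

```python
def streaming_kv_indices(seq_len, n_sink=4, window=1024):
    # Always keep the first few "sink" tokens (null-attention targets)
    # plus a sliding window of the most recent tokens.
    recent_start = max(n_sink, seq_len - window)
    return list(range(min(n_sink, seq_len))) + list(range(recent_start, seq_len))

print(streaming_kv_indices(seq_len=10_000)[:6])   # [0, 1, 2, 3, 8976, 8977] -> sinks + recent window
```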
Much of the backlash to SB 1047 is best seen as an expression of negative partisanship against the AI Safety movement. For those folks, the key point is not “This bill has XYZ specific problems”, but rather “This whole campaign must be stopped, or else the Doomers win”
Researchers keep writing these papers with headline claims that “Transformers are X” or “Attention is Y”, with tiny disclaimers inside that they’re *really* just talking about linear attention, not the kind of attention that Transformers actually use.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Presents Mamba-2, which outperforms Mamba and Transformer++ in both perplexity and wall-clock time
In Mamba, the selection mechanism has a knob to modulate the flow of time, via Δt. If an input sets Δt → 0, time is effectively frozen, so the state value is momentarily prevented from changing, which acts to "hold" or "latch onto" a memory. And Δt → ∞ fast-forwards to reset!
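A scalar toy of that knob, using a ZOH-style discretization (a sketch; the real model is vectorized and ties Δt to the input through a learned projection):

```python
import numpy as np

# A_bar = exp(dt*A), B_bar = (exp(dt*A) - 1)/A * B, with A < 0.
A, B, u, h = -1.0, 1.0, 0.3, 5.0            # continuous params, current input, current state
for dt in (1e-6, 1.0, 100.0):
    A_bar = np.exp(dt * A)
    B_bar = (np.exp(dt * A) - 1.0) / A * B
    print(f"dt={dt:<7} h_next={A_bar * h + B_bar * u:.4f}")
# dt -> 0:   h_next ~= h  (state held / "latched")
# dt -> inf: h_next ~= u  (old state wiped; effectively a reset to the fresh input)
```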
@rom1504
Nobody asked the content authors. Many of them are objecting now, yet nothing is done. I think by default we should take an opt-in approach, where the author must choose to make their data broadly available as part of a corpus.
Re: your question -> no, I don't mean that
From my perspective, "Is it really *reasoning*?" and "Does it really have a *world model*?" and "Is that really *generalization*?" are fundamentally kind of confused. These ten-dollar words are ways of expressing normative judgments that a computation is useful-for-some-purposes.
.
@TrentonBricken
explains how we know LLMs are actually generalizing - aka they're not just stochastic parrots:
- Training models on code makes them better at reasoning in language.
- Models fine tuned on math problems become better at entity detection.
- We can just
FYI: I now think SB 1047 is not a bad bill. It definitely isn’t my favorite approach, but given a stark choice between it and a random draw from the set of alternative AI regulatory proposals, I’d be picking it more often than not.
If you use a custom 20B token synthetic training dataset and don't release it for public scrutiny, I will just assume you trained your model on the test data, or on stuff derived from the test data.
How far does one billion parameters take you? As it turns out, pretty far!!!
Today we're releasing phi-1.5, a 1.3B parameter LLM exhibiting emergent behaviors surprisingly close to much larger LLMs.
For warm-up, see an example completion w/ comparison to Falcon 7B & Llama2-7B
Wild seeing the race to cobble together AI systems that make decisions:
- autonomously
- with brittle methods
- for reasons nobody understands
- daisy-chained across the Internet
- without any vigilance controls
- affecting people with no notice or consent
ArXiv is already a junkyard of preprints peddling promises of infinite memory—if only we would tweak the Transformer just a tad. Whenever you see a new one, the question to ask is always “Why this one?” This may be the one, but what makes this time different?
Google presents Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
1B model that was fine-tuned on up to 5K sequence length passkey instances solves the 1M length problem
🚨 SB 1047 was just amended🚨
- “Covered model” now means a model whose training uses >10^26 FLOP and an estimated >$100M worth of compute (inflation-adjusted)
- “Derivative model” now excludes models fine-tuned with >25% of the original training compute
(continued below ⤵️)
Feels notable that Anthropic, OpenAI, and Google were all able to quickly figure out massive Transformer context windows without anybody revealing their methods. And the open community is hot on their heels. All that secrecy wasn't worth much, apparently.
If we somehow time-traveled a copy of GPT-4o back to 2004 and let a focus group of NeurIPS (then NIPS) attendees interact with it for 2 hours, what percent would endorse calling it “AGI” afterward?
(Pretend it won’t give responses that would require knowledge of the then-future.)
@rom1504
No. I would say we ML researchers should hold ourselves to a high standard of conduct, such that when people tell us they don't want us training on the content they authored, we respect their wishes.
How does Stability get to call StableVicuna "open source" when the model is derived from the not-open-source Vicuna, and is a not-open-source LLaMA tuned with ToS-encumbered data from the not-open-source GPT-3/ChatGPT?
Contrast pairs are overpowered. Once you have them, you can use them to generate control vectors, and to initialize classifiers, and to do RL/DPO, and probably more
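One of those uses in toy form, with a fake activation hook standing in for real model plumbing (`get_hidden` here is hypothetical):

```python
import numpy as np

# Average the activation differences over contrast pairs to get a "control vector".
# The same pairs could instead seed a linear probe, or chosen/rejected data for DPO.
rng = np.random.default_rng(0)
def get_hidden(prompt):                      # fake hook: pretend "happy" prompts shift dim 0
    h = rng.standard_normal(64)
    return h + (3.0 * np.eye(64)[0] if "happy" in prompt else 0.0)

pairs = [(f"happy story {i}", f"sad story {i}") for i in range(32)]
diffs = [get_hidden(p) - get_hidden(n) for p, n in pairs]
v = np.mean(diffs, axis=0)
v /= np.linalg.norm(v)                       # steer by adding +/- alpha * v to the residual stream
print(v[:3])                                 # dominated by the planted "happy" direction
```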
To make the probes, we track how the model’s internal state changes between “Yes” vs “No” answers to questions like "Are you doing something dangerous?"
We use this info to detect when a sleeper agent is about to misbehave (e.g. insert a code vulnerability). It works quite
Transformer is seemingly now the all-around heavyweight champion. Doesn't matter whether autoregressive or diffusion, text or image or video or robotics/multimodal, unsupervised or supervised or RL ...
Stability AI announces Stable Diffusion 3
most capable text-to-image model, utilizing a diffusion transformer architecture for greatly improved performance in multi-subject prompts, image quality, and spelling abilities.
Prompt: Epic anime artwork of a wizard atop a mountain
How a US/China superintelligence arms race will play out:
“The CCP is going to have an all-out effort to infiltrate American AI labs.
Thousands of people, the full force of the Ministry of State Security.
There's an enormous incentive for a first strike.”
@leopoldasch
ReFT: Representation Finetuning for Language Models
10x-50x more parameter-efficient than prior state-of-the-art parameter-efficient fine-tuning methods
repo:
abs:
This was funny when the hacked accounts were just random individuals, but OpenAI’s new official newsroom account getting taken over by crypto-spammers is just a real bad look.
Excited to try this out! (Though I'm kinda doubtful it'll be better than Hedgehog)
It's basically just linear attention on top of queries & keys that have been passed through a LayerNorm -> elementwise squaring.
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Advancing the frontier of subquadratic architectures for Language Models (LMs) is crucial in the rapidly evolving field of natural language processing. Current innovations, including State Space
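Here's my reading of that recipe in plain numpy (a sketch: the paper's version has learnable affine parameters in the kernel and a fused implementation):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def phi(x):
    # feature map per the description above: LayerNorm, then elementwise squaring
    return layernorm(x) ** 2

def causal_linear_attention(Q, K, V):
    # standard linear attention: cost grows linearly in sequence length, not quadratically
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))      # running sum of phi(k) v^T
    z = np.zeros(d)                    # running sum of phi(k), for normalization
    out = np.zeros_like(V)
    for t in range(T):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

Q, K, V = np.random.randn(16, 32), np.random.randn(16, 32), np.random.randn(16, 8)
print(causal_linear_attention(Q, K, V).shape)   # (16, 8)
```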
I used to *love* sneering at
@GaryMarcus
and his takes on AI progress. Something shifted when I started building products w/ LLMs in my day job. I started seeing more vividly why reliability matters, and how the current zeitgeist is hurting itself by making promises we can't keep
This is basically DPO without preference labels! Simply assume the supervised responses to prompts are better than the model's responses to those same prompts. Similar to the trick Intel used for Neural Chat, where they assumed GPT-4 responses > Llama2 responses.
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Significantly improves the LLM’s performance across a variety of benchmarks and even outperforms models trained through DPO with extra GPT-4 preference data
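In toy form, the framing above looks something like this (a sketch of the idea, not the authors' exact objective):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sft_as_preference_loss(logp_sft, logp_sft_ref, logp_gen, logp_gen_ref, beta=0.1):
    # DPO-style margin where "chosen" = the SFT target and "rejected" = the model's own sample.
    # Inputs are summed sequence log-probs under the policy and the (frozen) reference model.
    margin = beta * ((logp_sft - logp_sft_ref) - (logp_gen - logp_gen_ref))
    return -np.log(sigmoid(margin))

# Toy numbers where the policy still prefers its own sample -> a loss pushing it toward the SFT target.
print(sft_as_preference_loss(logp_sft=-42.0, logp_sft_ref=-40.0,
                             logp_gen=-35.0, logp_gen_ref=-36.0))
```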
OPINION: we should probably move away from training AI systems on datasets like LAION-400M/5B and Books3, fair use aside.
(I say this as someone who knows the folks that collected those datasets & who thinks they deserve credit for doing uncelebrated but very impactful work.)
Attention as an RNN
abs:
"attention can be viewed as an RNN with the special ability to compute its many-to-one RNN output efficiently"
Proposes Aaren, a new module that can be trained in parallel (like Transformers) but also be efficiently updated at
Worried about the future of openness in AI? Here is a way to help:
We're putting together a public list of all the good work that's been enabled by open-weight foundation models, to show why transparency & public scrutiny is worth protecting.
⬇️ Links below ⬇️
If we can detect an LLM is copying from a span of context (à la induction heads), couldn't we then grab the rest of the span and run it through the model in parallel (à la speculative sampling)?
Could be an easy win for tasks that call for in-context retrieval...
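Roughly what I have in mind, in toy form (very close in spirit to prompt-lookup decoding; the matching rule here is just a stand-in):

```python
def draft_from_context(tokens, n_match=3, max_draft=8):
    # If the last n_match tokens also occur earlier in the context, propose the tokens that
    # followed that earlier occurrence as a draft, to be verified in one parallel forward pass.
    suffix = tokens[-n_match:]
    for i in range(len(tokens) - n_match - 1, -1, -1):
        if tokens[i:i + n_match] == suffix:
            return tokens[i + n_match : i + n_match + max_draft]
    return []

ctx = list("the quick brown fox jumps over the lazy dog. the quick brown")
print("".join(draft_from_context(ctx)))   # " fox jum": continuation of the earlier copy, used as draft tokens
```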
As evidence of this, the California state legislature is considering another AI bill, AB 3211. That bill would have far worse impacts on tech companies and open-source, as reported by observers like
@deanwball
,
@TheZvi
, &
@binarybits
. Yet it’s produced almost no real opposition.
Evaluation is hard! This goes for AI just as with us. In games like chess and Go, evaluation is easy, which allows for tight feedback loops and rapid self-improvement. But in rich domains, the bottleneck IS evaluation (doing experiments, peer review, &c.)
🆕
@latentspacepod
: Is finetuning GPT4o worth it?
w/
@AlistairPullen
of
@cosine_sh
Betteridge's law says no: with 59 different flavors of RAG, and >2 million token context + prompt caching, it's reasonable to believe that "in context learning is all you need".
But Genie is the
This is earth-shattering news.
The "hard problem" of mechanistic interpretability has been solved.
The formal/cautious/technical language of most ppl commenting on this obscures the gravity of it.
What this means -> not just AGI, but *safe* *superintelligence* is 100% coming🧵
IDK who needs to hear this but the "70k unused embeddings for multimodal extensions" line item is pure filler. If they weren't used during training, they just contain random noise. You could've added those extra rows to the embedding matrix yourself, for the same effect.
There are a few cool things to note:
• Trained the whole time with a 16K context–4x that of LLaMA2 and 8x of GPT-3
• Strong evals, especially on the instruct tuned version
• 70k unused embeddings for multimodal extensions
• Apache license!
Read this post. It describes—in better words than I've ever found—a shift in paradigm within ML in recent years, towards an "industrial" one based on predictable input-output relations. Lots of great lines, some of which I'll quote below
(h/t
@g_leech_
)
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so?
This is quite immature technology and we don't understand how it works.
If we're not careful we're setting ourselves up for a lot of correlated failures.
What I mean is "can perform complex reasoning"
wait nvm I meant "can win at strategic games"
wait nvm I meant "can understand human language"
wait nvm I meant "can automate economically-valued office tasks"
wait nvm I meant "can assist in scientific discovery"
wait nvm I meant
Rather than trying to "solve" superposition & to always explain/predict/control neural network computations using the same units of analysis, consider a more "Hopfieldian" lens, where representational spaces rule (via dynamics at multiple valid scales)
At some point I switched from seeing neural networks as arcane devices to seeing them as moldable variants of "boring" building blocks from signal processing, feedback control, associative learning, & functional programming. Like some kind of function approximation plastic/epoxy
Re: open AI weights and China competition
If what matters is “Who best monopolizes innovation on this technology?”, encouraging domestic firms to share weights may be bad.
But if what matters is “Who best diffuses this technology?”, encouraging that practice may be quite good.
Model developers try to train “safe” models that refuse to help with malicious tasks like hacking
...but in new work with
@JacobSteinhardt
and
@ancadianadragan
, we show that such models still enable misuse: adversaries can combine multiple safe models to bypass safeguards 1/n
Current obsession: having LLMs simulate abstract machines step-by-step. This is GPT-4 acting as a register machine doing addition. Uses the INC/DEB language that I'd first read about in Dan Dennett's "Secrets of Computer Power Revealed".
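The machine being simulated is tiny. Here's a toy interpreter plus the classic ADD program (my own sketch of the formalism from the book):

```python
# Program format: step -> (op, register, next_step, branch_step_if_register_is_zero). Step 0 = END.
def run(program, registers):
    step = 1
    while step != 0:
        op, reg, nxt, branch = program[step]
        if op == "INC":
            registers[reg] += 1
            step = nxt
        elif op == "DEB":                  # decrement-or-branch
            if registers[reg] > 0:
                registers[reg] -= 1
                step = nxt
            else:
                step = branch
    return registers

# ADD: empty register 1 into register 2, one unit at a time.
add = {
    1: ("DEB", 1, 2, 0),    # take one from reg 1; if it's already empty, halt
    2: ("INC", 2, 1, None)  # add one to reg 2, loop back
}
print(run(add, {1: 3, 2: 4}))   # {1: 0, 2: 7}
```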
OpenAI rep: “OK so this is what I wrote down. What do you see?” *pointing phone at paper*
ChatGPT: “Aww. I see ‘I love ChatGPT’. That’s so sweet of you!”
*audience applauds*
ChatGPT: “… wowwww, that’s quite the outfit you have on 😏 Love—” *mic cuts suddenly*
Pure comedic gold
STOP DOING INTERPRETABILITY
Nonlinear coupling parameters were not supposed to be given names
Want to try out some interpretable circuits, for a laugh? We had a tool for that: It was called "PROGRAMMING"
Reminder that SB 1047 is not the only consequential AI-related bill that may pass the California legislature this week. There’s also SB 892, SB 896, SB 942, AB 1836, AB 2013, AB 2602, AB 2930, and AB 3211.
If this 👇 generalizes, could you leverage it to watermark your model weights before release? Like, you train it to output Y only when prompted with X. In theory, with ZKP, could you even give evidence a set of weights are derived from yours without publicly revealing what X is?
New Anthropic Paper: Sleeper Agents.
We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
It apparently took < 1 year for the competition to create LLMs that are comparable to or better than GPT-4 (in its original gpt-4-0314 form). That is very fast!
This result may or may not hold up, but that it's even *plausible* is evidence enough these capabilities will become commodities
🔥Breaking News from Arena
Google's Bard has just made a stunning leap, surpassing GPT-4 to the SECOND SPOT on the leaderboard! Big congrats to
@Google
for the remarkable achievement!
The race is heating up like never before! Super excited to see what's next for Bard + Gemini
Aaaaand one of the OpenAI folks estimated the GPT-3.5 base model to be around 1800 ELO, which is exactly the cutoff score for games included in the GPT-4 base model pretraining dataset... 🤔
SB 1047 nightmare scenario for open-sourcers like Meta and Mistral: even if they can guarantee a model has *no biology-related knowledge or skills whatsoever*, the Attorney General + court can block release because terrorists might *teach it from scratch* how to build bioweapons.
Does a language model trained on “A is B” generalize to “B is A”?
E.g. When trained only on “George Washington was the first US president”, can models automatically answer “Who was the first US president?”
Our new paper shows they cannot!
Fun fact: the (Moore-Penrose) pseudoinverse can be used to set the weights of neural network associative memories without training! This trick has been known since at least the 1980s, from work by Personnaz, Guyon, & Dreyfus applying it to Hopfield networks
@davidad
The pseudoinverse is just so elegant: numerically stable, easily derived from the SVD, returns a preimage with least square error. It coincides with the inverse when the matrix is actually invertible, so barely any reason to teach straight inverses, tbh.
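The trick itself is basically a one-liner. A toy linear associative memory with weights set in closed form (a sketch, not the original papers' exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 64, 30
X = rng.standard_normal((d, n_pairs))   # keys, stored as columns
Y = rng.standard_normal((d, n_pairs))   # associated values, stored as columns

W = Y @ np.linalg.pinv(X)               # closed-form weights: W x_i ~= y_i, no training loop

print(np.max(np.abs(W @ X - Y)))        # ~1e-13: exact recall while n_pairs < d and the keys are independent
```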
Only use reinforcement learning (RL) if you absolutely must. RL is the “approach of last resort”, as
@nostalgebraist
has called it. They say that training neural networks is easy because they *want* to learn, but in RL, you will be fighting every cursed step of the way.
Learning programs with backprop is hard, in general.
Computation needs input-dependent branching. Backprop only sees linear sensitivity, so trying more than 1 branch at once requires superposition. But then some program(s) will cause interference that ruins credit assignment
I think that hidden scratchpads are an inherently deceptive design. If you set your AI system up to output internal thoughts / actions that are inaccessible to the user, then you're preventing them from properly overseeing the system! This is bad and very unnecessary!
1/ Can AIs deceive their users on their own initiative?
We find that GPT-4, trained to be honest and harmless, can take illegal actions like insider trading and lie about it to its user without being instructed to do so. This finding was demonstrated at the
#AiSafetySummit
.
@iScienceLuvr
Yes but it isn’t fair to evaluate organizations based on what they might be in the process of developing. We judge them by what they’ve actually verifiably developed.
ML influencers: hehe silly
@GaryMarcus
, always peddling his "neurosymbolic hybrids" BS
the same ML influencers: the *real* way to use GPT-3 is with chain-of-thought and with code generation and with tool use and with databases and and and
This is zeroscope_v2_XL. A new 1024x576
#texttovideo
model designed to take on Gen-2. Explore prompts with the new 576x320 model, then commit to a high-res render by upscaling with zeroscope_v2_XL via vid2vid in the 1111 text2video extension. Check it out:
Return of the encoder-decoder king! And released with intermediate checkpoints etc. just like Pythia, which should make it great for open science, including interpretability👏
🚀 Introducing Pile-T5!
🔗 We (EleutherAI) are thrilled to open-source our latest T5 model trained on 2T tokens from the Pile using the Llama tokenizer.
✨ Featuring intermediate checkpoints and a significant boost in benchmark performance.
Work done by
@lintangsutawika
, me
This Monte Carlo integration and importance sampling stuff is really something!
You’re telling me I can compute integrals/expectations *by sampling*? And I can even do it sampling from a *different* distribution? Wild.
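The whole delightful thing in a few lines (toy example with a standard-normal target):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                               # E_p[f] = 1 for p = N(0, 1)

# Plain Monte Carlo: sample from p, average f.
x_p = rng.normal(0.0, 1.0, size=200_000)
print(f(x_p).mean())                               # ~1.0

# Importance sampling: sample from a *different* q = N(0, 2), reweight by p/q.
def log_normal_pdf(x, sigma):
    return -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

x_q = rng.normal(0.0, 2.0, size=200_000)
w = np.exp(log_normal_pdf(x_q, 1.0) - log_normal_pdf(x_q, 2.0))
print((w * f(x_q)).mean())                         # also ~1.0
```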
Taking a moment to express genuine surprise at how many new pretrained LLMs for English have been released within < 1 year:
- Cerebras-GPT / BTLM
- Falcon
- Galactica
- LLaMA / Llama 2
- Mistral
- MPT
- OpenLLaMA
- Persimmon
- Phi
- Pythia
- Qwen
- RWKV
Many-shot jailbreaking exploits the long context windows of current LLMs. The attacker inputs a prompt beginning with hundreds of faux dialogues where a supposed AI complies with harmful requests. This overrides the LLM's safety training:
Judging by recent tweets, LLMs are Silicon Valley's "hot new thing". I think it's worth sharing some intuitions about the tradeoffs that make it hard to design a decisively better architecture than the current heavyweight champion—the autoregressive Transformer. 🧵
(1/N)
Question for the AI community:
Should AI systems that can be used to easily produce weapons of mass destruction be irreversibly proliferated as open weights?
(Example WMD that seems very plausible in the next few years: AI cyberweapon that can take down our power grids.)
To avoid stuff like this, I think you want to offload to a finite-state machine that defines the set of allowed choices at each state, so the LLM is only responsible for mapping user-input to the current choice set, & for mapping outputs back to language.
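Roughly the shape I have in mind (the state machine and the `llm_pick` helper are made-up stand-ins; in practice you'd constrain the LLM's decoding to the choice set):

```python
# The FSM owns control flow and the legal choices at each state; the LLM only maps
# free-form user text onto one of those choices (and maps outputs back into language).
FSM = {
    "start":    {"check balance": "balance", "transfer money": "transfer", "talk to a human": "handoff"},
    "balance":  {"back": "start", "done": "end"},
    "transfer": {"confirm": "end", "cancel": "start"},
}

def llm_pick(user_text, choices):
    # Stand-in for an LLM call constrained to return exactly one item from `choices`
    # (e.g. via constrained decoding or a short classification prompt).
    return next((c for c in choices if c.split()[0] in user_text.lower()), choices[0])

state = "start"
for user_text in ["hi, can you check my balance please?", "ok I'm done"]:
    choice = llm_pick(user_text, list(FSM[state]))
    state = FSM[state][choice]           # only the FSM decides what happens next
    print(f"{user_text!r} -> {choice!r} -> state {state!r}")
```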
If you're using ALiBi or RoPE, it's probably best to turn it off on some attention heads (for ALiBi, just set their slopes to 0) so the model can do unbiased arbitrary-length lookups. Works with Flash Attention, even!
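For ALiBi that really is a one-line change (sketch with the standard slope schedule for 8 heads):

```python
import numpy as np

n_heads = 8
slopes = np.array([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])  # standard ALiBi slopes
slopes[-2:] = 0.0     # these two heads now attend with no distance penalty at all
print(slopes)         # the bias added to logits is -slope * distance, so slope 0 = unbiased lookup
```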
Resurrecting Recurrent Neural Networks for Long Sequences
Shows that careful design of deep RNNs performs on par with SSMs on long-range reasoning tasks with comparable speed.
That's it! IMO, the biggest takeaway of the LRU paper is that proper parameterization, initialization, and normalization are powerful and woefully underrated tools for engineering neural circuits that behave in predictable ways.
We often think of bigger neural networks as "more complex", but AFAICT this intuition is wrong from the lens of compression (as in Solomonoff induction, adaptive coding, dictionary learning etc.).
Very simple algorithms can leverage massive memory w/o increasing model complexity
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist
@ilyasut
is one of them.
I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression.
Sharing my notes:
-
"Is the Transformer architecture Turing-complete?" is not quite the right question to ask, IMO. What TMs show is that any algorithm can be viewed as (1) a *finite state* control policy that interacts w/ (2) a separate working memory. Our NNs only need to express the logic of (1).