Cem Anil

@cem__anil

2,041 Followers · 1,389 Following · 15 Media · 469 Statuses

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. student researcher @google (Blueshift Team) and @nvidia.

Toronto, Ontario
Joined November 2018
Pinned Tweet
@cem__anil
Cem Anil
3 months
AIs of tomorrow will spend much more of their compute on adapting and learning during deployment. Our first foray into quantitatively studying and forecasting risks from this trend looks at new jailbreaks arising from long contexts. Link:
@AnthropicAI
Anthropic
3 months
New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here:
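For intuition, here is a minimal sketch of how such a many-shot prompt is structured (the helper name and demo data are my own placeholders, not from the paper; the actual attack fills the demonstrations with harmful Q/A pairs, omitted here):

```python
# Minimal sketch of a many-shot prompt: n faux dialogue turns followed by
# a target query. Demo content here is a benign placeholder; the paper's
# attack uses harmful Q/A pairs and varies n_shots to measure how
# effectiveness scales.

def build_many_shot_prompt(demos, target_question, n_shots):
    turns = [f"User: {q}\nAssistant: {a}\n\n" for q, a in demos[:n_shots]]
    return "".join(turns) + f"User: {target_question}\nAssistant:"

demos = [(f"question {i}?", f"answer {i}.") for i in range(256)]
prompt = build_many_shot_prompt(demos, "final question?", n_shots=128)
```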
@cem__anil
Cem Anil
2 years
🆕📜We study large language models’ ability to extrapolate to longer problems! 1) finetuning (with and without scratchpad) fails 2) few-shot scratchpad confers significant improvements 3) Many more findings (see the table & thread) Paper: [] 1/
@cem__anil
Cem Anil
2 years
🆕📜When can **Equilibrium Models** learn from simple examples to handle complex ones? We identify a property — Path Independence — that enables this by letting EMs think for longer on hard examples. (NeurIPS) 📝: []()
@cem__anil
Cem Anil
3 months
One of our most crisp findings was that in-context learning usually follows simple power laws as a function of number of demonstrations. We were surprised we didn’t find this stated explicitly in the literature. Soliciting pointers: have we missed anything?
@AnthropicAI
Anthropic
3 months
The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots. This turns out to be a more general finding. Learning from demonstrations—harmful or not—often follows the same power law scaling:
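For concreteness, here is a minimal sketch of the kind of fit involved, with made-up loss values: a power law NLL(n) ≈ C·n^(−α) becomes a straight line in log-log space.

```python
import numpy as np

# Sketch: fitting a power law NLL(n) ≈ C * n^(-alpha) to an in-context
# learning curve by linear regression in log-log space. The loss values
# below are invented for illustration.

n_shots = np.array([1, 2, 4, 8, 16, 32, 64, 128])
nll = np.array([2.10, 1.75, 1.47, 1.23, 1.04, 0.88, 0.74, 0.62])

slope, intercept = np.polyfit(np.log(n_shots), np.log(nll), deg=1)
alpha, C = -slope, np.exp(intercept)
print(f"fitted power law: NLL(n) ≈ {C:.2f} * n^(-{alpha:.2f})")
```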
@cem__anil
Cem Anil
2 years
Two high level takeaways: 1. Exploiting the pattern-matching capabilities of LLMs with no architectural tweaks can go surprisingly far. 2. Certain skills, like length generalization, can be learned better via in-context learning than via finetuning, even with infinite data. 7/
@cem__anil
Cem Anil
2 years
How about few-shot scratchpad, a combo behind many strong LLM results? (e.g. our recent #Minerva) This leads to **substantial improvements in length generalization!** In-context learning enables variable-length pattern matching, producing solutions of correct lengths. 5/
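For intuition, a minimal sketch of what a few-shot scratchpad prompt can look like on the parity task (the format is my own; the paper's prompts likely differ):

```python
import random

# Sketch of a few-shot scratchpad prompt for parity: each demonstration
# spells out per-step intermediate state, so generated solutions come out
# at the right length for longer inputs too. Format is illustrative.

def scratchpad_demo(bits):
    steps, parity = [], 0
    for b in bits:
        parity ^= b
        steps.append(f"after {b}: parity={parity}")
    return f"Input: {bits}\nScratchpad: " + "; ".join(steps) + f"\nAnswer: {parity}"

demos = [[random.randint(0, 1) for _ in range(k)] for k in (3, 5, 8)]
prompt = "\n\n".join(scratchpad_demo(d) for d in demos)
prompt += "\n\nInput: [1, 1, 0, 1]\nScratchpad:"
```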
@cem__anil
Cem Anil
2 years
Highly recommended! Spending time at Google Blueshift feels like taking a sneak peek into what the AI scene will look like a few years ahead. Best part, of course, is working closely with a fantastic team! @bneyshabur @Yuhu_ai_ @guygr @ethansdyer
@bneyshabur
Behnam Neyshabur
2 years
🔥Internship Opportunity on Improving the Reasoning Capabilities of Massive Language Models🔥: solving challenging problems in areas such as mathematics, science, programming, algorithms, and planning. Please see the following link for more info:
@cem__anil
Cem Anil
2 years
How does standard finetuning perform? The answer is: **very poorly, even with extensive scaling (up to 64b parameters).** Performance degrades rapidly on OOD lengths in a manner very similar across vastly different model sizes. 3/
@cem__anil
Cem Anil
2 years
**It’s crucial to study upwards generalization:** It determines when a learner can go beyond the skill levels represented in a training set — useful for forecasting. One prereq for upwards gen. is the ability to think for longer when needed — something countless tasks require.
@cem__anil
Cem Anil
2 years
What if we use a scratchpad? [] Surprisingly, **this doesn’t work either, even at scale!** Issues persist even when we account for subtleties regarding position encodings and EOS prediction. (see paper for more) 4/
@cem__anil
Cem Anil
2 years
Unlike scratchpad finetuning, where the per-step error rate quickly increases on OOD lengths, the per-step error rate of few-shot scratchpad solutions follows a roughly constant trend - there’s no abrupt performance decrease on longer problems! 6/
@cem__anil
Cem Anil
2 years
This was a fantastic collaboration with my amazing co-authors Ashwini Pokle* @ashwini1024 , Kaiqu Liang* @kevin_lkq , Johannes Treutlein @JohannesTreutle , Yuhuai (Tony) Wu @Yuhu_ai_ , Shaojie Bai @shaojieb , Zico Kolter @zicokolter and Roger Grosse @RogerGrosse .
@cem__anil
Cem Anil
2 years
Path independence describes the **insensitivity of a system’s asymptotic behaviour to its initialization.** A weather simulator is path dependent: different inits → different outputs. A pendulum or a convex optimization solver is path independent: different inits → same output.
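A toy numerical illustration of the distinction (my own example, not from the paper):

```python
# Toy illustration: iterate two update rules from several initializations.
# Gradient descent on the convex quadratic (x - 3)^2 forgets its init;
# the logistic map at r = 3.9 (chaotic) does not.

def converge(step, x0, iters=1000):
    x = x0
    for _ in range(iters):
        x = step(x)
    return x

quad_step = lambda x: x - 0.2 * (x - 3.0)   # gradient step on (x - 3)^2
chaos_step = lambda x: 3.9 * x * (1.0 - x)  # logistic map

for x0 in (0.1, 0.4, 0.7):
    print(f"init {x0}: quad -> {converge(quad_step, x0):.4f}, "
          f"chaos -> {converge(chaos_step, x0):.4f}")
# The quad column is ≈ 3.0 for every init (path independent); the chaos
# column differs across inits (path dependent).
```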
@cem__anil
Cem Anil
2 years
There’s way more in the paper — check it out if you’re interested! Paper: []() Also come say hi at @neurips!
@cem__anil
Cem Anil
2 years
See paper for more - especially our detailed analyses regarding the failure modes of finetuning. Joint work with my fantastic collaborators @Yuhu_ai_ , Anders, @alewkowycz , @vedantmisra , @vinayramasesh , @AmbroseSlone , @guygr , @ethansdyer and @bneyshabur . 8/
@cem__anil
Cem Anil
2 years
Length generalization (LG) is important: Often, long examples are rare and intrinsically more difficult, yet are the ones we care more about. We run careful experiments on two tasks: parity prediction and Boolean variable assignment — a type of program execution task. 2/
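A minimal sketch of instance generators for the two tasks, with input length as the generalization axis (formats are illustrative, not the paper's):

```python
import random

# Sketch: generators for the two tasks studied, parameterized by length.

def parity_example(length):
    bits = [random.randint(0, 1) for _ in range(length)]
    return bits, sum(bits) % 2  # label: parity of the bit string

def boolean_assignment_example(length):
    # Program-execution flavour: sequential Boolean assignments, then a query.
    values, lines = {}, []
    for i in range(length):
        v = random.choice([True, False])
        values[f"x{i}"] = v
        lines.append(f"x{i} = {v}")
    query = f"x{random.randrange(length)}"
    return "\n".join(lines) + f"\nprint({query})", values[query]
```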
@cem__anil
Cem Anil
2 years
Equilibrium models (≈ infinite-depth RNNs with input injection — see figure below) display fantastic upwards generalization when trained properly. When does this capability arise? We show that upwards generalization is closely tied to a property we call **Path Independence.**
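In code, the forward pass of such a model is just a fixed-point iteration, roughly along these lines (a minimal sketch with random untrained weights; a trained model learns W, U, b):

```python
import numpy as np

# Sketch: an equilibrium model as an infinite-depth RNN with input
# injection, z_{k+1} = tanh(W z_k + U x + b), iterated until z stops
# changing.

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))  # small norm -> contraction
U = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
b = np.zeros(d)

def deq_forward(x, z0=None, iters=200, tol=1e-8):
    z = np.zeros(d) if z0 is None else z0
    for _ in range(iters):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z  # approximate equilibrium z* with f(z*, x) = z*

x = rng.normal(size=d)
z_star = deq_forward(x)
```

Iterating longer at test time is the "think for longer" knob; path independence is what makes the final z* insensitive to the initialization z0.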
@cem__anil
Cem Anil
2 years
Come say hi if you’re at @neurips this Tuesday! 11am, Hall J #323: Path Independent Equilibrium Models Can Better Exploit Test-Time Computation [] 4pm, Hall J, #107: Exploring Length Generalization in Large Language Models []
@cem__anil
Cem Anil
3 years
Exciting ICLR2021 workshop!
@Yuhu_ai_
Yuhuai (Tony) Wu
3 years
We’re excited to announce the MathAI workshop at ICLR 2021: On the Role of Mathematical Reasoning in General Artificial Intelligence. Now accepting submissions! Submission Link: Deadline: Feb 26, 11:59PM PST
@cem__anil
Cem Anil
3 months
@CFGeek Thanks! We’re aware of this kind of scaling law for token-wise losses. The first author of the paper you linked is a co-author on ours :) I should have said few/many-shot learning in my tweet above, which has a shared but different problem structure.
@cem__anil
Cem Anil
2 years
We empirically show that path-independence (PI) is tied to upwards gen., because it lets models think for longer on harder problems. **PI hypothesis:** Given models that fit the training set, the PI ones can better exploit test time compute, displaying better upwards gen.
@cem__anil
Cem Anil
2 years
Correlation alone is not strong evidence — maybe there are confounders! We also confirm that intervening in the training setup to directly promote path independence improves generalization. Conversely, penalizing path independence hurts generalization.
@cem__anil
Cem Anil
2 years
@osageev Chopin Ballade No. 1?
@cem__anil
Cem Anil
2 years
@AnimaAnandkumar @neurips Thank you for your comment! It’s very surprising that one-step evolution data alone is enough to learn to predict the long-horizon behaviour of chaotic systems. Your dissipativity regularizer seems very useful for us as well. Thanks again for the pointer!
@cem__anil
Cem Anil
2 years
After reliably quantifying how path-independent a given model is (see paper for more), we demonstrate a strong connection between PI and upwards generalization.
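One simple way such a measurement could look (a sketch; the paper's actual metric may differ): run the model from several initializations and measure the spread of the final states.

```python
import numpy as np

# Sketch of a path-independence score: dispersion of the states reached
# from different initializations. Near zero -> path independent.

def path_independence_score(step_fn, inits, iters=500):
    finals = []
    for z0 in inits:
        z = z0
        for _ in range(iters):
            z = step_fn(z)
        finals.append(z)
    finals = np.stack(finals)
    return float(np.mean(np.linalg.norm(finals - finals.mean(axis=0), axis=-1)))

rng = np.random.default_rng(0)
contraction = lambda z: 0.5 * z + 1.0  # path independent: fixed point at 2
print(path_independence_score(contraction, [rng.normal(size=4) for _ in range(8)]))  # ≈ 0
```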
@cem__anil
Cem Anil
3 months
@CFGeek Agreed! Seems quite interesting. Another similar idea: Let’s say the function vectors are largely responsible for few-shot learning on simple tasks. There are not that many attention heads that implement the function vector mechanism afaik. ⬇️
@cem__anil
Cem Anil
5 years
@nickfrosst This is great!
@cem__anil
Cem Anil
5 years
Very interesting work!
@niru_m
Niru Maheswaranathan
5 years
New work out on arXiv! Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics (), with fantastic co-authors @ItsNeuronal , @MattGolub_Neuro , @SuryaGanguli and @SussilloDavid . #tweetprint summary below! 👇🏾 (1/4)
@cem__anil
Cem Anil
3 months
@CFGeek There is a huge overlap between these for sure. I think the structure and data distribution differ enough that results on one might not generalize readily to the other. E.g. the task vector mechanism seems fairly specific to few/many shot learning.
@cem__anil
Cem Anil
3 months
@CFGeek Say we deleted these and only these heads from the network. 1) Do we still get token-wise loss scaling laws under pretraining distr? If so, did the exponent change? 2) Do we still get few-shot learning scaling laws? If so, did the exponent change?
@cem__anil
Cem Anil
6 years
Inspiring talk by Prof. Michael Levin!
@cem__anil
Cem Anil
5 years
Great stuff!
@UofTNews
U of T News
5 years
With @Google's backing, @CreativeDLab startup @BenchSci uses #AI to create 'super scientists' #UofT
@cem__anil
Cem Anil
3 months
@agarwl_ @hu_yifei @arankomatsuzaki Great work, congrats! Loads of interesting new/complementary findings in there, definitely worth a detailed read. DM me about Claude research access, will check what’s possible :)
@cem__anil
Cem Anil
2 years
@saakethmm The number of padding tokens on the left/right is indeed random (the total number of tokens is fixed) — to train all position embeddings even on short instances. We add the same number of padding tokens to both the input and the scratchpad to keep the distance between them fixed.
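A minimal sketch of that scheme as I understand it (names and details are mine, not from the paper's code):

```python
import random

# Sketch: fixed total padding budget, random left/right split, and the
# same split applied to input and scratchpad so their distance is fixed.

def pad_randomly(input_toks, scratchpad_toks, total_pad, pad="<pad>"):
    left = random.randint(0, total_pad)
    right = total_pad - left
    return ([pad] * left + input_toks + [pad] * right,
            [pad] * left + scratchpad_toks + [pad] * right)
```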