Cem Anil

@cem__anil

2,041 Followers · 1,389 Following · 15 Media · 469 Statuses

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. student researcher @google (Blueshift Team) and @nvidia.

Toronto, Ontario
Joined November 2018
Pinned Tweet
@cem__anil
Cem Anil
3 months
AIs of tomorrow will spend much more of their compute on adapting and learning during deployment. Our first foray into quantitatively studying and forecasting risks from this trend looks at new jailbreaks arising from long contexts. Link:
@AnthropicAI
Anthropic
3 months
New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here:
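For intuition, here is a minimal sketch of how such a many-shot prompt is structured (the helper name and demo data are my own placeholders, not from the paper; the actual attack fills the demonstrations with harmful Q/A pairs, omitted here):

```python
# Minimal sketch of a many-shot prompt: n faux dialogue turns followed by
# a target query. Demo content here is a benign placeholder; the paper's
# attack uses harmful Q/A pairs and varies n_shots to measure how
# effectiveness scales.

def build_many_shot_prompt(demos, target_question, n_shots):
    turns = [f"User: {q}\nAssistant: {a}\n\n" for q, a in demos[:n_shots]]
    return "".join(turns) + f"User: {target_question}\nAssistant:"

demos = [(f"question {i}?", f"answer {i}.") for i in range(256)]
prompt = build_many_shot_prompt(demos, "final question?", n_shots=128)
```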
@cem__anil
Cem Anil
2 years
🆕📜We study large language models’ ability to extrapolate to longer problems! 1) finetuning (with and without scratchpad) fails 2) few-shot scratchpad confers significant improvements 3) Many more findings (see the table & thread) Paper: [] 1/
@cem__anil
Cem Anil
2 years
🆕📜When can **Equilibrium Models** learn from simple examples to handle complex ones? We identify a property — Path Independence — that enables this by letting EMs think for longer on hard examples. (NeurIPS) 📝: []()
@cem__anil
Cem Anil
3 months
One of our most crisp findings was that in-context learning usually follows simple power laws as a function of number of demonstrations. We were surprised we didn’t find this stated explicitly in the literature. Soliciting pointers: have we missed anything?
@AnthropicAI
Anthropic
3 months
The effectiveness of many-shot jailbreaking (MSJ) follows simple scaling laws as a function of the number of shots. This turns out to be a more general finding. Learning from demonstrations—harmful or not—often follows the same power law scaling:
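For concreteness, here is a minimal sketch of the kind of fit involved, with made-up loss values: a power law NLL(n) ≈ C·n^(−α) becomes a straight line in log-log space.

```python
import numpy as np

# Sketch: fitting a power law NLL(n) ≈ C * n^(-alpha) to an in-context
# learning curve by linear regression in log-log space. The loss values
# below are invented for illustration.

n_shots = np.array([1, 2, 4, 8, 16, 32, 64, 128])
nll = np.array([2.10, 1.75, 1.47, 1.23, 1.04, 0.88, 0.74, 0.62])

slope, intercept = np.polyfit(np.log(n_shots), np.log(nll), deg=1)
alpha, C = -slope, np.exp(intercept)
print(f"fitted power law: NLL(n) ≈ {C:.2f} * n^(-{alpha:.2f})")
```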
@cem__anil
Cem Anil
2 years
Two high level takeaways: 1. Exploiting the pattern-matching capabilities of LLMs with no architectural tweaks can go surprisingly far. 2. Certain skills, like length generalization, can be learned better via in-context learning than via finetuning, even with infinite data. 7/
@cem__anil
Cem Anil
2 years
How about few-shot scratchpad, a combo behind many strong LLM results? (e.g. our recent #Minerva) This leads to **substantial improvements in length generalization!** In-context learning enables variable-length pattern matching, producing solutions of correct lengths. 5/
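For intuition, a minimal sketch of what a few-shot scratchpad prompt can look like on the parity task (the format is my own; the paper's prompts likely differ):

```python
import random

# Sketch of a few-shot scratchpad prompt for parity: each demonstration
# spells out per-step intermediate state, so generated solutions come out
# at the right length for longer inputs too. Format is illustrative.

def scratchpad_demo(bits):
    steps, parity = [], 0
    for b in bits:
        parity ^= b
        steps.append(f"after {b}: parity={parity}")
    return f"Input: {bits}\nScratchpad: " + "; ".join(steps) + f"\nAnswer: {parity}"

demos = [[random.randint(0, 1) for _ in range(k)] for k in (3, 5, 8)]
prompt = "\n\n".join(scratchpad_demo(d) for d in demos)
prompt += "\n\nInput: [1, 1, 0, 1]\nScratchpad:"
```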
@cem__anil
Cem Anil
2 years
Highly recommended! Spending time at Google Blueshift feels like taking a sneak peek into what the AI scene will look like a few years ahead. Best part, of course, is working closely with a fantastic team! @bneyshabur @Yuhu_ai_ @guygr @ethansdyer
@bneyshabur
Behnam Neyshabur
2 years
🔥Internship Opportunity on Improving the Reasoning Capabilities of Massive Language Models🔥: solving challenging problems in areas such as mathematics, science, programming, algorithms, and planning. Please see the following link for more info:
@cem__anil
Cem Anil
2 years
How does standard finetuning perform? The answer is: **very poorly, even with extensive scaling (up to 64b parameters).** Performance degrades rapidly on OOD lengths in a manner very similar across vastly different model sizes. 3/
@cem__anil
Cem Anil
2 years
**It’s crucial to study upwards generalization:** It determines when a learner can go beyond the skill levels represented in a training set — useful for forecasting. One prereq for upwards gen. is the ability to think for longer when needed — something countless tasks require.
@cem__anil
Cem Anil
2 years
What if we use a scratchpad? [] Surprisingly, **this doesn’t work either, even at scale!** Issues persist even when we account for subtleties regarding position encodings and EOS prediction. (see paper for more) 4/
@cem__anil
Cem Anil
2 years
Unlike scratchpad finetuning, where the per-step error rate quickly increases on OOD lengths, the per-step error rate of few-shot scratchpad solutions follows a roughly constant trend - there’s no abrupt performance decrease on longer problems! 6/
@cem__anil
Cem Anil
2 years
This was a fantastic collaboration with my amazing co-authors Ashwini Pokle* @ashwini1024 , Kaiqu Liang* @kevin_lkq , Johannes Treutlein @JohannesTreutle , Yuhuai (Tony) Wu @Yuhu_ai_ , Shaojie Bai @shaojieb , Zico Kolter @zicokolter and Roger Grosse @RogerGrosse .
@cem__anil
Cem Anil
2 years
Path independence describes the **insensitivity of a system’s asymptotic behaviour to its initialization.** A weather simulator is path dependent: different inits → different outputs. A pendulum or a convex optimization solver is path independent: different inits → same output.
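A toy numerical illustration of the distinction (my own example, not from the paper):

```python
# Toy illustration: iterate two update rules from several initializations.
# Gradient descent on the convex quadratic (x - 3)^2 forgets its init;
# the logistic map at r = 3.9 (chaotic) does not.

def converge(step, x0, iters=1000):
    x = x0
    for _ in range(iters):
        x = step(x)
    return x

quad_step = lambda x: x - 0.2 * (x - 3.0)   # gradient step on (x - 3)^2
chaos_step = lambda x: 3.9 * x * (1.0 - x)  # logistic map

for x0 in (0.1, 0.4, 0.7):
    print(f"init {x0}: quad -> {converge(quad_step, x0):.4f}, "
          f"chaos -> {converge(chaos_step, x0):.4f}")
# The quad column is ≈ 3.0 for every init (path independent); the chaos
# column differs across inits (path dependent).
```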
@cem__anil
Cem Anil
2 years
There’s way more in the paper — check it out if you’re interested! Paper: []() Also come say hi at @neurips!
@cem__anil
Cem Anil
2 years
See paper for more - especially our detailed analyses regarding the failure modes of finetuning. Joint work with my fantastic collaborators @Yuhu_ai_ , Anders, @alewkowycz , @vedantmisra , @vinayramasesh , @AmbroseSlone , @guygr , @ethansdyer and @bneyshabur . 8/
@cem__anil
Cem Anil
2 years
Length generalization (LG) is important: Often, long examples are rare and intrinsically more difficult, yet are the ones we care more about. We run careful experiments on two tasks: parity prediction and Boolean variable assignment — a type of program execution task. 2/
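A minimal sketch of instance generators for the two tasks, with input length as the generalization axis (formats are illustrative, not the paper's):

```python
import random

# Sketch: generators for the two tasks studied, parameterized by length.

def parity_example(length):
    bits = [random.randint(0, 1) for _ in range(length)]
    return bits, sum(bits) % 2  # label: parity of the bit string

def boolean_assignment_example(length):
    # Program-execution flavour: sequential Boolean assignments, then a query.
    values, lines = {}, []
    for i in range(length):
        v = random.choice([True, False])
        values[f"x{i}"] = v
        lines.append(f"x{i} = {v}")
    query = f"x{random.randrange(length)}"
    return "\n".join(lines) + f"\nprint({query})", values[query]
```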
@cem__anil
Cem Anil
2 years
Equilibrium models (≈ infinite-depth RNNs with input injection — see figure below) display fantastic upwards generalization when trained properly. When does this capability arise? We show that upwards generalization is closely tied to a property we call **Path Independence.**
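In code, the forward pass of such a model is just a fixed-point iteration, roughly along these lines (a minimal sketch with random untrained weights; a trained model learns W, U, b):

```python
import numpy as np

# Sketch: an equilibrium model as an infinite-depth RNN with input
# injection, z_{k+1} = tanh(W z_k + U x + b), iterated until z stops
# changing.

rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))  # small norm -> contraction
U = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
b = np.zeros(d)

def deq_forward(x, z0=None, iters=200, tol=1e-8):
    z = np.zeros(d) if z0 is None else z0
    for _ in range(iters):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z  # approximate equilibrium z* with f(z*, x) = z*

x = rng.normal(size=d)
z_star = deq_forward(x)
```

Iterating longer at test time is the "think for longer" knob; path independence is what makes the final z* insensitive to the initialization z0.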
@cem__anil
Cem Anil
2 years
Come say hi if you’re at @neurips this Tuesday! 11am, Hall J #323: Path Independent Equilibrium Models Can Better Exploit Test-Time Computation [] 4pm, Hall J, #107: Exploring Length Generalization in Large Language Models []
@cem__anil
Cem Anil
3 years
Exciting ICLR2021 workshop!
@Yuhu_ai_
Yuhuai (Tony) Wu
3 years
We’re excited to announce the MathAI workshop at ICLR 2021: On the Role of Mathematical Reasoning in General Artificial Intelligence. Now accepting submissions! Submission Link: Deadline: Feb 26, 11:59PM PST
@cem__anil
Cem Anil
3 months
@CFGeek Thanks! We’re aware of this kind of scaling law for token-wise losses. The first author of the paper you linked is a co-author on ours :) I should have said few/many-shot learning in my tweet above, which has a shared but different problem structure.
@cem__anil
Cem Anil
2 years
We empirically show that path-independence (PI) is tied to upwards gen., because it lets models think for longer on harder problems. **PI hypothesis:** Given models that fit the training set, the PI ones can better exploit test time compute, displaying better upwards gen.
@cem__anil
Cem Anil
2 years
Correlation alone is not strong evidence — maybe there are confounders! We also confirm that intervening in the training setup to directly promote path independence improves generalization. Conversely, penalizing path independence hurts generalization.
@cem__anil
Cem Anil
2 years
@osageev Chopin Ballade No. 1?
@cem__anil
Cem Anil
2 years
@AnimaAnandkumar @neurips Thank you for your comment! It’s very surprising that one-step evolution data alone is enough to learn to predict the long-horizon behaviour of chaotic systems. Your dissipativity regularizer seems very useful for us as well. Thanks again for the pointer!
@cem__anil
Cem Anil
2 years
After reliably quantifying how path-independent a given model is (see paper for more), we demonstrate a strong connection between PI and upwards generalization.
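One simple way such a measurement could look (a sketch; the paper's actual metric may differ): run the model from several initializations and measure the spread of the final states.

```python
import numpy as np

# Sketch of a path-independence score: dispersion of the states reached
# from different initializations. Near zero -> path independent.

def path_independence_score(step_fn, inits, iters=500):
    finals = []
    for z0 in inits:
        z = z0
        for _ in range(iters):
            z = step_fn(z)
        finals.append(z)
    finals = np.stack(finals)
    return float(np.mean(np.linalg.norm(finals - finals.mean(axis=0), axis=-1)))

rng = np.random.default_rng(0)
contraction = lambda z: 0.5 * z + 1.0  # path independent: fixed point at 2
print(path_independence_score(contraction, [rng.normal(size=4) for _ in range(8)]))  # ≈ 0
```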
@cem__anil
Cem Anil
3 months
@CFGeek Agreed! Seems quite interesting. Another similar idea: Let’s say the function vectors are largely responsible for few-shot learning on simple tasks. There are not that many attention heads that implement the function vector mechanism afaik. ⬇️
@cem__anil
Cem Anil
5 years
@nickfrosst This is great!
@cem__anil
Cem Anil
5 years
Very interesting work!
@niru_m
Niru Maheswaranathan
5 years
New work out on arXiv! Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics (), with fantastic co-authors @ItsNeuronal , @MattGolub_Neuro , @SuryaGanguli and @SussilloDavid . #tweetprint summary below! 👇🏾 (1/4)
@cem__anil
Cem Anil
3 months
@CFGeek There is a huge overlap between these for sure. I think the structure and data distribution differ enough that results on one might not generalize readily to the other. E.g. the task vector mechanism seems fairly specific to few/many shot learning.
@cem__anil
Cem Anil
3 months
@CFGeek Say we deleted these and only these heads from the network. 1) Do we still get token-wise loss scaling laws under pretraining distr? If so, did the exponent change? 2) Do we still get few-shot learning scaling laws? If so, did the exponent change?
@cem__anil
Cem Anil
6 years
Inspiring talk by Prof. Michael Levin!
@cem__anil
Cem Anil
5 years
Great stuff!
@UofTNews
U of T News
5 years
With @Google's backing, @CreativeDLab startup @BenchSci uses #AI to create 'super scientists' #UofT
@cem__anil
Cem Anil
3 months
@agarwl_ @hu_yifei @arankomatsuzaki Great work, congrats! Loads of interesting new/complementary findings in there, definitely worth a detailed read. DM me about Claude research access, will check what’s possible :)
@cem__anil
Cem Anil
2 years
@saakethmm The number of padding tokens on the left/right is indeed random (the total number of tokens is fixed) — to train all position embeddings even on short instances. We add the same number of padding tokens to both the input and the scratchpad to keep the distance between them fixed.
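A minimal sketch of that scheme as I understand it (names and details are mine, not from the paper's code):

```python
import random

# Sketch: fixed total padding budget, random left/right split, and the
# same split applied to input and scratchpad so their distance is fixed.

def pad_randomly(input_toks, scratchpad_toks, total_pad, pad="<pad>"):
    left = random.randint(0, total_pad)
    right = total_pad - left
    return ([pad] * left + input_toks + [pad] * right,
            [pad] * left + scratchpad_toks + [pad] * right)
```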