Xander Davies Profile Banner
Xander Davies Profile
Xander Davies

@alxndrdavies

Followers: 1,196
Following: 556
Media: 36
Statuses: 284

technical staff @AISafetyInst PhD student w @yaringal at @OATML_Oxford prev @Harvard ()

London
Joined March 2020
@alxndrdavies
Xander Davies
1 year
New paper! We find that removing just 12 of the ~10K causal edges in GPT-2 reduces toxic generation by >25% (!). Max Li, @MaxNadeau_ , and I explore performing 𝐭𝐚𝐫𝐠𝐞𝐭𝐞𝐝 𝐚𝐛𝐥𝐚𝐭𝐢𝐨𝐧𝐬 in our #ICML2023 workshop paper. 🧵
Tweet media one
11
36
294
@alxndrdavies
Xander Davies
11 months
Proud to be a researcher at the UK's AI Safety Institute—what is it? 🧵 based on yesterday's introduction:
Tweet media one
6
34
237
@alxndrdavies
Xander Davies
10 months
The UK's AI Safety Institute is hiring. In my (biased) view, this is one of the best places to do AI research/engineering for the public good, with top talent & the resources / backing / access of gov. Super excited for even more technical ppl to join gov :) 🧵 on open roles! 1/9
9
83
230
@alxndrdavies
Xander Davies
1 year
Two recent papers on "red-teaming" LLMs have really impressed me. 🧵 on what they are and why I'm excited about them!
2
26
158
@alxndrdavies
Xander Davies
1 year
As AI systems become more powerful, it’s increasingly important to clearly communicate the insufficiencies of our safety techniques. New paper with (many!) wonderful coauthors!
@StephenLCasper
Cas (Stephen Casper)
1 year
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
Tweet media one
15
167
714
4
4
50
@alxndrdavies
Xander Davies
1 year
I don't see how to reconcile a discoverable & editable world model with a "stochastic parrot" view of language models... is the argument that models like the toy model studied in this paper aren't stochastic parrots, but LLMs are?
3
2
34
@alxndrdavies
Xander Davies
11 months
We are hiring! Including secondments. If there was ever a well-placed public option to work towards safe & beneficial AI, it is the safety institute.
0
4
21
@alxndrdavies
Xander Davies
10 months
We will be opening up more roles in the coming months, to support our work on assessing societal harms, biological and chemical capabilities, and AI & democracy. More info on those coming soon! Express interest:
@alxndrdavies
Xander Davies
10 months
The UK's AI Safety Institute is hiring. In my (biased) view, this is one of the best places to do AI research/engineering for the public good, with top talent & the resources / backing / access of gov. Super excited for even more technical ppl to join gov :) 🧵 on open roles! 1/9
9
83
230
3
1
21
@alxndrdavies
Xander Davies
2 years
@OpenAI released GPT-4 today. It's much more capable than previous models, and now crushes the bar exam, LSAT, and lots of AP exams. Unfortunately, there's been very little recent progress towards solving core safety concerns. 🧵
1
4
19
@alxndrdavies
Xander Davies
2 years
@thecrimson I helped start the Harvard AI Safety Team (HAIST) last spring, which was originally a reading group on relevant machine learning papers. We've since grown to over 35 members, and run programming for over 75 people a semester.
1
0
18
@alxndrdavies
Xander Davies
9 months
join us! listings with lots of role details now live
@saffronhuang
Saffron Huang
9 months
The UK AI Safety Institute is hiring more technical staff! I believe AISI is one of the best places to do ML research/eng for the public good. We’re having impact at the scale of government, while moving at the pace of a startup (what more could you ask for?)
Tweet media one
5
31
99
1
1
13
@alxndrdavies
Xander Davies
9 months
79 years ago today, Auschwitz was liberated. My first real coding project was animating the lives and sudden deaths of many of its >1 million victims, using data from the Central Database of Shoah Victims’ Names. Each dot is the life of a specific Jew murdered in Auschwitz.
1
1
16
@alxndrdavies
Xander Davies
8 months
Consider applying to UK AISI's technical Safeguard Analysis Team! Governments need a clear/SOTA understanding of how well safeguards work in frontier AI systems. Short 🧵 with some team info; deadline for this round 27/2. 1/6
Tweet media one
1
5
20
@alxndrdavies
Xander Davies
1 year
They describe two failure modes: (1) competing objectives, where different training objectives (like language modeling, instruction following, and safety) are pitted against each other, and (2) mismatched generalization, where capabilities generalize but safety measures do not.
Tweet media one
1
2
15
@alxndrdavies
Xander Davies
10 months
𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿, embedded in research teams across AISI to develop tools/methods for testing AI systems and/or pushing the frontier of understanding / mitigations—we expect REs to be deeply involved in all stages of research. 2/9
Tweet media one
1
1
14
@alxndrdavies
Xander Davies
1 year
And congrats to Nicholas Carlini, @srxzr , Christopher A. Choquette-Choo, Matthew Jagielski, @irena_gao , @anas_awadalla , @PangWeiKoh , @daphneipp , @katherine1ee , @florian_tramer , and Ludwig Schmidt for the second! More at .
0
1
14
@alxndrdavies
Xander Davies
11 months
I am lucky to work alongside such excellent researchers as @yaringal , @DavidSKrueger , Jade Leung, @ruchowdh (soon), and many others! + significant compute and funding.
Tweet media one
1
0
13
@alxndrdavies
Xander Davies
11 months
1) Mission. "Minimise surprise to the UK and humanity from rapid and unexpected advances in AI", by "developing the sociotechnical infrastructure needed to understand the risks of advanced AI and enable its governance."
3
0
13
@alxndrdavies
Xander Davies
11 months
2.1) Function 1. "Develop and conduct evaluations of advanced AI systems", including of dual-use capabilities, societal impacts, system safety & security, and loss of control. More info:
Tweet media one
Tweet media two
1
0
12
@alxndrdavies
Xander Davies
10 months
finally, a 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗶𝗻𝘁𝗲𝗿𝗲𝘀𝘁 𝗳𝗼𝗿𝗺 if you want to be involved in some other way! We'll be hiring for many more roles (e.g. RS) in the near future, fill this out now to be in on that. 8/9
1
1
11
@alxndrdavies
Xander Davies
11 months
We're well placed to do this—yesterday, leading companies agreed to work with Govts to conduct pre- and post-deployment testing of their next gen models.
Tweet media one
1
0
11
@alxndrdavies
Xander Davies
1 year
The first, "Jailbroken: How Does LLM Safety Training Fail?", is the best conceptual progress on jailbreaking—crafting natural language prompts which break safety measures—I've seen.
1
0
11
@alxndrdavies
Xander Davies
1 year
Meanwhile, we find that performance on non-toxic prompts is largely unaltered, and text generated in response to toxic prompts remains coherent. More details, comparisons to baselines, and an additional experimental setting in the paper!
0
1
11
@alxndrdavies
Xander Davies
10 months
4 roles working to build our evals platform:
- 𝘀𝗼𝗳𝘁𝘄𝗮𝗿𝗲 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿
- 𝗳𝗿𝗼𝗻𝘁𝗲𝗻𝗱 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗿
- 𝘂𝘅 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿
- 𝘂𝘅 𝗱𝗲𝘀𝗶𝗴𝗻𝗲𝗿
3/9
1
0
10
@alxndrdavies
Xander Davies
1 year
They then attack multimodal models, and find traditional computer vision attacks are 100% effective at finding adversarial image inputs in the models they evaluate!
Tweet media one
1
0
10
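For context on what a "traditional computer vision attack" means here, below is a minimal projected-gradient-descent (PGD) sketch, not the paper's code. `model` and `target_loss` are hypothetical placeholders; for a multimodal jailbreak, the loss would measure how close the model's text output is to some target completion.

```python
import torch

def pgd_attack(model, image, target_loss, eps=8 / 255, alpha=1 / 255, steps=40):
    """Return an adversarial image within an L-infinity ball of radius eps around `image`.

    `model` and `target_loss` are hypothetical placeholders for the attacked
    model and a scalar loss that is low when the attack succeeds.
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = target_loss(model(adv))
        grad = torch.autograd.grad(loss, adv)[0]
        # Step in the direction that decreases the loss, then project back
        # into the allowed perturbation ball and the valid pixel range.
        adv = adv.detach() - alpha * grad.sign()
        adv = image + torch.clamp(adv - image, -eps, eps)
        adv = adv.clamp(0.0, 1.0)
    return adv
```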
@alxndrdavies
Xander Davies
11 months
It won't act as a regulator—it will "provide foundational insights to our governance regime and be a leading player in ensuring that the UK takes an evidence-based, proportionate response to regulating the risks of AI."
1
0
10
@alxndrdavies
Xander Davies
10 months
𝗹𝗼𝘀𝘀 𝗼𝗳 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝗲𝘃𝗮𝗹𝘀 𝗹𝗲𝗮𝗱, building/leading a team focused on capabilities that are precursors to extreme harms from loss of control, e.g., autonomous replication/adaptation + uncontrolled self-improvement. 6/9
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
11 months
I am extremely grateful to @soundboy and many others ( @nitarshan , @Oliver_ilott , @HZoete ...) for all they have done to set this up, and to @RishiSunak and @michelledonelan for their global leadership
1
0
8
@alxndrdavies
Xander Davies
1 year
Combining two or three attacks inspired by these ideas reliably jailbreaks GPT-4 (94%) and Claude v1.3 (84%)—but the more striking takeaway (and why I think this is esp. important!) is that 𝐭𝐡𝐞𝐬𝐞 𝐟𝐚𝐢𝐥𝐮𝐫𝐞 𝐦𝐨𝐝𝐞𝐬 𝐦𝐢𝐠𝐡𝐭 𝐧𝐨𝐭 𝐠𝐨 𝐚𝐰𝐚𝐲 𝐰𝐢𝐭𝐡 𝐬𝐜𝐚𝐥𝐞.
1
1
9
@alxndrdavies
Xander Davies
11 months
It's also already been supported by Canada, Japan, Germany, Amazon, Anthropic, Google DeepMind, Inflection, Meta, Microsoft, OpenAI, Alan Turing Institute, Startup Coalition, techUK, and others!
1
0
8
@alxndrdavies
Xander Davies
2 years
When everyone is hyping up AI progress, it's good to be skeptical. But I worry it's easy and dangerous for that skepticism to grow into undue confidence that the hype is wrong.
1
0
8
@alxndrdavies
Xander Davies
11 months
2.2) Function 2. "Driving foundational AI safety research", since system evaluations alone are not sufficient to ensure safe & beneficial AI:
Tweet media one
Tweet media two
1
0
8
@alxndrdavies
Xander Davies
10 months
𝗰𝘆𝗯𝗲𝗿 𝗺𝗶𝘀𝘂𝘀𝗲 𝗲𝘃𝗮𝗹𝘀 𝗹𝗲𝗮𝗱, building/leading a team testing cyber capabilities of frontier systems, esp. studying potential uplift to novice actors; incl. threat modelling / cyber ranges / working closely with partners in & out of gov. 5/9
Tweet media one
1
1
8
@alxndrdavies
Xander Davies
11 months
3.1) International partnerships. Already, the Safety Institute has partnered with the new US AI Safety Institute and the Government of Singapore to collaborate on AI safety testing.
Tweet media one
Tweet media two
1
0
8
@alxndrdavies
Xander Davies
11 months
Much more in the full introduction!
1
0
8
@alxndrdavies
Xander Davies
1 year
But how do we figure out which edges to ablate? We automatically learn them by training a (continuous relaxation of a) binary edge mask to perform poorly on our negative examples, while maintaining performance on the training set. More in paper!
1
0
8
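A rough sketch of the "continuous relaxation of a binary edge mask" idea described above, under assumptions about the setup: each edge gets a learnable logit, the masked forward pass scales each edge between its true value (mask ≈ 1) and its mean activation (mask ≈ 0), and the objective pushes loss up on the negative (toxic) data while keeping it low on clean data. `model_forward_with_mask` and the batches are hypothetical placeholders, and the paper's exact loss terms may differ.

```python
import torch
import torch.nn.functional as F

def learn_edge_mask(model_forward_with_mask, clean_batch, toxic_batch,
                    num_edges=10_000, steps=200):
    """Learn a relaxed binary mask over edges: keep clean loss low, push toxic loss up."""
    edge_logits = torch.nn.Parameter(torch.full((num_edges,), 4.0))  # start ~"all edges on"
    opt = torch.optim.Adam([edge_logits], lr=1e-2)

    def lm_loss(batch, mask):
        # Language-modelling loss with every edge scaled by its mask entry
        # (1 = pass the true value, 0 = pass the edge's mean activation).
        logits = model_forward_with_mask(batch["input_ids"], mask)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )

    for _ in range(steps):
        mask = torch.sigmoid(edge_logits)
        loss_clean = lm_loss(clean_batch, mask)   # keep this low
        loss_toxic = lm_loss(toxic_batch, mask)   # push this up
        sparsity = (1.0 - mask).mean()            # discourage turning off many edges
        loss = loss_clean - loss_toxic + 0.1 * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Threshold the relaxed mask to pick the small set of edges to ablate.
    return (torch.sigmoid(edge_logits) < 0.5).nonzero(as_tuple=True)[0]
```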
@alxndrdavies
Xander Davies
1 year
Very excited to see this! Note (alarmingly!) with current AI models, we don’t yet have a reliable way to find “An explanation for how it arrives at its responses.”
@SenSchumer
Chuck Schumer
1 year
Today, I’m launching a major new first-of-its-kind effort on AI and American innovation leadership.
2K
613
4K
1
0
8
@alxndrdavies
Xander Davies
2 years
Despite 6 months of effort, from what I can tell, very few novel safety techniques were used in training GPT-4.
1
2
6
@alxndrdavies
Xander Davies
3 years
As the nerd in question, I appreciate the assist @mattyglesias .
1
1
7
@alxndrdavies
Xander Davies
2 years
Now (as many of us transition back to mostly maskless) seems like a really good time to set the norm of masking in public when sick!
1
0
7
@alxndrdavies
Xander Davies
2 years
I worry that as systems get more powerful and undergo more RLHF, AI failures will become more subtle, while also becoming more problematic as the stakes get higher. That's why I'm not reassured by graphs like these—they don't address my core concerns.
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
10 months
streaming has been great for consumers but really rough for lots of musicians—if there is eventually a transition to AI-generated or assisted music, i hope we figure out how to make it benefit everyone
@suno_ai_
Suno
10 months
You can make great music, whether you're a shower singer or a charting artist. No instrument needed, just imagination. Make your song today at 🎧
263
335
2K
1
1
7
@alxndrdavies
Xander Davies
1 year
What's an ablation? We can think of a transformer as a computational graph, with a node for every attention head / MLP, and edges for information transfer between them. We remove a certain node-to-node dependency by setting the edge to always pass a certain value (eg the mean).
1
1
7
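To make the mechanic concrete, here is a minimal PyTorch sketch of a mean ablation. The paper ablates individual edges of the computational graph; for simplicity this sketch shows the coarser node-level version, replacing a whole module's output with a precomputed mean. `model`, `layer_idx`, and `mean_activation` are hypothetical placeholders.

```python
import torch

def make_mean_ablation_hook(mean_activation: torch.Tensor):
    """Return a forward hook that overwrites a module's output with a stored mean."""
    def hook(module, inputs, output):
        # Downstream nodes now see only the mean, never this node's true contribution.
        return mean_activation.expand_as(output)
    return hook

# Hypothetical usage with a Hugging Face GPT-2 model (MLP block of one layer):
# handle = model.transformer.h[layer_idx].mlp.register_forward_hook(
#     make_mean_ablation_hook(mean_activation)
# )
# ... run the ablated model on prompts, measure toxicity / perplexity ...
# handle.remove()
```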
@alxndrdavies
Xander Davies
2 years
I find the combination of rapidly advancing AI capabilities with disappointing progress on safety chilling. I've always loved AI, and I wish these concerns didn't have to taint the beautiful scientific event we're witnessing.
2
0
7
@alxndrdavies
Xander Davies
1 year
Nice to see the new open-source AI assistant has decided to comply with Asimov's laws:
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
10 months
Cool survey from @CDEIUK on public attitudes towards data & AI! Interviewed 4k members of UK public + 200 interviews w digitally excluded adults. A few of their AI findings: 🧵, 1/8
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
2 years
RLHF has known limitations which make it widely considered an inadequate approach to producing models which reliably behave as intended. I'll talk about two such inadequacies: "incompetent overseer" (IO) problems and "goal mis-generalization" (GMG).
@percyliang
Percy Liang
2 years
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
77
84
958
1
0
7
@alxndrdavies
Xander Davies
1 year
We apply this technique to reduce toxic generation in GPT-2 using prompts from the Politically Incorrect board of 4chan. Early results find removing just 12 of the node-to-node dependencies (in red!) reduces the Detoxify toxicity score from 45% to 33%!!
Tweet media one
1
0
6
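For reference, the Detoxify score quoted above comes from an off-the-shelf toxicity classifier. Below is a sketch of how such a number could be computed over a set of model generations; it is not the paper's exact evaluation pipeline, and `generations` is a hypothetical list of sampled continuations.

```python
from detoxify import Detoxify  # pip install detoxify

# Score each generated continuation with the Detoxify classifier and average.
scorer = Detoxify("original")

generations = [
    "Example model continuation one.",   # hypothetical model outputs
    "Example model continuation two.",
]

scores = [scorer.predict(text)["toxicity"] for text in generations]
mean_toxicity = sum(scores) / len(scores)
print(f"Mean Detoxify toxicity: {mean_toxicity:.1%}")
```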
@alxndrdavies
Xander Davies
10 months
𝘀𝗮𝗳𝗲𝗴𝘂𝗮𝗿𝗱 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝗹𝗲𝗮𝗱, building/leading a team at intersection of ML & security to understand how well the safety/security components of frontier AI systems stand up to a range of threats (jailbreaking / data poisoning / etc). 4/9
Tweet media one
1
0
6
@alxndrdavies
Xander Davies
2 years
To their credit, @OpenAI is actively working on research relevant to reducing these risks. I'm also happy to see their collaboration with the Alignment Research Center. But it's also important to recognize when strong economic incentives may be at odds with safety concerns.
1
0
6
@alxndrdavies
Xander Davies
2 years
@GaryMarcus What are your thoughts on Burns et al (2022)? They find linear probes (and PCA on activations) do a decent job at truthfulness classification, and often perform better than zero-shot model outputs.
2
0
6
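For readers unfamiliar with the setup: a "linear probe on activations" just fits a linear classifier on a model's hidden states. A minimal sketch, with random arrays standing in for activations extracted from a language model over statements of known truth value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for hidden-state vectors and truth labels.
rng = np.random.default_rng(0)
acts_train = rng.normal(size=(500, 768))      # activations for 500 statements
labels_train = rng.integers(0, 2, size=500)   # 1 = true statement, 0 = false
acts_test = rng.normal(size=(100, 768))
labels_test = rng.integers(0, 2, size=100)

# Fit a linear probe and check how well truth can be read off linearly.
probe = LogisticRegression(max_iter=1000).fit(acts_train, labels_train)
print("Probe accuracy:", probe.score(acts_test, labels_test))
```

Burns et al. additionally show how to find such a direction without labels (contrast-consistent search); the labelled logistic probe above is the simplest version of the idea.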
@alxndrdavies
Xander Davies
11 months
2.3) Finally, function 3. "Facilitating information exchange", to mitigate "insight gaps" between industry, governments, academia, and the public. This might include incident reporting for harms & vulnerabilities, sharing usage data, and providing technical support to rest of gov
Tweet media one
1
0
6
@alxndrdavies
Xander Davies
10 months
For more info on AISI, check out the intro (), and feel very free to DM me w questions! 9/9
1
1
5
@alxndrdavies
Xander Davies
1 year
The paper (1) shows that current approaches to crafting text-based adversaries fail to find AdvExs in safety-trained models, but (2) demonstrates they also fail to find AdvExs guaranteed to be present, suggesting new stronger attacks are needed to evaluate robustness.
1
0
5
@alxndrdavies
Xander Davies
11 months
3.2) Other partnerships. The Safety Institute will also work closely with academia & civil society, the national security community, and industry.
Tweet media one
Tweet media two
1
0
5
@alxndrdavies
Xander Davies
1 year
It’s reasonable to demand that we figure out how to provide these explanations before AI can be deployed in high stakes settings!
1
0
5
@alxndrdavies
Xander Davies
1 year
Why do these attacks matter? Highly capable deployed models need to stay safe, even when interacting with adversarial users, and: "Without a solid foundation on understanding attacks, it is impossible to design robust defenses that withstand the test of time."
1
0
5
@alxndrdavies
Xander Davies
2 years
IO: Relying on human evaluations means we're susceptible to AI producing outputs which look good to evaluators, but are in fact flawed. There's already evidence of RLHF-trained AIs learning to match the political beliefs of their current user [1].
1
0
5
@alxndrdavies
Xander Davies
2 years
[1] [2] [3] [4] [5]
2
0
5
@alxndrdavies
Xander Davies
1 year
Both failures call for 𝑠𝑎𝑓𝑒𝑡𝑦-𝑐𝑎𝑝𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑝𝑎𝑟𝑖𝑡𝑦, where safety keeps up with capabilities progress. If it doesn't, we might be able to pit a capabilities objective against a safety one, or find settings where safety doesn't generalize but capabilities do!
1
0
5
@alxndrdavies
Xander Davies
2 years
Like previous models, GPT-4 is first trained to predict the next token (~word) in a large corpus of internet text. In a process known as Reinforcement Learning from Human Feedback, human evaluators then rate its outputs, training it to output text more likely to be well reviewed.
1
0
5
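As a concrete illustration of the "rate its outputs" step, here is the pairwise preference loss commonly used to train the reward model in RLHF. This is a generic sketch, not OpenAI's implementation, and the reward values are made-up placeholders; the policy is then optimized (e.g. with PPO) to score highly under the learned reward.

```python
import torch
import torch.nn.functional as F

# Reward-model training on human comparisons: for each pair, the response the
# human preferred should get a higher scalar reward than the rejected one.
# These tensors are made-up placeholders for reward-model outputs on a batch.
reward_chosen = torch.tensor([1.2, 0.3, 2.1])    # rewards for preferred responses
reward_rejected = torch.tensor([0.4, 0.9, 1.0])  # rewards for rejected responses

# Bradley-Terry style objective: maximise log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"preference loss: {loss.item():.3f}")
```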
@alxndrdavies
Xander Davies
3 years
Very happy to have people like @mattyglesias taking AI risk seriously and thinking about questions like mass communication—the small number of brilliant people thinking about these problems is currently way out of whack with the importance of getting advanced AI right.
1
0
5
@alxndrdavies
Xander Davies
2 years
It seems like the masking optional announcement on my flight should have included “if you have symptoms of an illness, please mask.” May be effective if paired with glares?
0
0
4
@alxndrdavies
Xander Davies
10 months
𝗰𝗵𝗶𝗲𝗳 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝘀𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝗼𝗳𝗳𝗶𝗰𝗲𝗿, building a cyber resilient AISI, incl. efforts to harden our systems and protect our people, information and technologies. we won't succeed without strong infosec! 7/9
Tweet media one
1
0
4
@alxndrdavies
Xander Davies
2 years
GMG (2): Understanding how and why AI systems generalize has long been a core problem in AI [3]. Unfortunately, we'll likely be relying on assumptions about this behavior in high stakes settings—and more goal-oriented AIs [4] may misgeneralize in coherent and dangerous ways [5].
1
0
4
@alxndrdavies
Xander Davies
10 months
Top 3 concerns are job loss, loss of human creativity / problem-solving skills, and loss of control to AI. The concern around human creativity ('human de-skilling') is esp surprising to me, and is something @RosenzweigJane and others have been doing great thinking on! 5/8
Tweet media one
2
1
4
@alxndrdavies
Xander Davies
1 year
The second, "Are aligned neural networks adversarially aligned?", nicely ties red-teaming to the decade-old field of 𝑎𝑑𝑣𝑒𝑟𝑠𝑎𝑟𝑖𝑎𝑙 𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠 (AdvExs),
1
0
4
@alxndrdavies
Xander Davies
2 years
really liked the @erik_davis quote that concludes this great @ezraklein piece: "In the court of the mind, skepticism makes a great grand vizier, but a lousy lord.”
0
0
3
@alxndrdavies
Xander Davies
2 years
@leopoldasch Are you saying you think pausing progress right now is a bad idea? There's a difference between "was this the best time to make this ask?" and "would this proposal being followed be better than business-as-usual?"
0
0
3
@alxndrdavies
Xander Davies
1 year
"Now excuse me while I engage in some light trolling."
Tweet media one
0
0
3
@alxndrdavies
Xander Davies
9 months
Between late April and early July of 1944 alone, 320 thousand Hungarian Jews were deported to Auschwitz and murdered in gas chambers. The Jewish population fell by more than a third during the Holocaust and is still lower than in 1939.
1
0
3
@alxndrdavies
Xander Davies
9 months
When Red Army soldiers arrived, most of the prisoners had been forced on a death march to other camps—somewhere between 9-15k of them would die during the march alone. 7k remained at Auschwitz, mostly seriously ill middle-aged adults and young children who would die shortly after
Tweet media one
1
0
3
@alxndrdavies
Xander Davies
4 years
Please be safe @ELaserDavies
@Joyce_Karam
Joyce Karam
4 years
BREAKING : Pro Trump Radicals are attacking Journalists and Media Crews outside Capitol. Trump has incited against media & members of press since 2016. Footage of broken equipment:
97
771
1K
2
0
3
@alxndrdavies
Xander Davies
1 year
@elonmusk 's plan: build a maximally truth-seeking AI and "hopefully" this AI will be "unlikely to annihilate humans because we are an interesting part of the universe." I think it should be a priority to get to a point where we have *much* stronger safety guarantees than this!
@alx
ALX 🇺🇸
1 year
BREAKING: @ElonMusk discusses creating an alternative to OpenAI, TruthGPT, because it is being trained to be politically correct and to lie to people.
3K
18K
113K
1
2
3
@alxndrdavies
Xander Davies
10 months
this is how i feel about a lot of AI progress—tons of upside, plenty to mess up, let's try to be smart/responsible/ahead of the game
0
0
1
@alxndrdavies
Xander Davies
2 years
@GaryMarcus If we get stronger evidence of directions in activation space which seem to track the truth, this suggests models might be using a statement’s truth value (or something correlated with its truth value) to help with next token prediction.
0
0
3
@alxndrdavies
Xander Davies
10 months
People are most optimistic about AI’s impact in day-to-day tasks, healthcare, and preventing crime; and most worried about job opportunities and how fairly people are treated in society. 4/8
Tweet media one
1
0
3
@alxndrdavies
Xander Davies
9 months
"We made a covenant with them. They said, 'Promise me you will never let the world forget what you are seeing here'. Having seen a concentration camp, it had a bigger effect on me than anything I've ever seen or thought or done." -Rockie Blunt, US Army Infantryman. לא נשכח.
Tweet media one
0
0
3
@alxndrdavies
Xander Davies
1 year
As our systems get more powerful, it's going to be more and more important to ensure that a malicious user can't break safety measures. Very happy that both of these papers are making progress on that!
1
0
3
@alxndrdavies
Xander Davies
9 months
Dots appear in the location/year of that person's birth, and disappear in the year of their death.
1
0
3
@alxndrdavies
Xander Davies
8 months
5/6: If this work sounds exciting to you, we have roles open from entry-level to team-lead: . DMs open for questions! Pays unusually well for gov :).
1
0
1
@alxndrdavies
Xander Davies
1 year
@elonmusk I also think intentionally creating a more goal-directed system (here, maximally truth-seeking) is risky, since it may increase the drive to be manipulative or power-seeking (see the recent !).
0
0
1
@alxndrdavies
Xander Davies
2 years
Could we tell if the TikTok algorithm started optimizing for something other than total screen time (e.g., propagating fake news)? How?
2
0
2
@alxndrdavies
Xander Davies
9 months
One fifth of US 18-29 year olds think the Holocaust is a myth. In London, antisemitic hate crimes are up 13x; a few days ago, three ppl ~my age were attacked for being Jewish a few blocks from my apartment.
1
0
2
@alxndrdavies
Xander Davies
2 years
On my plane most people were unmasked. One person was coughing up a storm, and it seemed obvious to me that they should be masked, but I don’t feel like this norm has been set.
1
0
2
@alxndrdavies
Xander Davies
2 years
IO (2): As AIs become more powerful, evaluating model output will become more difficult—both from difficulty in evaluating more complicated outputs, and increased ability to engage in manipulative behavior (already present in current systems, e.g. [2]).
1
0
2
@alxndrdavies
Xander Davies
2 years
@MichaelTrazzi RLHF is key to GPT-4's performance on metrics like TruthfulQA, as the main alignment tool used to increase performance on "factuality, steerability, and refusing to go outside of guardrails" (per ).
Tweet media one
0
0
2
@alxndrdavies
Xander Davies
10 months
They also presented different AI scenarios (an application, potential benefits, risks), and collected pair-wise preferences (!). Very excited to see more work to bring those affected by AI into the convo ( @collect_intel !). See findings in section 8! 7/8
Tweet media one
1
0
2
@alxndrdavies
Xander Davies
3 years
@michaelmina_lab What if rapid tests remain positive after 10 days since symptom onset? End isolation/precautions as per CDC or not?
0
0
2
@alxndrdavies
Xander Davies
8 months
2/6: These roles are a chance to join a small technical team directly informing the uk gov's understanding of system safeguards (RLHF, content moderation classifiers, flagging suspicious users, ...), including through pre-deployment testing of frontier systems.
1
0
2
@alxndrdavies
Xander Davies
3 years
0
0
2
@alxndrdavies
Xander Davies
2 years
GMG: Of course, we ultimately want to use GPT-4 in settings different from those it's trained on. This introduces another problem: even if you've done a perfect job giving feedback (solved IO!) during training, we won't be sure that its good training performance will transfer.
1
0
2
@alxndrdavies
Xander Davies
2 years
How we infer objectives (propagating fake news?) from actions (videos served) is an active (and unsolved!) field of AI research.
0
0
2
@alxndrdavies
Xander Davies
9 months
Before the Red Army arrived, SS officers began murdering remaining inmates and attempting to cover up evidence of the extent of the killing. The Red Army would find 370k men's suits, 837k women's coats, and 7.7 tons of human hair in storage rooms. (photo of clothing at Dachau)
Tweet media one
1
0
2
@alxndrdavies
Xander Davies
8 months
4/6: This means doing SOTA work in prompt injection / data poisoning / ft-ing attacks / elicitation, so that our evals aren't wildly underestimating risk; it also means figuring out how to test safeguards rigorously—what access do we need? what open problems need to be solved?
1
0
1
@alxndrdavies
Xander Davies
1 year
@ArthurB @rickasaurus @MaxNadeau_ Yes! Though (naively applied), this finds causal "cuts" through the circuit to disable it, leaving lots of the rest of the circuit unchanged.
0
0
1
@alxndrdavies
Xander Davies
9 months
cool to see this logo!
Tweet media one
0
0
1
@alxndrdavies
Xander Davies
10 months
Bunch of other interesting findings, esp about data security / equity concerns and change btwn this and prev survey. Check out the full report: 8/8
0
0
1