Xander Davies Profile Banner
Xander Davies Profile
Xander Davies

@alxndrdavies

Followers: 1,196
Following: 556
Media: 36
Statuses: 284

technical staff @AISafetyInst PhD student w @yaringal at @OATML_Oxford prev @Harvard ()

London
Joined March 2020
@alxndrdavies
Xander Davies
1 year
New paper! We find that removing just 12 of the ~10K causal edges in GPT-2 reduces toxic generation by >25% (!). Max Li, @MaxNadeau_ , and I explore performing 𝐭𝐚𝐫𝐠𝐞𝐭𝐞𝐝 𝐚𝐛𝐥𝐚𝐭𝐢𝐨𝐧𝐬 in our #ICML2023 workshop paper. 🧵
Tweet media one
11
36
294
@alxndrdavies
Xander Davies
11 months
Proud to be a researcher at the UK's AI Safety Institute—what is it? 🧵 based on yesterday's introduction:
Tweet media one
6
34
237
@alxndrdavies
Xander Davies
10 months
The UK's AI Safety Institute is hiring. In my (biased) view, this is one of the best places to do AI research/engineering for the public good, with top talent & the resources / backing / access of gov. Super excited for even more technical ppl to join gov :) 🧵 on open roles! 1/9
9
83
230
@alxndrdavies
Xander Davies
1 year
Two recent papers on "red-teaming" LLMs have really impressed me. 🧵 on what they are and why I'm excited about them!
2
26
158
@alxndrdavies
Xander Davies
1 year
As AI systems become more powerful, it’s increasingly important to clearly communicate the insufficiencies of our safety techniques. New paper with (many!) wonderful coauthors!
@StephenLCasper
Cas (Stephen Casper)
1 year
New paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback We survey over 250 papers to review challenges with RLHF with a focus on large language models. Highlights in thread 🧵
Tweet media one
15
167
714
4
4
50
@alxndrdavies
Xander Davies
1 year
I don't see how to reconcile a discoverable & editable world model with a "stochastic parrot" view of language models... is the argument that models like the toy model studied in this paper aren't stochastic parrots, but LLMs are?
3
2
34
@alxndrdavies
Xander Davies
11 months
We are hiring! Including secondments. If there was ever a well-placed public option to work towards safe & beneficial AI, it is the safety institute.
0
4
21
@alxndrdavies
Xander Davies
10 months
We will be opening up more roles in the coming months, to support our work on assessing societal harms, biological and chemical capabilities, and AI & democracy. More info on those coming soon! Express interest:
@alxndrdavies
Xander Davies
10 months
The UK's AI Safety Institute is hiring. In my (biased) view, this is one of the best places to do AI research/engineering for the public good, with top talent & the resources / backing / access of gov. Super excited for even more technical ppl to join gov :) 🧵 on open roles! 1/9
9
83
230
3
1
21
@alxndrdavies
Xander Davies
2 years
@OpenAI released GPT-4 today. It's much more capable than previous models, and now crushes the bar exam, LSAT, and lots of AP exams. Unfortunately, there's been very little recent progress towards solving core safety concerns. 🧵
1
4
19
@alxndrdavies
Xander Davies
2 years
@thecrimson I helped start the Harvard AI Safety Team (HAIST) last spring, which was originally a reading group on relevant machine learning papers. We've since grown to over 35 members, and run programming for over 75 people a semester.
1
0
18
@alxndrdavies
Xander Davies
9 months
join us! listings with lots of role details now live
@saffronhuang
Saffron Huang
9 months
The UK AI Safety Institute is hiring more technical staff! I believe AISI is one of the best places to do ML research/eng for the public good. We’re having impact at the scale of government, while moving at the pace of a startup (what more could you ask for?)
Tweet media one
5
31
99
1
1
13
@alxndrdavies
Xander Davies
9 months
79 years ago today, Auschwitz was liberated. My first real coding project was animating the lives and sudden deaths of many of its >1 million victims, using data from the Central Database of Shoah Victims’ Names. Each dot is the life of a specific Jew murdered in Auschwitz.
1
1
16
@alxndrdavies
Xander Davies
8 months
Consider applying to UK AISI's technical Safeguard Analysis Team! Governments need a clear/SOTA understanding of how well safeguards work in frontier AI systems. Short 🧵 with some team info; deadline for this round 27/2. 1/6
Tweet media one
1
5
20
@alxndrdavies
Xander Davies
1 year
They describe two failure modes: (1) competing objectives, where different training objectives (like language modeling, instruction following, and safety) are pitted against each other, and (2) mismatched generalization, where capabilities generalize but safety measures do not.
Tweet media one
1
2
15
@alxndrdavies
Xander Davies
10 months
𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿, embedded in research teams across AISI to develop tools/methods for testing AI systems and/or pushing the frontier of understanding / mitigations—we expect REs to be deeply involved in all stages of research. 2/9
Tweet media one
1
1
14
@alxndrdavies
Xander Davies
1 year
And congrats to Nicholas Carlini, @srxzr , Christopher A. Choquette-Choo, Matthew Jagielski, @irena_gao , @anas_awadalla , @PangWeiKoh , @daphneipp , @katherine1ee , @florian_tramer , and Ludwig Schmidt for the second! More at .
0
1
14
@alxndrdavies
Xander Davies
11 months
I am lucky to work alongside such excellent researchers as @yaringal , @DavidSKrueger , Jade Leung, @ruchowdh (soon), and many others! + significant compute and funding.
Tweet media one
1
0
13
@alxndrdavies
Xander Davies
11 months
1) Mission. "Minimise surprise to the UK and humanity from rapid and unexpected advances in AI", by "developing the sociotechnical infrastructure needed to understand the risks of advanced AI and enable its governance."
3
0
13
@alxndrdavies
Xander Davies
11 months
2.1) Function 1. "Develop and conduct evaluations of advanced AI systems", including of dual-use capabilities, societal impacts, system safety & security, and loss of control. More info:
Tweet media one
Tweet media two
1
0
12
@alxndrdavies
Xander Davies
10 months
finally, a 𝗴𝗲𝗻𝗲𝗿𝗮𝗹 𝗶𝗻𝘁𝗲𝗿𝗲𝘀𝘁 𝗳𝗼𝗿𝗺 if you want to be involved in some other way! We'll be hiring for many more roles (e.g. RS) in the near future, fill this out now to be in on that. 8/9
1
1
11
@alxndrdavies
Xander Davies
11 months
We're well placed to do this—yesterday, leading companies agreed to work with Govts to conduct pre- and post-deployment testing of their next gen models.
Tweet media one
1
0
11
@alxndrdavies
Xander Davies
1 year
The first, "Jailbroken: How Does LLM Safety Training Fail?", is the best conceptual progress on jailbreaking—crafting natural language prompts which break safety measures—I've seen.
1
0
11
@alxndrdavies
Xander Davies
1 year
Meanwhile, we find that performance on non-toxic prompts is largely unaltered, and text generated in response to toxic prompts remains coherent. More details, comparisons to baselines, and an additional experimental setting in the paper!
0
1
11
@alxndrdavies
Xander Davies
10 months
4 roles working to build our evals platform:
- 𝘀𝗼𝗳𝘁𝘄𝗮𝗿𝗲 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿
- 𝗳𝗿𝗼𝗻𝘁𝗲𝗻𝗱 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗿
- 𝘂𝘅 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿
- 𝘂𝘅 𝗱𝗲𝘀𝗶𝗴𝗻𝗲𝗿
3/9
1
0
10
@alxndrdavies
Xander Davies
1 year
They then attack multimodal models, and find traditional computer vision attacks are 100% effective at finding adversarial image inputs in the models they evaluate!
Tweet media one
1
0
10
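For context on what a "traditional computer vision attack" means here, below is a minimal projected-gradient-descent (PGD) sketch, not the paper's code. `model` and `target_loss` are hypothetical placeholders; for a multimodal jailbreak, the loss would measure how close the model's text output is to some target completion.

```python
import torch

def pgd_attack(model, image, target_loss, eps=8 / 255, alpha=1 / 255, steps=40):
    """Return an adversarial image within an L-infinity ball of radius eps around `image`.

    `model` and `target_loss` are hypothetical placeholders for the attacked
    model and a scalar loss that is low when the attack succeeds.
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = target_loss(model(adv))
        grad = torch.autograd.grad(loss, adv)[0]
        # Step in the direction that decreases the loss, then project back
        # into the allowed perturbation ball and the valid pixel range.
        adv = adv.detach() - alpha * grad.sign()
        adv = image + torch.clamp(adv - image, -eps, eps)
        adv = adv.clamp(0.0, 1.0)
    return adv
```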
@alxndrdavies
Xander Davies
11 months
It won't act as a regulator—it will "provide foundational insights to our governance regime and be a leading player in ensuring that the UK takes an evidence-based, proportionate response to regulating the risks of AI."
1
0
10
@alxndrdavies
Xander Davies
10 months
𝗹𝗼𝘀𝘀 𝗼𝗳 𝗰𝗼𝗻𝘁𝗿𝗼𝗹 𝗲𝘃𝗮𝗹𝘀 𝗹𝗲𝗮𝗱, building/leading a team focused on capabilities that are precursors to extreme harms from loss of control, e.g., autonomous replication/adaptation + uncontrolled self-improvement. 6/9
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
11 months
I am extremely grateful to @soundboy and many others ( @nitarshan , @Oliver_ilott , @HZoete ...) for all they have done to set this up, and to @RishiSunak and @michelledonelan for their global leadership
1
0
8
@alxndrdavies
Xander Davies
1 year
Combining two or three attacks inspired by these ideas reliably jailbreaks GPT-4 (94%) and Claude v1.3 (84%)—but the more striking takeaway (and why I think this is esp. important!) is that 𝐭𝐡𝐞𝐬𝐞 𝐟𝐚𝐢𝐥𝐮𝐫𝐞 𝐦𝐨𝐝𝐞𝐬 𝐦𝐢𝐠𝐡𝐭 𝐧𝐨𝐭 𝐠𝐨 𝐚𝐰𝐚𝐲 𝐰𝐢𝐭𝐡 𝐬𝐜𝐚𝐥𝐞.
1
1
9
@alxndrdavies
Xander Davies
11 months
It's also already been supported by Canada, Japan, Germany, Amazon, Anthropic, Google DeepMind, Inflection, Meta, Microsoft, OpenAI, Alan Turing Institute, Startup Coalition, techUK, and others!
1
0
8
@alxndrdavies
Xander Davies
2 years
When everyone is hyping up AI progress, it's good to be skeptical. But I worry it's easy and dangerous for that skepticism to grow into undue confidence that the hype is wrong.
1
0
8
@alxndrdavies
Xander Davies
11 months
2.2) Function 2. "Driving foundational AI safety research", since system evaluations alone are not sufficient to ensure safe & beneficial AI:
Tweet media one
Tweet media two
1
0
8
@alxndrdavies
Xander Davies
10 months
𝗰𝘆𝗯𝗲𝗿 𝗺𝗶𝘀𝘂𝘀𝗲 𝗲𝘃𝗮𝗹𝘀 𝗹𝗲𝗮𝗱, building/leading a team testing cyber capabilities of frontier systems, esp. studying potential uplift to novice actors; incl. threat modelling / cyber ranges / working closely with partners in & out of gov. 5/9
Tweet media one
1
1
8
@alxndrdavies
Xander Davies
11 months
3.1) International partnerships. Already, the Safety Institute has partnered with the new US AI Safety Institute and the Government of Singapore to collaborate on AI safety testing.
Tweet media one
Tweet media two
1
0
8
@alxndrdavies
Xander Davies
11 months
Much more in the full introduction!
1
0
8
@alxndrdavies
Xander Davies
1 year
But how do we figure out which edges to ablate? We automatically learn them by training a (continuous relaxation of a) binary edge mask to perform poorly on our negative examples, while maintaining performance on the training set. More in paper!
1
0
8
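A rough sketch of the "continuous relaxation of a binary edge mask" idea described above, under assumptions about the setup: each edge gets a learnable logit, the masked forward pass scales each edge between its true value (mask ≈ 1) and its mean activation (mask ≈ 0), and the objective pushes loss up on the negative (toxic) data while keeping it low on clean data. `model_forward_with_mask` and the batches are hypothetical placeholders, and the paper's exact loss terms may differ.

```python
import torch
import torch.nn.functional as F

def learn_edge_mask(model_forward_with_mask, clean_batch, toxic_batch,
                    num_edges=10_000, steps=200):
    """Learn a relaxed binary mask over edges: keep clean loss low, push toxic loss up."""
    edge_logits = torch.nn.Parameter(torch.full((num_edges,), 4.0))  # start ~"all edges on"
    opt = torch.optim.Adam([edge_logits], lr=1e-2)

    def lm_loss(batch, mask):
        # Language-modelling loss with every edge scaled by its mask entry
        # (1 = pass the true value, 0 = pass the edge's mean activation).
        logits = model_forward_with_mask(batch["input_ids"], mask)
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )

    for _ in range(steps):
        mask = torch.sigmoid(edge_logits)
        loss_clean = lm_loss(clean_batch, mask)   # keep this low
        loss_toxic = lm_loss(toxic_batch, mask)   # push this up
        sparsity = (1.0 - mask).mean()            # discourage turning off many edges
        loss = loss_clean - loss_toxic + 0.1 * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Threshold the relaxed mask to pick the small set of edges to ablate.
    return (torch.sigmoid(edge_logits) < 0.5).nonzero(as_tuple=True)[0]
```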
@alxndrdavies
Xander Davies
1 year
Very excited to see this! Note (alarmingly!) with current AI models, we don’t yet have a reliable way to find “An explanation for how it arrives at its responses.”
@SenSchumer
Chuck Schumer
1 year
Today, I’m launching a major new first-of-its-kind effort on AI and American innovation leadership.
2K
613
4K
1
0
8
@alxndrdavies
Xander Davies
2 years
Despite 6 months of effort, from what I can tell, very few novel safety techniques were used in training GPT-4.
1
2
6
@alxndrdavies
Xander Davies
3 years
As the nerd in question, I appreciate the assist @mattyglesias .
1
1
7
@alxndrdavies
Xander Davies
2 years
Now (as many of us transition back to mostly maskless) seems like a really good time to set the norm of masking in public when sick!
1
0
7
@alxndrdavies
Xander Davies
2 years
I worry that as systems get more powerful and undergo more RLHF, AI failures will become more subtle, while also becoming more problematic as the stakes get higher. That's why I'm not reassured by graphs like these—they don't address my core concerns.
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
10 months
streaming has been great for consumers but really rough for lots of musicians—if there is eventually a transition to AI-generated or assisted music, i hope we figure out how to make it benefit everyone
@suno_ai_
Suno
10 months
You can make great music, whether you're a shower singer or a charting artist. No instrument needed, just imagination. Make your song today at 🎧
263
335
2K
1
1
7
@alxndrdavies
Xander Davies
1 year
What's an ablation? We can think of a transformer as a computational graph, with a node for every attention head / MLP, and edges for information transfer between them. We remove a certain node-to-node dependency by setting the edge to always pass a certain value (eg the mean).
1
1
7
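To make the mechanic concrete, here is a minimal PyTorch sketch of a mean ablation. The paper ablates individual edges of the computational graph; for simplicity this sketch shows the coarser node-level version, replacing a whole module's output with a precomputed mean. `model`, `layer_idx`, and `mean_activation` are hypothetical placeholders.

```python
import torch

def make_mean_ablation_hook(mean_activation: torch.Tensor):
    """Return a forward hook that overwrites a module's output with a stored mean."""
    def hook(module, inputs, output):
        # Downstream nodes now see only the mean, never this node's true contribution.
        return mean_activation.expand_as(output)
    return hook

# Hypothetical usage with a Hugging Face GPT-2 model (MLP block of one layer):
# handle = model.transformer.h[layer_idx].mlp.register_forward_hook(
#     make_mean_ablation_hook(mean_activation)
# )
# ... run the ablated model on prompts, measure toxicity / perplexity ...
# handle.remove()
```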
@alxndrdavies
Xander Davies
2 years
I find the combination of rapidly advancing AI capabilities with disappointing progress on safety chilling. I've always loved AI, and I wish these concerns didn't have to taint the beautiful scientific event we're witnessing.
2
0
7
@alxndrdavies
Xander Davies
1 year
Nice to see the new open-source AI assistant has decided to comply with Asimov's laws:
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
10 months
Cool survey from @CDEIUK on public attitudes towards data & AI! Interviewed 4k members of UK public + 200 interviews w digitally excluded adults. A few of their AI findings: 🧵, 1/8
Tweet media one
1
0
7
@alxndrdavies
Xander Davies
2 years
RLHF has known limitations which make it widely considered an inadequate approach to producing models which reliably behave as intended. I'll talk about two such inadequacies: "incompetent overseer" (IO) problems and "goal mis-generalization" (GMG).
@percyliang
Percy Liang
2 years
RL from human feedback seems to be the main tool for alignment. Given reward hacking and the fallibility of humans, this strategy seems bound to produce agents that merely appear to be aligned, but are bad/wrong in subtle, inconspicuous ways. Is anyone else worried about this?
77
84
958
1
0
7
@alxndrdavies
Xander Davies
1 year
We apply this technique to reduce toxic generation in GPT-2 using prompts from the Politically Incorrect board of 4chan. Early results find removing just 12 of the node-to-node dependencies (in red!) reduces the Detoxify toxicity score from 45% to 33%!!
Tweet media one
1
0
6
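For reference, the Detoxify score quoted above comes from an off-the-shelf toxicity classifier. Below is a sketch of how such a number could be computed over a set of model generations; it is not the paper's exact evaluation pipeline, and `generations` is a hypothetical list of sampled continuations.

```python
from detoxify import Detoxify  # pip install detoxify

# Score each generated continuation with the Detoxify classifier and average.
scorer = Detoxify("original")

generations = [
    "Example model continuation one.",   # hypothetical model outputs
    "Example model continuation two.",
]

scores = [scorer.predict(text)["toxicity"] for text in generations]
mean_toxicity = sum(scores) / len(scores)
print(f"Mean Detoxify toxicity: {mean_toxicity:.1%}")
```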
@alxndrdavies
Xander Davies
10 months
𝘀𝗮𝗳𝗲𝗴𝘂𝗮𝗿𝗱 𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝗹𝗲𝗮𝗱, building/leading a team at intersection of ML & security to understand how well the safety/security components of frontier AI systems stand up to a range of threats (jailbreaking / data poisoning / etc). 4/9
Tweet media one
1
0
6
@alxndrdavies
Xander Davies
2 years
To their credit, @OpenAI is actively working on research relevant to reducing these risks. I'm also happy to see their collaboration with the Alignment Research Center. But it's also important to recognize when strong economic incentives may be at odds with safety concerns.
1
0
6
@alxndrdavies
Xander Davies
2 years
@GaryMarcus What are your thoughts on Burns et al (2022)? They find linear probes (and PCA on activations) do a decent job at truthfulness classification, and often perform better than zero-shot model outputs.
2
0
6
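For readers unfamiliar with the setup: a "linear probe on activations" just fits a linear classifier on a model's hidden states. A minimal sketch, with random arrays standing in for activations extracted from a language model over statements of known truth value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for hidden-state vectors and truth labels.
rng = np.random.default_rng(0)
acts_train = rng.normal(size=(500, 768))      # activations for 500 statements
labels_train = rng.integers(0, 2, size=500)   # 1 = true statement, 0 = false
acts_test = rng.normal(size=(100, 768))
labels_test = rng.integers(0, 2, size=100)

# Fit a linear probe and check how well truth can be read off linearly.
probe = LogisticRegression(max_iter=1000).fit(acts_train, labels_train)
print("Probe accuracy:", probe.score(acts_test, labels_test))
```

Burns et al. additionally show how to find such a direction without labels (contrast-consistent search); the labelled logistic probe above is the simplest version of the idea.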
@alxndrdavies
Xander Davies
11 months
2.3) Finally, function 3. "Facilitating information exchange", to mitigate "insight gaps" between industry, governments, academia, and the public. This might include incident reporting for harms & vulnerabilities, sharing usage data, and providing technical support to rest of gov
Tweet media one
1
0
6
@alxndrdavies
Xander Davies
10 months
For more info on AISI, check out the intro (), and feel very free to DM me w questions! 9/9
1
1
5
@alxndrdavies
Xander Davies
1 year
The paper (1) shows that current approaches to crafting text-based adversaries fail to find AdvExs in safety-trained models, but (2) demonstrates they also fail to find AdvExs guaranteed to be present, suggesting new stronger attacks are needed to evaluate robustness.
1
0
5
@alxndrdavies
Xander Davies
11 months
3.2) Other partnerships. The Safety Institute will also work closely with academia & civil society, the national security community, and industry.
Tweet media one
Tweet media two
1
0
5
@alxndrdavies
Xander Davies
1 year
It’s reasonable to demand that we figure out how to provide these explanations before AI can be deployed in high stakes settings!
1
0
5
@alxndrdavies
Xander Davies
1 year
Why do these attacks matter? Highly capable deployed models need to stay safe, even when interacting with adversarial users, and: "Without a solid foundation on understanding attacks, it is impossible to design robust defenses that withstand the test of time."
1
0
5
@alxndrdavies
Xander Davies
2 years
IO: Relying on human evaluations means we're susceptible to AI producing outputs which look good to evaluators, but are in fact flawed. There's already evidence of RLHF-trained AIs learning to match the political beliefs of their current user [1].
1
0
5
@alxndrdavies
Xander Davies
2 years
[1] [2] [3] [4] [5]
2
0
5
@alxndrdavies
Xander Davies
1 year
Both failures call for 𝑠𝑎𝑓𝑒𝑡𝑦-𝑐𝑎𝑝𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑝𝑎𝑟𝑖𝑡𝑦, where safety keeps up with capabilities progress. If it doesn't, we might be able to pit a capabilities objective against a safety one, or find settings where safety doesn't generalize but capabilities do!
1
0
5
@alxndrdavies
Xander Davies
2 years
Like previous models, GPT-4 is first trained to predict the next token (~word) in a large corpus of internet text. In a process known as Reinforcement Learning from Human Feedback, human evaluators then rate its outputs, training it to output text more likely to be well reviewed.
1
0
5
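As a concrete illustration of the "rate its outputs" step, here is the pairwise preference loss commonly used to train the reward model in RLHF. This is a generic sketch, not OpenAI's implementation, and the reward values are made-up placeholders; the policy is then optimized (e.g. with PPO) to score highly under the learned reward.

```python
import torch
import torch.nn.functional as F

# Reward-model training on human comparisons: for each pair, the response the
# human preferred should get a higher scalar reward than the rejected one.
# These tensors are made-up placeholders for reward-model outputs on a batch.
reward_chosen = torch.tensor([1.2, 0.3, 2.1])    # rewards for preferred responses
reward_rejected = torch.tensor([0.4, 0.9, 1.0])  # rewards for rejected responses

# Bradley-Terry style objective: maximise log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"preference loss: {loss.item():.3f}")
```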
@alxndrdavies
Xander Davies
3 years
Very happy to have people like @mattyglesias taking AI risk seriously and thinking about questions like mass communication—the small number of brilliant people thinking about these problems is currently way out of whack with the importance of getting advanced AI right.
1
0
5
@alxndrdavies
Xander Davies
2 years
It seems like the masking optional announcement on my flight should have included “if you have symptoms of an illness, please mask.” May be effective if paired with glares?
0
0
4
@alxndrdavies
Xander Davies
10 months
𝗰𝗵𝗶𝗲𝗳 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝘀𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝗼𝗳𝗳𝗶𝗰𝗲𝗿, building a cyber resilient AISI, incl. efforts to harden our systems and protect our people, information and technologies. we won't succeed without strong infosec! 7/9
Tweet media one
1
0
4
@alxndrdavies
Xander Davies
2 years
GMG (2): Understanding how and why AI systems generalize has long been a core problem in AI [3]. Unfortunately, we'll likely be relying on assumptions about this behavior in high stakes settings—and more goal-oriented AIs [4] may misgeneralize in coherent and dangerous ways [5].
1
0
4
@alxndrdavies
Xander Davies
10 months
Top 3 concerns are job loss, loss of human creativity / problem-solving skills, and loss of control to AI. The concern around human creativity ('human de-skilling') is esp surprising to me, and is something @RosenzweigJane and others have been doing great thinking on! 5/8
Tweet media one
2
1
4
@alxndrdavies
Xander Davies
1 year
The second, "Are aligned neural networks adversarially aligned?", nicely ties red-teaming to the decade-old field of 𝑎𝑑𝑣𝑒𝑟𝑠𝑎𝑟𝑖𝑎𝑙 𝑒𝑥𝑎𝑚𝑝𝑙𝑒𝑠 (AdvExs),
1
0
4
@alxndrdavies
Xander Davies
2 years
really liked the @erik_davis quote that concludes this great @ezraklein piece: "In the court of the mind, skepticism makes a great grand vizier, but a lousy lord.”
0
0
3
@alxndrdavies
Xander Davies
2 years
@leopoldasch Are you saying you think pausing progress right now is a bad idea? There's a difference between "was this the best time to make this ask?" and "would this proposal being followed be better than business-as-usual?"
0
0
3
@alxndrdavies
Xander Davies
1 year
"Now excuse me while I engage in some light trolling."
Tweet media one
0
0
3
@alxndrdavies
Xander Davies
9 months
Between late April and early July of 1944 alone, 320 thousand Hungarian Jews were deported to Auschwitz and murdered in gas chambers. The Jewish population fell by more than a third during the Holocaust and is still lower than in 1939.
1
0
3
@alxndrdavies
Xander Davies
9 months
When Red Army soldiers arrived, most of the prisoners had been forced on a death march to other camps—somewhere between 9-15k of them would die during the march alone. 7k remained at Auschwitz, mostly seriously ill middle-aged adults and young children who would die shortly after
Tweet media one
1
0
3
@alxndrdavies
Xander Davies
4 years
Please be safe @ELaserDavies
@Joyce_Karam
Joyce Karam
4 years
BREAKING : Pro Trump Radicals are attacking Journalists and Media Crews outside Capitol. Trump has incited against media & members of press since 2016. Footage of broken equipment:
97
771
1K
2
0
3
@alxndrdavies
Xander Davies
1 year
@elonmusk 's plan: build a maximally truth-seeking AI and "hopefully" this AI will be "unlikely to annihilate humans because we are an interesting part of the universe." I think it should be a priority to get to a point where we have *much* stronger safety guarantees than this!
@alx
ALX 🇺🇸
1 year
BREAKING: @ElonMusk discusses creating an alternative to OpenAI, TruthGPT, because it is being trained to be politically correct and to lie to people.
3K
18K
113K
1
2
3
@alxndrdavies
Xander Davies
10 months
this is how i feel about a lot of AI progress—tons of upside, plenty to mess up, let's try to be smart/responsible/ahead of the game
0
0
1
@alxndrdavies
Xander Davies
2 years
@GaryMarcus If we get stronger evidence of directions in activation space which seem to track the truth, this suggests models might be using a statement’s truth value (or something correlated with its truth value) to help with next token prediction.
0
0
3
@alxndrdavies
Xander Davies
10 months
People are most optimistic about AI’s impact in day-to-day tasks, healthcare, and preventing crime; and most worried about job opportunities and how fairly people are treated in society. 4/8
Tweet media one
1
0
3
@alxndrdavies
Xander Davies
9 months
"We made a covenant with them. They said, 'Promise me you will never let the world forget what you are seeing here'. Having seen a concentration camp, it had a bigger effect on me than anything I've ever seen or thought or done." -Rockie Blunt, US Army Infantryman. לא נשכח.
Tweet media one
0
0
3
@alxndrdavies
Xander Davies
1 year
As our systems get more powerful, it's going to be more and more important to ensure that a malicious user can't break safety measures. Very happy that both of these papers are making progress on that!
1
0
3
@alxndrdavies
Xander Davies
9 months
Dots appear in the location/year of that person's birth, and disappear in the year of their death.
1
0
3
@alxndrdavies
Xander Davies
8 months
5/6: If this work sounds exciting to you, we have roles open from entry-level to team-lead: . DMs open for questions! Pays unusually well for gov :).
1
0
1
@alxndrdavies
Xander Davies
1 year
@elonmusk I also think intentionally creating a more goal-directed system (here, maximally truth-seeking) is risky, since it may increase the drive to be manipulative or power-seeking (see the recent !).
0
0
1
@alxndrdavies
Xander Davies
2 years
Could we tell if the TikTok algorithm started optimizing for something other than total screen time (e.g., propagating fake news)? How?
2
0
2
@alxndrdavies
Xander Davies
9 months
One fifth of US 18-29 year olds think the Holocaust is a myth. In London, antisemitic hate crimes are up 13x; a few days ago, three ppl ~my age were attacked for being Jewish a few blocks from my apartment.
1
0
2
@alxndrdavies
Xander Davies
2 years
On my plane most people were unmasked. One person was coughing up a storm, and it seemed obvious to me that they should be masked, but I don’t feel like this norm has been set.
1
0
2
@alxndrdavies
Xander Davies
2 years
IO (2): As AIs become more powerful, evaluating model output will become more difficult—both from difficulty in evaluating more complicated outputs, and increased ability to engage in manipulative behavior (already present in current systems, e.g. [2]).
1
0
2
@alxndrdavies
Xander Davies
2 years
@MichaelTrazzi RLHF is key to GPT-4's performance on metrics like TruthfulQA, as the main alignment tool used to increase performance on "factuality, steerability, and refusing to go outside of guardrails" (per ).
Tweet media one
0
0
2
@alxndrdavies
Xander Davies
10 months
They also presented different AI scenarios (an application, potential benefits, risks), and collected pair-wise preferences (!). Very excited to see more work to bring those affected by AI into the convo ( @collect_intel !). See findings in section 8! 7/8
Tweet media one
1
0
2
@alxndrdavies
Xander Davies
3 years
@michaelmina_lab What if rapid tests remain positive after 10 days since symptom onset? End isolation/precautions as per CDC or not?
0
0
2
@alxndrdavies
Xander Davies
8 months
2/6: These roles are a chance to join a small technical team directly informing the uk gov's understanding of system safeguards (RLHF, content moderation classifiers, flagging suspicious users, ...), including through pre-deployment testing of frontier systems.
1
0
2
@alxndrdavies
Xander Davies
3 years
0
0
2
@alxndrdavies
Xander Davies
2 years
GMG: Of course, we ultimately want to use GPT-4 in settings different from those it's trained on. This introduces another problem: even if you've done a perfect job giving feedback (solved IO!) during training, we won't be sure that its good training performance will transfer.
1
0
2
@alxndrdavies
Xander Davies
2 years
How we infer objectives (propagating fake news?) from actions (videos served) is an active (and unsolved!) field of AI research.
0
0
2
@alxndrdavies
Xander Davies
9 months
Before the Red Army arrived, SS officers began murdering remaining inmates and attempting to cover up evidence of the extent of the killing. The Red Army would find 370k men's suits, 837k women's coats, and 7.7 tons of human hair in storage rooms. (photo of clothing at Dachau)
Tweet media one
1
0
2
@alxndrdavies
Xander Davies
8 months
4/6: This means doing SOTA work in prompt injection / data poisoning / ft-ing attacks / elicitation, so that our evals aren't wildly underestimating risk; it also means figuring out how to test safeguards rigorously—what access do we need? what open problems need to be solved?
1
0
1
@alxndrdavies
Xander Davies
1 year
@ArthurB @rickasaurus @MaxNadeau_ Yes! Though (naively applied), this finds causal "cuts" through the circuit to disable it, leaving lots of the rest of the circuit unchanged.
0
0
1
@alxndrdavies
Xander Davies
9 months
cool to see this logo!
Tweet media one
0
0
1
@alxndrdavies
Xander Davies
10 months
Bunch of other interesting findings, esp about data security / equity concerns and change btwn this and prev survey. Check out the full report: 8/8
0
0
1