FAR AI

@farairesearch

Followers 3,004
Following 20
Media 125
Statuses 281

Ensuring AI systems are trustworthy and beneficial to society by incubating new AI safety research agendas.

Berkeley, California
Joined February 2023
Pinned Tweet
@farairesearch
FAR AI
2 months
🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We tested three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 🧵
7
46
196
@farairesearch
FAR AI
2 months
💗🗣 How does translating the Korean word "jeong" (정) illustrate the challenge of AI alignment? 🤖🎯 Been Kim discusses alignment and interpretability as part of the New Orleans Alignment Workshop hosted by FAR AI.
47
190
1K
@farairesearch
FAR AI
2 months
🤔 👾 Could we instill AI agents with Bayesian reasoning capabilities? 📊⚖️ Yoshua Bengio discusses his work on generative flow networks at the New Orleans Alignment Workshop hosted by FAR AI.
16
128
916
@farairesearch
FAR AI
2 months
💯 🦺 Could we have “provably safe AI”, and what would this imply for tech policy? 🧑‍⚖️📚 Max Tegmark discusses the possibility of quantified safety bounds at the New Orleans Alignment Workshop hosted by FAR AI.
29
84
633
@farairesearch
FAR AI
1 year
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
88
468
@farairesearch
FAR AI
30 days
🤖❓How could an AI agent really know what we mean without a good model of how we think? 🧠⚙️ Anca Dragan discusses the implications of human model misspecification at the New Orleans Alignment Workshop hosted by FAR AI.
14
64
368
@farairesearch
FAR AI
5 months
Leading global AI scientists met in Beijing for the second International Dialogue on AI Safety (IDAIS), a project of FAR AI. Attendees, including Turing Award winners Bengio, Yao & Hinton, called for red lines in AI development to prevent catastrophic and existential risks from AI.
Tweet media one
3
34
202
@farairesearch
FAR AI
29 days
✈️😀 Which technologies might we have lost if a worse alternative had won regulatory capture? 🦅😕 Stuart Russell makes the case for solving technical problems over forcing through inferior AI systems at the Vienna Alignment Workshop hosted by FAR AI.
7
22
177
@farairesearch
FAR AI
28 days
Do neural networks dream of internal goals? We confirm RNNs trained to play Sokoban with RL learn to plan. Our black-box analysis reveals novel behaviors such as agents “pacing” to gain thinking time. We open-source the RNNs as model organisms for interpretability research.
2
41
151
@farairesearch
FAR AI
1 year
Existing “superhuman” Go AIs have a hidden weakness—they don’t understand circles. If you get the AI to make a circle shape, it thinks the shape is invulnerable and won’t defend it even though it can be killed. Here’s KataGo (the strongest OSS Go AI) making a circle as black.
Tweet media one
2
33
140
@farairesearch
FAR AI
8 months
New GPT-4 APIs introduce new vulnerabilities. The fine-tuning API can be exploited to remove model safeguards, the function call API can be abused to execute arbitrary function calls, and the knowledge retrieval API can be used to hijack the model via uploaded documents. 🧵
Tweet media one
1
13
56
@farairesearch
FAR AI
10 months
Prominent AI researchers from the West and East, including Turing recipients Yoshua Bengio 🇨🇦 & Andrew Yao 🇨🇳, called for global action on AI safety and governance to prevent uncontrolled frontier model development posing unacceptable risks to humanity. 🧵
2
17
56
@farairesearch
FAR AI
3 months
🛡️State-of-the-art ML systems lack quantitative performance guarantees, limiting use in high-stakes domains. Towards Guaranteed Safe AI presents a framework for high-assurance safety in complex environments using a Safety Specification that is Verified against a World Model.
Tweet media one
1
12
54
@farairesearch
FAR AI
8 months
🎥 As we embrace the holiday season, we're excited to share a special announcement: The NOLA Alignment Workshop videos are now live! Warm up your winter with insights from leading #AIAlignment researchers at . Happy Holidays! 📷❄️
Tweet media one
5
9
39
@farairesearch
FAR AI
1 year
Because KataGo doesn’t realize its circle can be killed, an adversary AI we trained can slowly smother the circle from the inside and outside, and all of KataGo’s stones marked with an ❌ eventually die.
2
6
39
@farairesearch
FAR AI
1 year
This cyclic-exploit is simple enough to be used by humans. Our teammate @KellinPelrine made the news after using the technique to beat what were previously considered strongly superhuman systems, and others have since followed in his footsteps.
1
8
37
@farairesearch
FAR AI
1 month
It's a wrap! Huge thanks to all our speakers and attendees for making the Vienna Alignment Workshop 2024 an amazing success! Stay tuned for videos from the event coming soon.
Tweet media one
1
2
36
@farairesearch
FAR AI
8 months
🎉 Reflecting on a fantastic #NeurIPS2023 #AIAlignment Workshop! 🚀 🙌 149 attendees energized the main event 🌃 500+ at our Monday social 🧠 12 talks, 25 lightning talks 🔑 Keynote by Yoshua Bengio 🤔 What inspired you the most? Share your thoughts!
Tweet media one
2
1
36
@farairesearch
FAR AI
1 year
@KellinPelrine @lightvector1 Our key takeaway from all of this remains the same as before:
@ARGleave
Adam Gleave
2 years
Our key takeaway is that even AI systems that match or surpass human-level performance in common cases can have surprising failure modes quite unlike humans. We'd recommend broader use of adversarial testing to find these failure modes, especially in safety-critical systems.
2
21
122
1
3
29
@farairesearch
FAR AI
1 year
@KellinPelrine We discovered this exploit by training adversary AIs to beat the supposedly superhuman KataGo AI. Our adversaries won 97% of games against KataGo at “superhuman” settings. Crucially, our adversaries didn’t learn to play Go well, instead winning entirely via the cyclic-exploit.
Tweet media one
1
3
29
@farairesearch
FAR AI
7 months
🎉 They're live! Dive into #AIAlignment at the #AlignmentWorkshop with videos now on YouTube & our site, all with captions & transcripts. 📺 For more insights, check out our blog post. ✨Links below 🔗👇Be inspired, engage, and share your favorite insights!
Tweet media one
Tweet media two
Tweet media three
1
6
27
@farairesearch
FAR AI
1 year
@KellinPelrine Unlike in vanilla AlphaZero, our adversary has an internal copy of its victim which it uses to simulate the victim when considering possible sequences of play.
Tweet media one
1
2
26
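The search modification described above can be sketched as a rollout in which the adversary models its opponent with a frozen copy of the victim's policy rather than with self-play. Everything below (the toy game, the deterministic policies) is hypothetical and only illustrates the search structure, not the actual KataGo-scale implementation:

```python
class ToyState:
    """Minimal two-player state: players alternately add +1 or +2 to a
    running total; the adversary 'wins' if the final total is even."""
    def __init__(self, total=0, to_move="adversary"):
        self.total = total
        self.to_move = to_move

    def is_terminal(self):
        return self.total >= 6

    def play(self, move):
        nxt = "victim" if self.to_move == "adversary" else "adversary"
        return ToyState(self.total + move, nxt)

    def value_for_adversary(self):
        return 1.0 if self.total % 2 == 0 else 0.0


def victim_aware_rollout(state, adversary_policy, victim_policy, depth=10):
    """Simulate a line of play. Key difference from vanilla AlphaZero
    self-play: opponent moves come from a frozen copy of the victim's
    policy, not from the adversary's own network."""
    for _ in range(depth):
        if state.is_terminal():
            break
        policy = adversary_policy if state.to_move == "adversary" else victim_policy
        state = state.play(policy(state))
    return state.value_for_adversary()


# Deterministic toy policies standing in for neural networks.
adversary = lambda s: 2   # always plays +2
victim = lambda s: 1      # always plays +1

print(victim_aware_rollout(ToyState(), adversary, victim))  # 1.0
```

Because the adversary evaluates lines of play against the victim's actual move distribution, it can favor positions the victim misjudges, even ones a strong player would avoid.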
@farairesearch
FAR AI
10 months
We're excited to announce the v1 release of imitation, an open-source reward learning library developed with @CHAI_Berkeley. imitation provides experimental baselines for reward learning and an easy-to-modify implementation for reward learning research.
1
6
25
@farairesearch
FAR AI
6 months
📣 FAR AI is Expanding! 🚀 Seeking results-driven & pioneering individuals: - Engineering Manager: Innovate & lead our engineering team to new frontiers. - Technical Lead: Guide, execute & transform our technical AI safety projects. Join us to shape the future of AI Safety!
Tweet media one
1
8
23
@farairesearch
FAR AI
8 months
🚨 We're hiring for a Tech Lead to spearhead delivery of our AI safety research, and an Engineering Manager to lead & scale our technical team.
Tweet media one
1
7
21
@farairesearch
FAR AI
9 months
Connect with the #AIAlignment community at #NeurIPS2023! Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Please help spread the word and share in your network! 🌟
Tweet media one
0
8
22
@farairesearch
FAR AI
3 months
What do AI safety experts believe about the future of AI? 🤖 How might things go wrong, what should we do, and how are we doing so far? We conducted 17 semi-structured interviews with AI safety experts to find out. 🎙️ See 🧵 for results 👇
Tweet media one
1
5
22
@farairesearch
FAR AI
1 year
@KellinPelrine @lightvector1 However, we show this defense is incomplete—re-attacking KataGo yields adversaries that are still able to win via the cyclic exploit. So defense is still an open question.
Tweet media one
1
3
21
@farairesearch
FAR AI
5 months
Western and Chinese AI scientists and governance experts collaborated to produce a statement outlining red lines in AI development, and a roadmap to ensure those lines are never crossed. You can read the full statement on the IDAIS website:
Tweet media one
2
1
21
@farairesearch
FAR AI
6 months
🚀 @jesse_hoogland 's talk at FAR Labs revealed that transformers progress through discrete, interpretable stages, each marked by unique behavioral & structural traits. This insight marks a step forward in comprehending the developmental learning processes of neural networks. ✨
@jesse_hoogland
Jesse Hoogland
7 months
1/8 How do transformers learn? In our new work, we find that transformers develop in-context learning in discrete stages that can be automatically discovered. 🧵 Joint work w/ @georgeyw_ , Matthew Farrugia-Roberts, @lemmykc , Susan Wei, @danielmurfet
Tweet media one
3
84
421
1
4
16
@farairesearch
FAR AI
9 months
🚀🔍 What’s new at FAR AI? We’ve grown to 12 staff, published 13 papers, launched the FAR Labs coworking space, & hosted 160+ ML researchers at our events. Focused on #AIsafety, we're hiring and open to collaborations!
Tweet media one
0
6
19
@farairesearch
FAR AI
2 months
Connect with the AI Alignment community before #ICML2024! Join us Sunday, July 21 at the Austria Center Vienna (ACV), 19:00-22:00 for the Alignment Workshop: Open Social event. 🤖💬
2
5
18
@farairesearch
FAR AI
1 year
@KellinPelrine After publishing v1 of our work late last year, the creator of KataGo @lightvector1 took notice and started to slowly teach KataGo to understand circles. Over the next 6 months, KataGo gradually became immune to our published adversaries.
1
1
17
@farairesearch
FAR AI
29 days
Frontier LLMs like ChatGPT are powerful but vulnerable to attack. Scale helps with many things, so we wanted to see if scaling up the model size can "solve" robustness issues. Spoiler: It's complicated!
1
5
16
@farairesearch
FAR AI
10 months
Attending #NeurIPS2023? Join us Dec 11 at Le Meridien New Orleans, 7:30 pm for the Alignment Workshop: Open Social event! 🤖💬 Just a stone's throw from the convention center. RSVP optional but a quick sign-up helps us plan. See you there!
Tweet media one
2
6
15
@farairesearch
FAR AI
2 months
🗡️ Last year, we found superhuman Go AIs are vulnerable to “cyclic attacks”. This adversarial strategy was discovered by AI but replicable by humans. Below @KellinPelrine (⚪) gives the superhuman AI KataGo (⚫) a 9-stone handicap but still wins. See
@farairesearch
FAR AI
1 year
This is Lee Sedol in 2016 playing against AlphaGo. Despite a valiant effort, Lee lost. The AI was just too powerful. But, had Lee known about our ICML 2023 paper, Adversarial Policies Beat Superhuman Go AIs, things might have turned out differently! 🧵
Tweet media one
8
88
468
1
2
16
@farairesearch
FAR AI
9 months
💡🔬FAR AI #AIAlignment Research Update! We’re exploring AI robustness, value alignment, & model evaluation. We’ve made strides in adversarial attacks for superhuman systems, mechanistic interpretability, scaling trends & more!
Tweet media one
2
6
15
@farairesearch
FAR AI
5 months
This event was a collaboration between the Safe AI Forum (SAIF) and the Beijing Academy of AI (BAAI). SAIF is a new organization fiscally sponsored by FAR AI focused on reducing risks from AI by fostering coordination on international AI safety:
1
1
14
@farairesearch
FAR AI
4 months
🎯 Yoshua Bengio at the FAR Labs Seminar explores designing aligned and provably safe AI using model-based Bayesian machine learning.🎬🔗👇
Tweet media one
1
4
14
@farairesearch
FAR AI
3 months
ICYMI: Here are highlights from our previous research on "Adversarial Policies Beat Superhuman Go AIs." We found that even seemingly superhuman AIs are still vulnerable to attacks. Stay tuned for new results coming soon! 🔗👇
1
5
12
@farairesearch
FAR AI
5 months
ICYMI: Check out our blog 'Evaluating Moral Beliefs in LLMs', based on our study that scrutinizes AI's ethical decisions. Uncover how 28 LLMs handle 1,400 moral dilemmas, offering insights into AI’s moral compass. 🔗👇
Tweet media one
1
4
12
@farairesearch
FAR AI
11 months
Encouraging to see @EU_Commission taking AI risk seriously. By combining sensible regulation with safety research like our work at FAR, we can ensure that future AI systems benefit humanity.
@EU_Commission
European Commission
11 months
Mitigating the risk of extinction from AI should be a global priority. And Europe should lead the way, building a new global AI framework built on three pillars: guardrails, governance and guiding innovation ↓
Tweet media one
429
479
2K
0
0
12
@farairesearch
FAR AI
1 year
@KellinPelrine To train our adversary, we developed an adversarial variant of the AlphaZero algorithm. Like in vanilla AlphaZero, our adversary searches over possible future scenarios to find the best move.
1
1
12
@farairesearch
FAR AI
8 months
Thanks Shane, we were delighted to host the #AIAlignmentWorkshop and it was great to see so many people interested in alignment! Stay tuned for talk recordings and other content from the workshop.
@ShaneLegg
Shane Legg
8 months
Huge congrats to the organisers of the #AIAlignment Workshop at #NeurIPS2023 After being a niche community for years, it’s now like a regular academic workshop with famous professors, lots of junior professors & their students, and people in industry. And some outstanding talks!
2
7
110
0
3
11
@farairesearch
FAR AI
10 months
Codebook Features make language models more interpretable and controllable, with minimal performance loss! Our method turns complex vectors into discrete codes, providing a potential path toward safer and more reliable machine learning systems.
@AlexTamkin
Alex Tamkin
10 months
Codebook Features: Sparse and Discrete Interpretability for Neural Networks We learn discrete on/off features inside of language models using vector quantization These features are more interpretable than neurons and can be used to steer the network’s behavior! 1/
Tweet media one
2
30
158
0
3
11
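The quantization step behind codebook features can be illustrated with a toy sketch: snap each activation vector onto its most similar codebook entries, so the model's internal state becomes a set of discrete on/off codes. The codebook size, dimensions, and top-k choice below are made up for illustration; the actual method learns the codebook jointly with the model:

```python
import numpy as np

def quantize(activations, codebook, k=1):
    """Replace each activation vector with the sum of its top-k most
    similar codebook entries (dot-product similarity). Returns the
    quantized vectors and the discrete code indices."""
    sims = activations @ codebook.T                  # (batch, n_codes)
    top = np.argsort(-sims, axis=-1)[:, :k]          # indices of top-k codes
    quantized = codebook[top].sum(axis=1)            # (batch, dim)
    return quantized, top

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 discrete codes, dimension 8
acts = rng.normal(size=(4, 8))        # a batch of 4 activation vectors
q, codes = quantize(acts, codebook)
print(q.shape, codes.shape)           # (4, 8) (4, 1)
```

The discrete `codes` are what make the representation interpretable: each index can be inspected, labeled, and toggled to steer behavior, unlike a dense continuous vector.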
@farairesearch
FAR AI
25 days
🤔 Think your transformer circuit is robust? Think again! New paper finds that existing circuits in the #MechInterp literature may not be as faithful as reported. Congrats to @JosephMiller_ & team on their acceptance to @COLM_conf 2024! 🔗👇
@JosephMiller_
Joseph Miller
1 month
1/ When you find a circuit in a language model, how do you test if it does what you think? Just accepted to COLM 2024, our new paper ( @bilalchughtai_ and William Saunders), investigates this question and finds a number of common pitfalls. 🧵
Tweet media one
3
5
33
1
1
10
@farairesearch
FAR AI
5 months
📣 FAR AI is Hiring! 🚀 Seeking passionate & detail-oriented individuals for Head of Events (Safe AI Forum): Lead, communicate & connect global AI safety community. Join us to shape the future of AI through events like @ais_dialogues ! 🔗👇
Tweet media one
1
2
10
@farairesearch
FAR AI
6 months
🌟 @ghadfield 's session on AI Governance was a game-changer! 🏛️💡 She tackled the myth of AI's inevitable growth, highlighting the need for strategic regulation and a national AI registry. A thought-provoking approach to shaping AI's future responsibly! ⚖️🤖🔗👇
Tweet media one
1
5
10
@farairesearch
FAR AI
1 year
@KellinPelrine @lightvector1 This work was done by the fantastic team of @5kovt , @ARGleave , @KellinPelrine , @tomhmtseng , @norabelrose , Joseph Miller, @MichaelD1729 , @yawen_duan , Viktor Pogrebniak, @svlevine , and Stuart Russell, with support from @CHAI_Berkeley .
0
1
10
@farairesearch
FAR AI
1 month
Check out #ICML2024 posters by @MATSprogram scholars mentored by @AdriGarriga ! July 26: NextGen AI Safety 💥Catastrophic Goodhart July 27: Mechanistic Interpretability 🔬InterpBench 🔥Adversarial Circuit Evaluation 🐍Indirect Object Identification Circuit in Mamba
0
3
9
@farairesearch
FAR AI
1 year
In new work from FAR, @jeremy_scheurer et al introduce an algorithm to efficiently learn from large quantities of language feedback. This outperforms supervised fine-tuning on human demonstrations in summarization and code generation.
@jeremy_scheurer
Jérémy Scheurer
1 year
In 2 new papers, we show that LLMs effectively learn from large quantities of feedback expressed in language. We present an algorithm for Imitation learning from Language Feedback (ILF) and show how it beats finetuning on human demonstrations for summarization and code generation
Tweet media one
1
29
123
0
1
9
@farairesearch
FAR AI
9 months
🌟🌐🤔 #NeurIPS2023 Spotlight Poster: Unravel the mystery of AI morality! Don’t miss our session on "Evaluating Moral Beliefs in LLMs" on Dec 13, 10:45 AM CST, poster #1523. Insights from a study on 28 #LLMs by @ninoscherrer, @causalclaudia & team.
@ninoscherrer
Nino Scherrer
1 year
How do LLMs from different organizations compare in morally ambiguous scenarios? Do LLMs exhibit common-sense reasoning in morally unambiguous scenarios? 📄 👨‍👩‍👧‍👦 @causalclaudia @amirfeder @blei_lab @farairesearch A thread: 🧵[1/N]
Tweet media one
2
38
116
0
2
9
@farairesearch
FAR AI
10 months
We're proud to present this interactive explainer on the rate of recent AI progress and the associated risks. Developed in collaboration with @sage_future_
@sage_future_
Sage
10 months
We asked @OpenAI models from GPT-2 to GPT-4 the same questions: here’s what they said 🧵 Interactively explore real AI outputs to learn: 1. How fast is AI improving? 2. How predictable is AI progress? 3. What dangers are on the horizon?
Tweet media one
3
10
68
0
3
9
@farairesearch
FAR AI
1 month
Calling #ICML2024 attendees—Don’t miss the Vienna Alignment Workshop: Open Social event! 📅 Sunday, July 21st 🕖 19:00-22:00 📍 Austria Center Vienna (ACV) RSVP (optional): Bring a friend and align your evening plans!
@farairesearch
FAR AI
2 months
Connect with the AI Alignment community before #ICML2024! Join us Sunday, July 21 at the Austria Center Vienna (ACV), 19:00-22:00 for the Alignment Workshop: Open Social event. 🤖💬
2
5
18
0
2
8
@farairesearch
FAR AI
5 months
Anthony diGiovanni from @LongTermRisk presented at FAR Labs on Safe Pareto Improvements (SPIs) for AGI bargaining. 🤝He highlighted that transparency doesn't ensure conflict avoidance. ☮️SPIs offer a path to mitigate high-stakes AGI conflicts, given credible implementation.🔑
Tweet media one
2
2
8
@farairesearch
FAR AI
10 months
"Persona modulation" emerges as an automated jailbreaking tactic in a new study by @soroushjp and team, revealing a 42.5% success rate in eliciting harmful LLM outputs. The work calls for more stringent AI safety protocols.
@soroushjp
Soroush Pour
10 months
🧵📣New jailbreaks on SOTA LLMs. We introduce an automated, low-cost way to make transferable, black-box, plain-English jailbreaks for GPT-4, Claude-2, fine-tuned Llama. We elicit a variety of harmful text, incl. instructions for making meth & bombs.
Tweet media one
17
80
322
0
2
6
@farairesearch
FAR AI
6 months
Want to help ensure AI systems are trustworthy and beneficial to society? 🚀We're hiring! Share our open roles. 💸 Donate to our nonprofit. 🤝 Participate in the conversation. Share your thoughts on alignment research at an upcoming workshop. 🔗🧵👇
Tweet media one
1
2
8
@farairesearch
FAR AI
28 days
This is not due to the policy being suboptimal. If we give the RNN time to think at level start, it does not 'pace' anymore.
1
0
8
@farairesearch
FAR AI
2 months
🚫 Just say NO! For language models, it’s about finding the right "refusal" direction. The FAR paper reading group learned that erasing this direction stops refusals, while adding it forces them. This affects 13 models, showing the brittleness of current safety fine-tuning. 📚👀
@littlefish3625
Andy Arditi
2 months
Our paper on refusal in LLMs is finally up on arXiv.
Tweet media one
14
43
357
0
2
8
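The "refusal direction" intervention described above amounts to simple linear algebra on activations: project the direction out to suppress refusals, or add it to force them. The vectors below are random stand-ins, not real model activations; the actual work operates on transformer residual streams:

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation along `direction`
    (the 'erase refusals' intervention)."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

def add_direction(activations, direction, scale=1.0):
    """Push activations along `direction` (the 'force refusals' intervention)."""
    d = direction / np.linalg.norm(direction)
    return activations + scale * d

rng = np.random.default_rng(0)
refusal_dir = rng.normal(size=16)       # hypothetical refusal direction
acts = rng.normal(size=(3, 16))         # hypothetical activations
erased = ablate_direction(acts, refusal_dir)

# After ablation, activations have zero component along the direction.
unit = refusal_dir / np.linalg.norm(refusal_dir)
print(np.allclose(erased @ unit, 0))    # True
```

That a single rank-one edit can flip refusal behavior across many models is what the tweet means by the brittleness of current safety fine-tuning.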
@farairesearch
FAR AI
3 months
Multiple research agendas have converged towards the use of world models, safety specifications, and verification to produce quantifiable safety guarantees. This framework unifies these approaches, placing them on a continuum from minimally (left) to maximally (right) rigorous.
Tweet media one
1
1
8
@farairesearch
FAR AI
5 months
A new Science paper warns of the risks of long-term planning agents (LTPAs) deceiving humans. To mitigate potential threats, it advises against permitting the development of sufficiently capable LTPAs and recommends stringent controls over their resources. 🔗👇
Tweet media one
1
2
8
@farairesearch
FAR AI
1 year
Even state-of-the-art language models have "jailbreaks" that cause them to ignore the safety criteria of their designers in response to specific prompts. Think you can do better than OpenAI, Anthropic, et al.? Try to attack and defend models in this new game from @CHAI_Berkeley
@justinsvegliato
Justin Svegliato
1 year
Check out our online game #TensorTrust that we made to study #LLMs ! At , you have a bank account protected by #ChatGPT : you just tell the AI your password🔒 and a few security rules for when to grant access🏦
Tweet media one
8
39
92
0
3
8
@farairesearch
FAR AI
28 days
We replicate the "planning effect" in Sokoban of . To give the RNN extra “time to think” during test, we run it several times on the 1st observation of a level, advancing the recurrent state. This enables it to solve more levels.
Tweet media one
1
0
8
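The warm-up trick can be sketched with a toy vanilla RNN: feed the first observation repeatedly, advancing only the hidden state, before any action is taken. The network sizes and weights below are arbitrary stand-ins for the actual Sokoban policy network:

```python
import numpy as np

def step_rnn(h, x, Wh, Wx):
    """One step of a toy vanilla RNN (stand-in for the policy network)."""
    return np.tanh(Wh @ h + Wx @ x)

def warm_up(h0, first_obs, Wh, Wx, thinking_steps):
    """'Time to think': repeatedly process the same first observation,
    updating only the recurrent state, before the agent acts."""
    h = h0
    for _ in range(thinking_steps):
        h = step_rnn(h, first_obs, Wh, Wx)
    return h

rng = np.random.default_rng(0)
Wh = rng.normal(size=(8, 8)) * 0.1     # recurrent weights (arbitrary)
Wx = rng.normal(size=(8, 4)) * 0.1     # input weights (arbitrary)
h0, obs = np.zeros(8), rng.normal(size=4)

# After warm-up, the hidden state has moved away from its initialization,
# even though the environment has not advanced at all.
h_think = warm_up(h0, obs, Wh, Wx, thinking_steps=8)
print(h_think.shape)
```

The measured "planning effect" is then just performance with 8 warm-up steps minus performance with 0, holding the environment fixed.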
@farairesearch
FAR AI
4 months
📚 @joelbot3000 at the FAR Labs Seminar explores using AI recommender systems in a personalized & data-driven way to enhance human flourishing. Learn how a system, guided by the qualitative impact of books from the GoodReads dataset, can support personal growth. 🎥🔗👇
Tweet media one
2
0
8
@farairesearch
FAR AI
9 months
🗣️ Unreliable consultants can fool non-experts, but @_julianmichael_ shows debate helps judges discern the truth. @anshrad's work indicates #ReinforcementLearning enhances AI debaters & judges in #ScalableOversight for better decision-making.
Tweet media one
1
3
8
@farairesearch
FAR AI
10 months
Congratulations to our very own @AdriGarriga and his team for their work on Automatic Circuit DisCovery (ACDC) to speed up mechanistic interpretability!
@ArthurConmy
Arthur Conmy
10 months
⚡ACDC was accepted as a *spotlight* at NeurIPS 2023! 📜 Paper (updated today): With @MavorParker @aengus_lynch1 @sheimersheim @AdriGarriga
3
7
95
0
1
8
@farairesearch
FAR AI
27 days
Great speaking to everyone at the #ICML2024 MechInterp workshop! If you didn't get to catch us there, check out the 🧵 for the lowdown.
Tweet media one
1
2
8
@farairesearch
FAR AI
6 months
🔍Key insights from @EthanJPerez 's recent presentation at FAR Labs: It's crucial to understand the risks of deceptive alignment. The team's research suggests that, should sleeper agents emerge, they could pose substantial challenges. 💡
@AnthropicAI
Anthropic
7 months
New Anthropic Paper: Sleeper Agents. We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.
Tweet media one
126
579
3K
0
1
8
@farairesearch
FAR AI
10 days
@peterbarnett_ of @MIRI describes the potential of hardware-enabled mechanisms to provide verification and confidence to international coordination schemes for AI.
1
3
9
@farairesearch
FAR AI
28 days
But if the RNN learns to get more computation by pacing, why does it benefit from thinking time? It seems the RNN often executes greedy plans that lock the level, and thinking steps prevent that. This may be rational given the -0.1 penalty per step: a few steps spent thinking pay off if they avoid failing the level.
2
0
7
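The step-penalty argument can be checked with back-of-the-envelope arithmetic. Only the -0.1 per-step penalty comes from the thread; the solve reward and step counts below are assumed purely for illustration:

```python
STEP_PENALTY = -0.1
SOLVE_REWARD = 10.0   # assumed magnitude, for illustration only

def episode_return(steps_taken, solved):
    """Total return: per-step penalty plus solve bonus if the level is solved."""
    return steps_taken * STEP_PENALTY + (SOLVE_REWARD if solved else 0.0)

# Acting greedily from step 1: 20 steps, but the level gets locked, never solved.
greedy = episode_return(20, solved=False)       # -2.0
# Pacing for 3 extra thinking steps first, then solving in 20 acting steps.
paced = episode_return(23, solved=True)         # 7.7
print(greedy, paced)
```

Under these assumed numbers, paying 0.3 in extra step penalty to secure the solve bonus dominates, which is consistent with pacing being a rational learned behavior rather than a policy defect.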
@farairesearch
FAR AI
6 months
🌟 Fascinating talk by @OwainEvans_UK at #AIAlignmentWorkshop on Out-of-Context Reasoning in #LLMs. 🤖 He highlighted the challenges and limits in AI reasoning, even in advanced models like #GPT4. A crucial discussion for understanding AI's logical capabilities! 🧠
Tweet media one
2
1
8
@farairesearch
FAR AI
4 months
🌟🤖🧘‍♀️ #ICLR2024 Poster: VLM-RM leverages vision-language models to teach agents complex tasks through simple text prompts. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #141 for “Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning.”
@EthanJPerez
Ethan Perez
10 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
1
2
8
@farairesearch
FAR AI
2 months
@KellinPelrine 💡 While our results were mostly negative, there was one positive sign that we noticed: defending against any fixed static attack was quick and easy. We think it might be possible to leverage this property to build a working defense both in Go and other settings.
1
1
7
@farairesearch
FAR AI
2 months
📚👀🔐Secrets unlocked! At this week’s FAR paper reading group, we learned about password-locked models designed to stress-test capability elicitation. The big reveal? Fine-tuning and reinforcement learning can crack these models open, uncovering hidden talents. 🔗👇
Tweet media one
1
2
7
@farairesearch
FAR AI
28 days
Our RNN learns to plan in just 70M steps. After that, the % of levels solved continues to increase – but the planning effect (performance at 8 minus 0 steps of thinking) decreases for medium-difficulty levels, although it continues to increase for hard ones. Why is that?
Tweet media one
1
0
6
@farairesearch
FAR AI
1 month
🔍 @adamimos and @RiechersPaul at the FAR Labs Seminar describe their work with Simplex, bringing the computational mechanics paradigm to AI safety for predicting model behavior and internal structures.
1
3
7
@farairesearch
FAR AI
4 months
📊 Jason Gross @diagram_chaser unveils a new metric for AI interpretability at FAR Labs Seminar! He explores formal proof size as a key to understanding AI mechanisms 🎯, emphasizing the need for concise proofs for deeper insights. Challenges remain with unstructured noise. 🔗👇
Tweet media one
1
0
7
@farairesearch
FAR AI
10 months
🚀 If you're also interested in making AI systems safe and beneficial, we're hiring! Check out our roles at
@EthanJPerez
Ethan Perez
10 months
📖 For more, check out the full paper, blogpost, and videos of our results: Full paper: Blogpost: Videos: Work by @JuanRocamonde @VMontesinos42 @elvisnavah @EthanJPerez @davlindner
0
1
11
0
3
7
@farairesearch
FAR AI
4 months
🔍Recent FAR paper reading group explored the complexities of aligning and ensuring the safety of large language models. It highlighted 18 challenges across scientific understanding, deployment methods, and sociotechnical issues, sparking research questions. 🤖
@usmananwar391
Usman Anwar
4 months
We released this new agenda on LLM-safety yesterday. This is VERY comprehensive covering 18 different challenges. My co-authors have posted tweets for each of these challenges. I am going to collect them all here! P.S. this is also now on arxiv:
5
21
77
0
2
7
@farairesearch
FAR AI
7 months
📣 @SecRaimondo of @CommerceGov launched the US AI Safety Institute Consortium #AISIC , uniting over 200 AI stakeholders. FAR AI is proud to join this initiative, working with @NIST to champion safe, secure, and trustworthy AI! 🚀
Tweet media one
2
1
7
@farairesearch
FAR AI
8 months
Thanks to our research team Kellin Pelrine, Mohammad Taufeeque, @michal_zajac_, @EuanMclean & @ARGleave, and to @OpenAI for supporting this work.
1
0
7
@farairesearch
FAR AI
1 month
⚪⚫Can AI be truly robust, or are adversarial attacks inevitable? We tested 3 defenses on top Go AIs. Known threats were blocked, but new adversaries still broke through. Our paper was featured in Nature Magazine—learn more at our #ICML2024 @NG_AI_Safety poster on July 26! 💡🔍
@farairesearch
FAR AI
2 months
🛡 Is AI robustness possible, or are adversarial attacks unavoidable? We tested three defenses to make superhuman Go AIs robust. Our defenses manage to protect against known threats, but unfortunately new adversaries bypass them, sometimes using qualitatively new attacks! 🧵
7
46
196
6
1
7
@farairesearch
FAR AI
10 months
We find vision-language models provide a reward signal that can train a humanoid robot to do a variety of tasks given an English description of the task.
@EthanJPerez
Ethan Perez
10 months
🤖🧘 We trained a humanoid robot to do yoga based on simple natural language prompts like "a humanoid robot kneeling" or "a humanoid robot doing splits." How? We use a Vision-Language Model (VLM) as a reward model. Larger VLM = better reward model. 👇
7
31
169
0
2
7
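The reward signal can be sketched as cosine similarity between the VLM's embeddings of a camera frame and the task description: frames that better match the text score higher. The embeddings below are random stand-ins, not outputs of a real CLIP-style encoder:

```python
import numpy as np

def vlm_reward(image_embedding, text_embedding):
    """Reward = cosine similarity between image and text embeddings
    (sketch of a VLM-as-reward-model setup)."""
    a = image_embedding / np.linalg.norm(image_embedding)
    b = text_embedding / np.linalg.norm(text_embedding)
    return float(a @ b)

rng = np.random.default_rng(0)
goal_text = rng.normal(size=32)                     # e.g. "a humanoid robot kneeling"
on_task_frame = goal_text + 0.1 * rng.normal(size=32)   # frame close to the goal
off_task_frame = rng.normal(size=32)                    # unrelated frame

# The on-task frame should earn a higher reward than the off-task one.
print(vlm_reward(on_task_frame, goal_text) > vlm_reward(off_task_frame, goal_text))
```

An RL algorithm can then maximize this scalar per frame, training behavior from nothing but an English description of the task.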
@farairesearch
FAR AI
2 months
@KellinPelrine ⚾The ViT bot we trained for defense #3 is actually the world’s first professional-level vision transformer Go AI. You can play our bot called ViTKata001 on .
1
0
6
@farairesearch
FAR AI
4 months
🌟📊🔍 #ICLR2024 Poster: STARC metrics provide a theoretically elegant and empirically validated method for evaluating reward functions. Visit us on Fri 10 May, 4:30 PM CEST, Halle B #165 for "STARC: A General Framework For Quantifying Differences Between Reward Functions."
@farairesearch
FAR AI
4 months
🌟 Our #ICLR2024 paper introduces STARC (STAndardised Reward Comparison) to compare reward functions, enhancing evaluation and safety of reward learning algorithms. 🔗👇
Tweet media one
1
1
6
1
0
6
@farairesearch
FAR AI
9 months
🤖✨ Even advanced AIs have weaknesses! Watch our CEO Adam Gleave at the Gartner IT Symposium discuss how AI can fail catastrophically and without warning, showing the importance of human oversight in reliability and impact.
0
0
6
@farairesearch
FAR AI
4 months
⚙️ @ksb_id at the FAR Labs Seminar explores the intersection of category theory and AI safety, emphasizing legible and verifiable models for better stakeholder collaboration 🎬🔗👇
Tweet media one
1
0
6
@farairesearch
FAR AI
1 month
Is your AI aware that it is an LLM? A recent FAR paper reading group explored “Me, Myself and AI: The Situational Awareness Dataset for LLMs”, showing that frontier models have partial situational awareness, only weakly correlated with general performance benchmarks like MMLU. 📚👀
@OwainEvans_UK
Owain Evans at ICML Vienna
2 months
New paper: We measure *situational awareness* in LLMs, i.e. a) Do LLMs know they are LLMs and act as such? b) Are LLMs aware when they’re deployed publicly vs. tested in-house? If so, this undermines the validity of the tests! We evaluate 19 LLMs on 16 new tasks 🧵
Tweet media one
14
80
385
2
2
6
@farairesearch
FAR AI
3 months
📚👀Recent FAR paper reading group explored advancing AI safety and alignment through weak-to-strong generalization, emphasizing scalable methods and a deeper scientific understanding to manage superhuman models responsibly. 💪🤖
@farairesearch
FAR AI
6 months
🌟 @CollinBurns4 showcased @OpenAI #Superalignment team’s on Weak-to-Strong Generalization! 🤖 The research explored using smaller AI models to supervise larger ones, providing a novel method for efficient AI alignment. 🚀 #AIAlignmentWorkshop
Tweet media one
1
1
3
0
0
6
@farairesearch
FAR AI
2 months
@KellinPelrine 👥 Research by @tomhmtseng , @EuanMcLean49582 , @KellinPelrine , @TonyWangIV , and @ARGleave . 🚀If you're interested in making AI systems more robust, we're hiring! Check out our roles at
Tweet media one
1
1
6
@farairesearch
FAR AI
3 months
👥Work by @davidad , @JoarMVS , Yoshua Bengio, Stuart Russell, @tegmark , Sanjit Seshia, @steveom , @ChrSzegedy , @AmmannNora , @BenGoldhaber and more. 📄Read the paper:
Tweet media one
0
1
6
@farairesearch
FAR AI
28 days
In general, 75% of cycles in the first 5 steps disappear given extra thinking time. Time to think in the middle of a level also helps: 82% of N-step cycles disappear with N steps to think.
Tweet media one
1
0
6