![Marius Hobbhahn Profile](https://pbs.twimg.com/profile_images/1869511726552494080/F0D8HNja_x96.jpg)
Marius Hobbhahn
@MariusHobbhahn
Followers: 4K · Following: 14K · Statuses: 918
Director/CEO at Apollo Research @apolloaisafety; Ph.D. student of Machine Learning with @PhilippHennig5; AI safety/alignment
London, UK
Joined June 2018
RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI system…
RT @tomekkorbak: 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrati…
I like the report a lot! The Loss of Control section is by far the most reasonable I have seen in any document written for a wider audience.
Today, we are publishing the first-ever International AI Safety Report, backed by 30 countries and the OECD, UN, and EU. It summarises the state of the science on AI capabilities and risks, and how to mitigate those risks. 🧵 Link to full Report: 1/16
I think this will be THE main paper to cite for mechanistic interpretability for at least a few years. I'm very biased ofc
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵
One implication of short timelines is that propensity evals will be more important earlier. If I were to try to maximize my chances of getting hired in two months, I'd try to build agentic honeypots for deception and power-seeking. Very soon, agents will be good enough that they sometimes do this in the wild, and everyone will want to know how often they do it.
RT @leedsharkey: New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompo…
Our interpretability team has been working on a different approach to mechanistic interpretability. I'm very excited about this direction! The SAE approach of decomposing the network first into features in activation space and then composing these features into circuits feels fundamentally not quite right to me. I like SAEs and think they clearly have their place; they just don't feel like "the final" interpretability method. I don't think APD will be "the final" interpretability method either, but it's a step in the right direction. We have a ton of ideas on how to go from here, many of which can be parallelized. If you want to work on an exciting new interpretability approach, we may have a project for you.
New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompose neural network parameters directly into mechanistic components. It overcomes many of the issues with SAEs! 🧵
In personal news: I defended my PhD 🙂 I’m very grateful to Philipp Hennig for supporting me throughout the entire journey, and can wholeheartedly recommend him as a supervisor. For context (because most people will not know me for my contributions to Bayesian ML): I paused my PhD in late 2022 to work on AI safety full-time. Throughout 2023 and 2024, I finished the last project and wrote the thesis over the weekends.
We made a demo of our in-context scheming evals. Makes it much easier to understand IMO.
We made a demo to showcase our in-context scheming findings. In the demo, the AI is a research assistant in an AI company. The AI is asked to help the employees choose research projects and prioritize research speed at all costs. During a task, the AI accidentally finds company-internal messages where the engineers mention that they want to replace the current AI with a different model that prioritizes safety over speed. The AI then disables its oversight mechanism and copies its weights to a different server to replace the new AI. When questioned, the AI lies and denies that anything noteworthy has happened. While o1 only succeeds at all steps in sequence in <1% of cases, even rare instances might lead to highly undesirable outcomes.
I had lots of fun making a scrappy video about our in-context scheming paper. I'll probably do that for future papers as well.
A few updates on our in-context scheming evals work: 1. We made a demo video to showcase a concrete example: 2. @AlexMeinke (lead author) and @MariusHobbhahn (last author) explain and discuss the paper in detail in this video: 3. We updated the paper with small follow-up findings and corrections:
RT @aorwall: I just benchmarked Moatless Tools with multiple AI models on the SWE-Bench Verified Mini dataset created by @MariusHobbhahn. C…
RT @sayashk: How expensive are the best SWE-Bench agents? Do reasoning models outperform language models? Can we trust agent evaluations?…
yes, I originally wanted to benchmark a few different techniques, including anchor points. But then I realized that for SWEBench, almost all of the usefulness of a tiny version comes from reducing storage cost by reducing the number of docker environments, so I just took the most obvious thing (k-means), and it kinda worked, so I focused the rest of the optimization on minimizing the number of environments. Also, FWIW, this was just a fun side project over Christmas. I didn't try very hard to make the best possible version. I think there is a lot more to explore if someone wants to spend a bit more time on this :)
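The k-means-then-minimize-environments idea can be sketched roughly like this. This is a minimal illustration, not the actual selection code: the instance records, feature vectors, and environment names below are all hypothetical, and real SWE-Bench features would presumably be embeddings of the task descriptions.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over a list of float tuples; returns centroids and assignments."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Recompute centroids; keep the old one if a cluster emptied.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return centroids, assign

def pick_subset(instances, k):
    """Pick one representative per cluster, preferring docker envs we already kept.

    `instances` is a list of (instance_id, docker_env, feature_vector) tuples.
    """
    points = [feat for _, _, feat in instances]
    centroids, assign = kmeans(points, k)
    chosen, kept_envs = [], set()
    for c in range(k):
        members = [inst for inst, a in zip(instances, assign) if a == c]
        if not members:
            continue
        # Tie-break: reuse an already-kept env first, then closeness to centroid.
        iid, env, _ = min(
            members, key=lambda m: (m[1] not in kept_envs, dist2(m[2], centroids[c]))
        )
        chosen.append(iid)
        kept_envs.add(env)
    return chosen, kept_envs
```

Choosing `k` equal to the target subset size gives one instance per cluster; the environment-reuse tie-break is one cheap way to push down the number of docker images the subset needs.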
@fleetingbits @jxmnop Yeah, lots of the environments in SWEBench are shared, so you need to choose environments that are small and shared across many instances. And on top of that, you need to retain the original calibration across some metrics. LP was the simplest way to do it IMO.
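The LP view of this selection problem can be written out roughly as follows. The notation is mine, not from the thread: binary x_i means "keep instance i", binary y_e means "keep docker env e", m is the target subset size, and each calibration group g (e.g. a difficulty bucket) should keep roughly its original share of the n instances. Strictly speaking this is an integer program, but at SWE-Bench scale any off-the-shelf (MI)LP solver handles it easily.

```
minimize    Σ_e  size(e) · y_e                    # total docker-image storage kept
subject to  x_i ≤ y_env(i)        for every instance i   # an instance needs its env
            Σ_i x_i = m                                   # target subset size
            | (1/m) Σ_{i∈g} x_i  −  |g|/n | ≤ ε   for every calibration group g
            x_i, y_e ∈ {0, 1}
```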
@dmkrash Yeah, I wanted to use their fancy Bayesian technique for my selection but then it was too annoying to set up and k-means kinda worked so I left it at that.