Marius Hobbhahn
@MariusHobbhahn
Followers: 4K · Following: 14K · Statuses: 918

Director/CEO at Apollo Research @apolloaisafety; PhD student in Machine Learning with @PhilippHennig5; AI safety/alignment.

London, UK · Joined June 2018
@MariusHobbhahn
Marius Hobbhahn
4 days
RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI system…
@MariusHobbhahn
Marius Hobbhahn
6 days
I'll be in Paris for the AI action summit from 7th to 11th. If you want to meet, reach out. *This is not a promise that I will have time for everyone
@MariusHobbhahn
Marius Hobbhahn
11 days
RT @tomekkorbak: 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrati…
@MariusHobbhahn
Marius Hobbhahn
12 days
I like the report a lot! The Loss of Control section is by far the most reasonable I have seen in any document written for a wider audience.
@Yoshua_Bengio
Yoshua Bengio
12 days
Today, we are publishing the first-ever International AI Safety Report, backed by 30 countries and the OECD, UN, and EU. It summarises the state of the science on AI capabilities and risks, and how to mitigate those risks. 🧵 Link to full Report: 1/16
@MariusHobbhahn
Marius Hobbhahn
12 days
I think this will be THE main paper to cite for mechanistic interpretability for at least a few years. I'm very biased ofc
@leedsharkey
Lee Sharkey
12 days
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵
[image]
@MariusHobbhahn
Marius Hobbhahn
14 days
One implication of short timelines is that propensity evals will be more important earlier. If I were to try to maximize my chances for getting hired in two months, I'd try to build agentic honeypots for deception and power seeking. Very soon agents will be good enough that they sometimes do this in the wild and everyone will want to know how often they do it.
@MariusHobbhahn
Marius Hobbhahn
14 days
RT @leedsharkey: New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompo…
@MariusHobbhahn
Marius Hobbhahn
14 days
Our interpretability team has been working on a different approach to mechanistic interpretability. I'm very excited about this direction! The SAE approach of decomposing the network first into features in activation space and then composing these features into circuits feels fundamentally not quite right to me. I like SAEs and think they clearly have their place, they just don't feel like "the final" interpretability method. I don't think APD will be "the final" interpretability method, but it's a step in the right direction. We have a ton of ideas on how to go from here, many of which can be parallelized. If you want to work on a new exciting interpretability approach, we may have a project for you.
@leedsharkey
Lee Sharkey
14 days
New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompose neural network parameters directly into mechanistic components. It overcomes many of the issues with SAEs! 🧵
[image]
@MariusHobbhahn
Marius Hobbhahn
16 days
In personal news: I defended my PhD 🙂 I’m very grateful to Philipp Hennig for supporting me throughout the entire journey, and can wholeheartedly recommend him as a supervisor. For context (because most people will not know me for my contributions to Bayesian ML): I paused my PhD in late 2022 to work on AI safety full-time. Throughout 2023 and 2024, I finished the last project and wrote the thesis over the weekends.
[image]
@MariusHobbhahn
Marius Hobbhahn
18 days
Also known as "long timelines" these days ;)
@tsarnick
Tsarathustra
18 days
Google DeepMind CEO Demis Hassabis says AGI that is robust across all cognitive tasks and can invent its own hypotheses and conjectures about science is 3-5 years away
@MariusHobbhahn
Marius Hobbhahn
18 days
We made a demo of our in-context scheming evals. Makes it much easier to understand IMO.
@apolloaisafety
Apollo Research
18 days
We made a demo to showcase our in-context scheming findings. In the demo, the AI is a research assistant in an AI company. The AI is asked to help the employees choose research projects and prioritize research speed at all costs. During a task, the AI accidentally finds company-internal messages where the engineers mention that they want to replace the current AI with a different model that prioritizes safety over speed. The AI then disables its oversight mechanism and copies its weights to a different server to replace the new AI. When questioned, the AI lies and denies that anything noteworthy has happened. While o1 only succeeds at all steps in sequence in <1% of cases, even rare instances might lead to highly undesirable outcomes.
@MariusHobbhahn
Marius Hobbhahn
18 days
I had lots of fun making a scrappy video about our in-context scheming paper. I'll probably do that for future papers as well.
@apolloaisafety
Apollo Research
18 days
A few updates on our in-context scheming evals work: 1. We made a demo video to showcase a concrete example: 2. @AlexMeinke (lead author) and @MariusHobbhahn (last author) explain and discuss the paper in detail in this video: 3. We updated the paper with small follow-up findings and corrections:
@MariusHobbhahn
Marius Hobbhahn
22 days
RT @aorwall: I just benchmarked Moatless Tools with multiple AI models on the SWE-Bench Verified Mini dataset created by @MariusHobbhahn. C…
@MariusHobbhahn
Marius Hobbhahn
25 days
RT @sayashk: How expensive are the best SWE-Bench agents? Do reasoning models outperform language models? Can we trust agent evaluations?…
@MariusHobbhahn
Marius Hobbhahn
29 days
Rarely got burned this hard 🔥 Though in typical Claude style, it apologized in the next message for using the term "obscure figure".
[image]
@MariusHobbhahn
Marius Hobbhahn
1 month
Yes, I originally wanted to benchmark a few different techniques, including anchor points. But then I realized that for SWE-Bench, almost all of the usefulness of a tiny version comes from reducing storage cost by reducing the number of docker environments, so I just took the most obvious thing (k-means), it kinda worked, and I then focused the rest of the optimization on minimizing the number of environments. Also, FWIW, this was just a fun side project over Christmas. I didn't try very hard to make the best possible version. I think there is a lot more to explore if someone wants to spend a bit more time on this :)
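The selection idea described here (cluster instances, keep one representative per cluster so that few docker environments are needed) can be sketched roughly as below. This is a toy stand-in, not the actual SWE-Bench Mini pipeline: the 2-D feature vectors and helper names are hypothetical, and the real selection presumably used richer instance features.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[best].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def pick_representatives(points, k):
    """Keep the instance closest to each centroid: one per cluster."""
    centroids, clusters = kmeans(points, k)
    reps = []
    for cen, clu in zip(centroids, clusters):
        if clu:
            reps.append(min(clu, key=lambda p: dist2(p, cen)))
    return reps

# Hypothetical 2-D feature vectors for benchmark instances.
instances = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.75), (0.5, 0.5)]
reps = pick_representatives(instances, k=2)
print(len(reps), "representatives out of", len(instances))
```

Each representative is an actual instance (not a centroid), so the reduced benchmark only needs the docker environments of the kept instances.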
@MariusHobbhahn
Marius Hobbhahn
1 month
@fleetingbits @jxmnop Yeah, lots of the environments in SWE-Bench are shared, so you need to choose environments that are small and shared across many instances. On top of that, you need to retain the original calibration across some metrics. LP was the simplest way to do it IMO.
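The tweet says this was done with an LP; as a stdlib-only sketch of the same objective, here is a greedy stand-in: repeatedly take the environment shared by the most remaining instances until the target subset size is reached, so the subset needs as few environments as possible. The instance/environment names are made up, and unlike the real LP this greedy version ignores the calibration constraints mentioned above.

```python
from collections import defaultdict

# Hypothetical mapping: each instance needs one docker environment,
# and environments are shared across many instances.
instance_env = {
    "astropy-101": "env-a", "astropy-102": "env-a", "astropy-103": "env-a",
    "django-201": "env-b", "django-202": "env-b",
    "sympy-301": "env-c",
    "flask-401": "env-d", "flask-402": "env-d",
}

def select_instances(instance_env, target):
    """Greedy stand-in for the LP: pick whole environments, largest
    first, until we have `target` instances, minimizing the number of
    docker environments the subset requires."""
    by_env = defaultdict(list)
    for inst, env in instance_env.items():
        by_env[env].append(inst)
    chosen, envs = [], []
    while len(chosen) < target and by_env:
        env = max(by_env, key=lambda e: len(by_env[e]))
        envs.append(env)
        chosen.extend(by_env.pop(env))
    return chosen[:target], envs

subset, envs = select_instances(instance_env, target=5)
print(len(subset), "instances using", len(envs), "environments")
```

A real LP/ILP formulation would use binary selection variables per instance, minimize total environment size, and add equality-ish constraints keeping metrics (e.g. mean difficulty) close to the full benchmark.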
@MariusHobbhahn
Marius Hobbhahn
1 month
@dmkrash Yeah, I wanted to use their fancy Bayesian technique for my selection but then it was too annoying to set up and k-means kinda worked so I left it at that.