Marius Hobbhahn
@MariusHobbhahn
Followers: 4K · Following: 14K · Statuses: 918

Director/CEO at Apollo Research @apolloaisafety; PhD student in Machine Learning with @PhilippHennig5; AI safety/alignment.

London, UK · Joined June 2018
@MariusHobbhahn
Marius Hobbhahn
4 days
RT @MaxNadeau_: 🧵 Announcing @open_phil's Technical AI Safety RFP! We're seeking proposals across 21 research areas to help make AI system…
@MariusHobbhahn
Marius Hobbhahn
6 days
I'll be in Paris for the AI action summit from 7th to 11th. If you want to meet, reach out. *This is not a promise that I will have time for everyone
@MariusHobbhahn
Marius Hobbhahn
11 days
RT @tomekkorbak: 🧵 What safety measures prevent a misaligned LLM agent from causing a catastrophe? How do we make a safety case demonstrati…
@MariusHobbhahn
Marius Hobbhahn
12 days
I like the report a lot! The Loss of Control section is by far the most reasonable I have seen in any document written for a wider audience.
@Yoshua_Bengio
Yoshua Bengio
12 days
Today, we are publishing the first-ever International AI Safety Report, backed by 30 countries and the OECD, UN, and EU. It summarises the state of the science on AI capabilities and risks, and how to mitigate those risks. 🧵 Link to full Report: 1/16
@MariusHobbhahn
Marius Hobbhahn
12 days
I think this will be THE main paper to cite for mechanistic interpretability for at least a few years. I'm very biased ofc
@leedsharkey
Lee Sharkey
12 days
Big new review! 🟦Open Problems in Mechanistic Interpretability🟦 We bring together perspectives from ~30 top researchers to outline the current frontiers of mech interp. It highlights the open problems that we think the field should prioritize! 🧵
[image]
@MariusHobbhahn
Marius Hobbhahn
14 days
One implication of short timelines is that propensity evals will be more important earlier. If I were to try to maximize my chances for getting hired in two months, I'd try to build agentic honeypots for deception and power seeking. Very soon agents will be good enough that they sometimes do this in the wild and everyone will want to know how often they do it.
@MariusHobbhahn
Marius Hobbhahn
14 days
RT @leedsharkey: New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompo…
@MariusHobbhahn
Marius Hobbhahn
14 days
Our interpretability team has been working on a different approach to mechanistic interpretability. I'm very excited about this direction! The SAE approach of decomposing the network first into features in activation space and then composing these features into circuits feels fundamentally not quite right to me. I like SAEs and think they clearly have their place, they just don't feel like "the final" interpretability method. I don't think APD will be "the final" interpretability method, but it's a step in the right direction. We have a ton of ideas on how to go from here, many of which can be parallelized. If you want to work on a new exciting interpretability approach, we may have a project for you.
@leedsharkey
Lee Sharkey
14 days
New interpretability paper from Apollo Research! 🟢Attribution-based Parameter Decomposition 🟢 It's a new way to decompose neural network parameters directly into mechanistic components. It overcomes many of the issues with SAEs! 🧵
[image]
@MariusHobbhahn
Marius Hobbhahn
16 days
In personal news: I defended my PhD 🙂 I’m very grateful to Philipp Hennig for supporting me throughout the entire journey, and can wholeheartedly recommend him as a supervisor. For context (because most people will not know me for my contributions to Bayesian ML): I paused my PhD in late 2022 to work on AI safety full-time. Throughout 2023 and 2024, I finished the last project and wrote the thesis over the weekends.
[image]
@MariusHobbhahn
Marius Hobbhahn
18 days
Also known as "long timelines" these days ;)
@tsarnick
Tsarathustra
18 days
Google DeepMind CEO Demis Hassabis says AGI that is robust across all cognitive tasks and can invent its own hypotheses and conjectures about science is 3-5 years away
@MariusHobbhahn
Marius Hobbhahn
18 days
We made a demo of our in-context scheming evals. Makes it much easier to understand IMO.
@apolloaisafety
Apollo Research
18 days
We made a demo to showcase our in-context scheming findings. In the demo, the AI is a research assistant in an AI company. The AI is asked to help the employees choose research projects and prioritize research speed at all costs. During a task, the AI accidentally finds company-internal messages where the engineers mention that they want to replace the current AI with a different model that prioritizes safety over speed. The AI then disables its oversight mechanism and copies its weights to a different server to replace the new AI. When questioned, the AI lies and denies that anything noteworthy has happened. While o1 only succeeds at all steps in sequence in <1% of cases, even rare instances might lead to highly undesirable outcomes.
@MariusHobbhahn
Marius Hobbhahn
18 days
I had lots of fun making a scrappy video about our in-context scheming paper. I'll probably do that for future papers as well.
@apolloaisafety
Apollo Research
18 days
A few updates on our in-context scheming evals work: 1. We made a demo video to showcase a concrete example: 2. @AlexMeinke (lead author) and @MariusHobbhahn (last author) explain and discuss the paper in detail in this video: 3. We updated the paper with small follow-up findings and corrections:
@MariusHobbhahn
Marius Hobbhahn
22 days
RT @aorwall: I just benchmarked Moatless Tools with multiple AI models on the SWE-Bench Verified Mini dataset created by @MariusHobbhahn. C…
@MariusHobbhahn
Marius Hobbhahn
25 days
RT @sayashk: How expensive are the best SWE-Bench agents? Do reasoning models outperform language models? Can we trust agent evaluations?…
@MariusHobbhahn
Marius Hobbhahn
29 days
Rarely got burned this hard 🔥 Though in typical Claude style, it apologized in the next message for using the term "obscure figure".
[image]
@MariusHobbhahn
Marius Hobbhahn
1 month
Yes, I originally wanted to benchmark a few different techniques, including anchor points. But then I realized that for SWE-Bench, almost all of the usefulness of a tiny version comes from reducing storage cost by reducing the number of docker environments, so I just took the most obvious thing (k-means), it kinda worked, and I then focused the rest of the optimization on minimizing the number of environments. Also, FWIW, this was just a fun side project over Christmas. I didn't try very hard to make the best possible version. I think there is a lot more to explore if someone wants to spend a bit more time on this :)
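The selection idea described here (cluster instances, keep one representative per cluster so that few docker environments are needed) can be sketched roughly as below. This is a toy stand-in, not the actual SWE-Bench Mini pipeline: the 2-D feature vectors and helper names are hypothetical, and the real selection presumably used richer instance features.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))

def kmeans(points, k, iters=25, seed=0):
    """Plain Lloyd's k-means; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[best].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def pick_representatives(points, k):
    """Keep the instance closest to each centroid: one per cluster."""
    centroids, clusters = kmeans(points, k)
    reps = []
    for cen, clu in zip(centroids, clusters):
        if clu:
            reps.append(min(clu, key=lambda p: dist2(p, cen)))
    return reps

# Hypothetical 2-D feature vectors for benchmark instances.
instances = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.75), (0.5, 0.5)]
reps = pick_representatives(instances, k=2)
print(len(reps), "representatives out of", len(instances))
```

Each representative is an actual instance (not a centroid), so the reduced benchmark only needs the docker environments of the kept instances.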
@MariusHobbhahn
Marius Hobbhahn
1 month
@fleetingbits @jxmnop Yeah, lots of the environments in SWE-Bench are shared, so you need to choose environments that are small and shared across many instances. On top of that, you need to retain the original calibration across some metrics. LP was the simplest way to do it IMO.
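The tweet says this was done with an LP; as a stdlib-only sketch of the same objective, here is a greedy stand-in: repeatedly take the environment shared by the most remaining instances until the target subset size is reached, so the subset needs as few environments as possible. The instance/environment names are made up, and unlike the real LP this greedy version ignores the calibration constraints mentioned above.

```python
from collections import defaultdict

# Hypothetical mapping: each instance needs one docker environment,
# and environments are shared across many instances.
instance_env = {
    "astropy-101": "env-a", "astropy-102": "env-a", "astropy-103": "env-a",
    "django-201": "env-b", "django-202": "env-b",
    "sympy-301": "env-c",
    "flask-401": "env-d", "flask-402": "env-d",
}

def select_instances(instance_env, target):
    """Greedy stand-in for the LP: pick whole environments, largest
    first, until we have `target` instances, minimizing the number of
    docker environments the subset requires."""
    by_env = defaultdict(list)
    for inst, env in instance_env.items():
        by_env[env].append(inst)
    chosen, envs = [], []
    while len(chosen) < target and by_env:
        env = max(by_env, key=lambda e: len(by_env[e]))
        envs.append(env)
        chosen.extend(by_env.pop(env))
    return chosen[:target], envs

subset, envs = select_instances(instance_env, target=5)
print(len(subset), "instances using", len(envs), "environments")
```

A real LP/ILP formulation would use binary selection variables per instance, minimize total environment size, and add equality-ish constraints keeping metrics (e.g. mean difficulty) close to the full benchmark.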
@MariusHobbhahn
Marius Hobbhahn
1 month
@dmkrash Yeah, I wanted to use their fancy Bayesian technique for my selection but then it was too annoying to set up and k-means kinda worked so I left it at that.