Iván Arcuschin Profile
Iván Arcuschin (@IvanArcus)

Followers: 76 · Following: 6K · Statuses: 23

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
Iván Arcuschin (@IvanArcus) · 4 months
Our paper introducing InterpBench was accepted to @NeurIPSConf! 🚀 Check it out if you want to know how we built a benchmark of semi-synthetic, realistic transformers with known circuits! 🔥 Congrats and thanks to my awesome co-authors @RohDGupta @Kwathomas0 @AdriGarriga
Iván Arcuschin (@IvanArcus) · 7 months
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵
Iván Arcuschin (@IvanArcus) · 2 months
@NeurIPSConf is almost here!! 🤩 InterpBench has been expanded to 86 models since our latest update! If you are interested in rigorous evaluation of Mech Interp techniques come chat with us! We'll be at Poster Session 5 East on Fri 13 Dec 11AM — 2PM
Iván Arcuschin (@IvanArcus) · 4 months
@Butanium_ @sprice354_ @NeurIPSConf @RohDGupta @Kwathomas0 @AdriGarriga Right now all models have only one algorithmic task, so not much superposition, but we are looking to expand it for SAE evaluations! @evanhanders has done some great initial work in that direction:
Iván Arcuschin (@IvanArcus) · 6 months
RT @uit_bos: Circuits are supposed to explain how a model accomplishes a task. But do they really succeed at this? We evaluate three circu…
Iván Arcuschin (@IvanArcus) · 7 months
RT @farairesearch: Check out #ICML2024 posters by @MATSprogram scholars mentored by @AdriGarriga! July 26: NextGen AI Safety 💥Catastrophi…
Iván Arcuschin (@IvanArcus) · 7 months
Work done with my awesome collaborators: @RohDGupta, @Kwathomas0, @AdriGarriga
Source code for InterpBench and experiments:
For more details, check out our paper:
Iván Arcuschin (@IvanArcus) · 7 months
InterpBench marks a significant step forward in evaluating circuit discovery techniques, paving the way for more reliable and accurate methods. Our findings already challenge previous assumptions and offer new insights into the effectiveness of existing methods 🧑‍🔬
Iván Arcuschin (@IvanArcus) · 7 months
Previous evaluations compared circuit discovery techniques using only the average AUROC, which slightly favors Node SP over ACDC. In this work we use statistical tests and can confidently say that ACDC outperforms Node SP (p-value ≈ 0.0004) on InterpBench 😎
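To make the comparison concrete, here is a minimal sketch of one way such a paired comparison could be run, assuming per-model AUROC scores for ACDC and Node SP and a one-sided Wilcoxon signed-rank test; the test choice and all numbers below are illustrative, not taken from the paper.

```python
# Hypothetical paired comparison of two circuit discovery methods across the
# benchmark's models; the AUROC values are made up for illustration only.
from scipy.stats import wilcoxon

acdc_auroc    = [0.91, 0.88, 0.95, 0.87, 0.93, 0.90, 0.89]   # one score per benchmark model
node_sp_auroc = [0.84, 0.86, 0.90, 0.85, 0.88, 0.83, 0.87]

# One-sided test: are ACDC's per-model AUROCs systematically higher than Node SP's?
stat, p_value = wilcoxon(acdc_auroc, node_sp_auroc, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```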
Iván Arcuschin (@IvanArcus) · 7 months
Now to what everyone was waiting for... We use InterpBench to evaluate 5 state-of-the-art circuit discovery techniques: Automatic Circuit DisCovery (ACDC), Subnetwork Probing (SP) on nodes and edges, Edge Attribution Patching (EAP), and EAP with integrated gradients 👀
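As a rough illustration of how a discovered circuit can be scored against a known one, the sketch below treats a discovery method's output as per-edge relevance scores and computes the AUROC against ground-truth circuit membership; the edge names and scores are invented, and this is not InterpBench's actual API.

```python
# Hypothetical scoring of one circuit discovery run: each candidate edge of the
# model's computational graph gets a relevance score, and we measure how well
# those scores separate true circuit edges from non-circuit edges.
from sklearn.metrics import roc_auc_score

ground_truth = {          # 1 = edge is in the known circuit, 0 = it is not
    "a0.h0->mlp0": 1, "a0.h1->mlp0": 0, "mlp0->a1.h0": 1,
    "a1.h0->logits": 1, "a0.h1->logits": 0,
}
scores = {                # scores from a discovery method (e.g. ACDC or EAP)
    "a0.h0->mlp0": 0.93, "a0.h1->mlp0": 0.12, "mlp0->a1.h0": 0.71,
    "a1.h0->logits": 0.64, "a0.h1->logits": 0.35,
}

edges = sorted(ground_truth)
auroc = roc_auc_score([ground_truth[e] for e in edges], [scores[e] for e in edges])
print(f"AUROC = {auroc:.3f}")
```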
Iván Arcuschin (@IvanArcus) · 7 months
SIIT models also have realistic weights and activations! 😁
Iván Arcuschin (@IvanArcus) · 7 months
To analyze realism, we check whether circuit discovery behaves similarly on SIIT and "natural" transformers trained with supervised learning. We see a consistently higher correlation of the circuit KL between SIIT and “natural” models than between Tracr and “natural” ones 👏
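A minimal sketch of the comparison described above, under the assumption that for each candidate circuit we already have the KL divergence induced by ablating everything outside that circuit in an SIIT model, a Tracr model, and a "natural" model; all values are invented.

```python
# Hypothetical realism check: does circuit-ablation KL in SIIT models track the
# KL in naturally trained models better than Tracr models do?
from scipy.stats import pearsonr

kl_natural = [0.02, 0.15, 0.40, 0.90, 1.30]   # one value per candidate circuit
kl_siit    = [0.03, 0.18, 0.35, 0.85, 1.20]
kl_tracr   = [0.00, 0.00, 0.10, 1.50, 1.55]

r_siit, _  = pearsonr(kl_siit, kl_natural)
r_tracr, _ = pearsonr(kl_tracr, kl_natural)
print(f"corr(SIIT, natural) = {r_siit:.2f}   corr(Tracr, natural) = {r_tracr:.2f}")
```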
Iván Arcuschin (@IvanArcus) · 7 months
We extend IIT to correctly implement Tracr-generated circuits. Strict IIT (SIIT) improves on IIT by also intervening on low-level nodes unmatched to high-level ones, ensuring the high-level model correctly represents the NN's circuit ✅
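To make the difference from plain IIT concrete, here is a toy sketch (not the paper's code; the model, node indices, and task are all invented) of the extra SIIT objective: patch a hidden unit that is not mapped to any high-level variable with activations from another input, and train the network so its output is unchanged.

```python
# Toy SIIT-style step: intervening on units OUTSIDE the circuit must leave the
# model's answer for the base input unchanged.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)

    def forward(self, x, patch=None):
        h = torch.relu(self.fc1(x))
        if patch is not None:            # overwrite selected hidden units
            idx, values = patch
            h = h.clone()
            h[:, idx] = values
        return self.fc2(h)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

unmatched = [4, 5, 6, 7]                 # hidden units mapped to NO high-level variable
x_base, x_source = torch.randn(32, 4), torch.randn(32, 4)
y_base = (x_base.sum(dim=1) > 0).long()  # high-level label for the base input only

with torch.no_grad():
    h_source = torch.relu(model.fc1(x_source))   # activations from the source run

# Patch unmatched units with source activations; the output must still match
# y_base, i.e. those units are trained to have no causal effect on the task.
logits = model(x_base, patch=(unmatched, h_source[:, unmatched]))
loss = loss_fn(logits, y_base)
opt.zero_grad(); loss.backward(); opt.step()
```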
Iván Arcuschin (@IvanArcus) · 7 months
Still, IIT only intervenes on low-level nodes mapped to high-level ones. This creates accurate models where the circuit nodes do the computation, but other nodes not in the circuit may also affect the output. Thus, the circuit we choose isn’t the one the NN actually implements 😢
Iván Arcuschin (@IvanArcus) · 7 months
To get realistic transformers with known circuits, we could train them to follow a circuit with Interchange Intervention Training (IIT), which maps NN components to nodes in a high-level graph, applying interventions to both during training to incentivize similar behavior 🔄
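As a concrete (and entirely toy) illustration of an interchange intervention during training, the sketch below overwrites the hidden units mapped to a high-level variable with their activations from a second "source" input and trains the network to produce the output the high-level algorithm gives under the same intervention; the model, mapping, and task here are invented, not the paper's setup.

```python
# Toy interchange-intervention training (IIT) step on a tiny MLP.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)

    def forward(self, x, patch=None):
        h = torch.relu(self.fc1(x))
        if patch is not None:            # interchange intervention on mapped units
            idx, values = patch
            h = h.clone()
            h[:, idx] = values
        return self.fc2(h)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

mapped = [0, 1, 2, 3]                    # hidden units mapped to a high-level "sum" variable
x_base, x_source = torch.randn(32, 4), torch.randn(32, 4)

# The high-level algorithm under the same intervention takes its "sum" variable
# from the source input, so the target label does too.
y_intervened = (x_source.sum(dim=1) > 0).long()

with torch.no_grad():
    h_source = torch.relu(model.fc1(x_source))   # activations from the source run

logits = model(x_base, patch=(mapped, h_source[:, mapped]))
loss = loss_fn(logits, y_intervened)
opt.zero_grad(); loss.backward(); opt.step()
```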
Iván Arcuschin (@IvanArcus) · 7 months
However, Tracr transformers are unrealistic & differ from gradient descent-trained ones: most of their weights and activations are zero, none of their features are in superposition, and they use only a small portion of their activations for the task at hand ❌
Iván Arcuschin (@IvanArcus) · 7 months
Another way to evaluate circuit discovery methods is to compare them on NNs that have known circuits by construction. 💡 One such tool for doing that is Tracr, which compiles RASP programs into transformers, creating models with known algorithms.
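For reference, a small example in the spirit of Tracr's public documentation: compile a RASP program (here, computing the sequence length via SelectorWidth) into a transformer with a known algorithm. The exact API may differ between Tracr versions, so treat this as a sketch.

```python
# Compile a simple RASP program into a transformer using Tracr.
from tracr.rasp import rasp
from tracr.compiler import compiling

# RASP: select every position, then count how many were selected -> sequence length.
length = rasp.SelectorWidth(rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.TRUE))

model = compiling.compile_rasp_to_model(
    length,
    vocab={1, 2, 3},
    max_seq_len=5,
    compiler_bos="BOS",
)
print(model.apply(["BOS", 1, 2, 3]).decoded)  # expect something like ["BOS", 3, 3, 3]
```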
Iván Arcuschin (@IvanArcus) · 7 months
Previous attempts to compare circuit discovery techniques used just a few NNs with manually-curated circuits. But how confident can we be that those NNs implement the claimed circuits? We don't know for sure! 😕