Iván Arcuschin Profile
Iván Arcuschin (@IvanArcus)

Followers: 76 · Following: 6K · Statuses: 23

Independent Researcher | AI Safety & Software Engineering

Argentina
Joined March 2011
Iván Arcuschin (@IvanArcus) · 4 months
Our paper introducing InterpBench was accepted to @NeurIPSConf! 🚀 Check it out if you want to know how we built a benchmark of semi-synthetic, realistic transformers with known circuits! 🔥 Congrats and thanks to my awesome co-authors @RohDGupta @Kwathomas0 @AdriGarriga
Iván Arcuschin (@IvanArcus) · 7 months
Circuit discovery techniques aim to find subgraphs of NNs for specific tasks. Are they correct? Which one is the best? 🕵️ Introducing InterpBench: 17 semi-synthetic, realistic transformers with known circuits to evaluate mechanistic interpretability. Read on... 🧵
Iván Arcuschin (@IvanArcus) · 2 months
@NeurIPSConf is almost here!! 🤩 InterpBench has been expanded to 86 models since our latest update! If you are interested in rigorous evaluation of Mech Interp techniques come chat with us! We'll be at Poster Session 5 East on Fri 13 Dec 11AM — 2PM
Iván Arcuschin (@IvanArcus) · 4 months
@Butanium_ @sprice354_ @NeurIPSConf @RohDGupta @Kwathomas0 @AdriGarriga Right now all models have only one algorithmic task, so not much superposition, but we are looking to expand it for SAE evaluations! @evanhanders has done some great initial work in that direction:
Iván Arcuschin (@IvanArcus) · 6 months
RT @uit_bos: Circuits are supposed to explain how a model accomplishes a task. But do they really succeed at this? We evaluate three circu…
Iván Arcuschin (@IvanArcus) · 7 months
RT @farairesearch: Check out #ICML2024 posters by @MATSprogram scholars mentored by @AdriGarriga! July 26: NextGen AI Safety 💥Catastrophi…
Iván Arcuschin (@IvanArcus) · 7 months
Work done with my awesome collaborators: @RohDGupta, @Kwathomas0, @AdriGarriga
Source code for InterpBench and experiments:
For more details, check out our paper:
Iván Arcuschin (@IvanArcus) · 7 months
InterpBench marks a significant step forward in evaluating circuit discovery techniques, paving the way for more reliable and accurate methods. Our findings already challenge previous assumptions and offer new insights into the effectiveness of existing methods 🧑‍🔬
Iván Arcuschin (@IvanArcus) · 7 months
Previous evaluations compared circuit discovery techniques using only the average AUROC, which slightly favors Node SP over ACDC. In this work we use statistical tests and can confidently say that ACDC outperforms Node SP (p-value ≈ 0.0004) on InterpBench 😎
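To make the comparison concrete, here is a minimal sketch of one way such a paired comparison could be run, assuming per-model AUROC scores for ACDC and Node SP and a one-sided Wilcoxon signed-rank test; the test choice and all numbers below are illustrative, not taken from the paper.

```python
# Hypothetical paired comparison of two circuit discovery methods across the
# benchmark's models; the AUROC values are made up for illustration only.
from scipy.stats import wilcoxon

acdc_auroc    = [0.91, 0.88, 0.95, 0.87, 0.93, 0.90, 0.89]   # one score per benchmark model
node_sp_auroc = [0.84, 0.86, 0.90, 0.85, 0.88, 0.83, 0.87]

# One-sided test: are ACDC's per-model AUROCs systematically higher than Node SP's?
stat, p_value = wilcoxon(acdc_auroc, node_sp_auroc, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```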
Iván Arcuschin (@IvanArcus) · 7 months
Now to what everyone was waiting for... We use InterpBench to evaluate 5 state-of-the-art circuit discovery techniques: Automatic Circuit DisCovery (ACDC), Subnetwork Probing (SP) on nodes and edges, Edge Attribution Patching (EAP), and EAP with integrated gradients 👀
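As a rough illustration of how a discovered circuit can be scored against a known one, the sketch below treats a discovery method's output as per-edge relevance scores and computes the AUROC against ground-truth circuit membership; the edge names and scores are invented, and this is not InterpBench's actual API.

```python
# Hypothetical scoring of one circuit discovery run: each candidate edge of the
# model's computational graph gets a relevance score, and we measure how well
# those scores separate true circuit edges from non-circuit edges.
from sklearn.metrics import roc_auc_score

ground_truth = {          # 1 = edge is in the known circuit, 0 = it is not
    "a0.h0->mlp0": 1, "a0.h1->mlp0": 0, "mlp0->a1.h0": 1,
    "a1.h0->logits": 1, "a0.h1->logits": 0,
}
scores = {                # scores from a discovery method (e.g. ACDC or EAP)
    "a0.h0->mlp0": 0.93, "a0.h1->mlp0": 0.12, "mlp0->a1.h0": 0.71,
    "a1.h0->logits": 0.64, "a0.h1->logits": 0.35,
}

edges = sorted(ground_truth)
auroc = roc_auc_score([ground_truth[e] for e in edges], [scores[e] for e in edges])
print(f"AUROC = {auroc:.3f}")
```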
Iván Arcuschin (@IvanArcus) · 7 months
SIIT models also have realistic weights and activations! 😁
Iván Arcuschin (@IvanArcus) · 7 months
To analyze realism, we check whether circuit discovery behaves similarly on SIIT and "natural" transformers trained with supervised learning. We see a consistently higher correlation of the circuit KL between SIIT and “natural” models than between Tracr and “natural” ones 👏
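A minimal sketch of the comparison described above, under the assumption that for each candidate circuit we already have the KL divergence induced by ablating everything outside that circuit in an SIIT model, a Tracr model, and a "natural" model; all values are invented.

```python
# Hypothetical realism check: does circuit-ablation KL in SIIT models track the
# KL in naturally trained models better than Tracr models do?
from scipy.stats import pearsonr

kl_natural = [0.02, 0.15, 0.40, 0.90, 1.30]   # one value per candidate circuit
kl_siit    = [0.03, 0.18, 0.35, 0.85, 1.20]
kl_tracr   = [0.00, 0.00, 0.10, 1.50, 1.55]

r_siit, _  = pearsonr(kl_siit, kl_natural)
r_tracr, _ = pearsonr(kl_tracr, kl_natural)
print(f"corr(SIIT, natural) = {r_siit:.2f}   corr(Tracr, natural) = {r_tracr:.2f}")
```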
Iván Arcuschin (@IvanArcus) · 7 months
We extend IIT to correctly implement Tracr-generated circuits. Strict IIT (SIIT) improves on IIT by also intervening on low-level nodes unmatched to high-level ones, ensuring the high-level model correctly represents the NN's circuit ✅
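To make the difference from plain IIT concrete, here is a toy sketch (not the paper's code; the model, node indices, and task are all invented) of the extra SIIT objective: patch a hidden unit that is not mapped to any high-level variable with activations from another input, and train the network so its output is unchanged.

```python
# Toy SIIT-style step: intervening on units OUTSIDE the circuit must leave the
# model's answer for the base input unchanged.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)

    def forward(self, x, patch=None):
        h = torch.relu(self.fc1(x))
        if patch is not None:            # overwrite selected hidden units
            idx, values = patch
            h = h.clone()
            h[:, idx] = values
        return self.fc2(h)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

unmatched = [4, 5, 6, 7]                 # hidden units mapped to NO high-level variable
x_base, x_source = torch.randn(32, 4), torch.randn(32, 4)
y_base = (x_base.sum(dim=1) > 0).long()  # high-level label for the base input only

with torch.no_grad():
    h_source = torch.relu(model.fc1(x_source))   # activations from the source run

# Patch unmatched units with source activations; the output must still match
# y_base, i.e. those units are trained to have no causal effect on the task.
logits = model(x_base, patch=(unmatched, h_source[:, unmatched]))
loss = loss_fn(logits, y_base)
opt.zero_grad(); loss.backward(); opt.step()
```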
Iván Arcuschin (@IvanArcus) · 7 months
Still, IIT only intervenes on low-level nodes mapped to high-level ones. This creates accurate models where the circuit nodes do the computation, but other nodes not in the circuit may also affect the output. Thus, the circuit we choose isn’t the one the NN actually implements 😢
Iván Arcuschin (@IvanArcus) · 7 months
To get realistic transformers with known circuits, we could train them to follow a circuit with Interchange Intervention Training (IIT), which maps NN components to nodes in a high-level graph, applying interventions to both during training to incentivize similar behavior 🔄
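As a concrete (and entirely toy) illustration of an interchange intervention during training, the sketch below overwrites the hidden units mapped to a high-level variable with their activations from a second "source" input and trains the network to produce the output the high-level algorithm gives under the same intervention; the model, mapping, and task here are invented, not the paper's setup.

```python
# Toy interchange-intervention training (IIT) step on a tiny MLP.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out)

    def forward(self, x, patch=None):
        h = torch.relu(self.fc1(x))
        if patch is not None:            # interchange intervention on mapped units
            idx, values = patch
            h = h.clone()
            h[:, idx] = values
        return self.fc2(h)

model = TinyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

mapped = [0, 1, 2, 3]                    # hidden units mapped to a high-level "sum" variable
x_base, x_source = torch.randn(32, 4), torch.randn(32, 4)

# The high-level algorithm under the same intervention takes its "sum" variable
# from the source input, so the target label does too.
y_intervened = (x_source.sum(dim=1) > 0).long()

with torch.no_grad():
    h_source = torch.relu(model.fc1(x_source))   # activations from the source run

logits = model(x_base, patch=(mapped, h_source[:, mapped]))
loss = loss_fn(logits, y_intervened)
opt.zero_grad(); loss.backward(); opt.step()
```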
Iván Arcuschin (@IvanArcus) · 7 months
However, Tracr transformers are unrealistic & differ from gradient descent-trained ones: most of their weights and activations are zero, none of their features are in superposition, and they use only a small portion of their activations for the task at hand ❌
Iván Arcuschin (@IvanArcus) · 7 months
Another way to evaluate circuit discovery methods is to compare them on NNs that have known circuits by construction. 💡 One such tool for doing that is Tracr, which compiles RASP programs into transformers, creating models with known algorithms.
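For reference, a small example in the spirit of Tracr's public documentation: compile a RASP program (here, computing the sequence length via SelectorWidth) into a transformer with a known algorithm. The exact API may differ between Tracr versions, so treat this as a sketch.

```python
# Compile a simple RASP program into a transformer using Tracr.
from tracr.rasp import rasp
from tracr.compiler import compiling

# RASP: select every position, then count how many were selected -> sequence length.
length = rasp.SelectorWidth(rasp.Select(rasp.tokens, rasp.tokens, rasp.Comparison.TRUE))

model = compiling.compile_rasp_to_model(
    length,
    vocab={1, 2, 3},
    max_seq_len=5,
    compiler_bos="BOS",
)
print(model.apply(["BOS", 1, 2, 3]).decoded)  # expect something like ["BOS", 3, 3, 3]
```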
Iván Arcuschin (@IvanArcus) · 7 months
Previous attempts to compare circuit discovery techniques used just a few NNs with manually-curated circuits. But how confident can we be that those NNs implement the claimed circuits? We don't know for sure! 😕