![Boshi Wang Profile](https://pbs.twimg.com/profile_images/1573913186629197825/vQEKxcyt_x96.jpg)
Boshi Wang
@BoshiWang2
Followers: 1K · Following: 228 · Statuses: 102
Fourth-year Ph.D. @OhioState. Prev intern @MSFTResearch
The Ohio State University
Joined May 2021
RT @BoyuGouNLP: 🚀 UGround accepted to #ICLR2025 [scores=10/8/8/5]! 🎉 We’re also thrilled to share some exciting updates: ✨ UGround is SOTA…
0 · 24 · 0
RT @BoyuGouNLP: With recent advancements like Claude 3.5 Computer Use and Gemini 2.0, the field of GUI Agents is rapidly evolving. 🚀 Excit…
0 · 19 · 0
Will attend #NeurIPS2024 Dec 10-14 and present our work studying Transformer's grokking in implicit reasoning. Excited to meet old and new friends!
Thanks @_akhaliq for sharing our work. Very proud to introduce my star student @BoshiWang2's new work @osunlp: Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Can transformers reason? Are transformers fundamentally limited in compositionality and systematic generalization? I think our work's findings can contribute to meaningful debates over these questions. Key findings (with mixed results):

(1) Figure 1 below: We focus on two representative reasoning types, composition and comparison, and show that transformers can learn implicit reasoning, but only through grokking. The levels of generalization vary across reasoning types: when faced with OOD examples, transformers fail to systematically generalize for composition but succeed for comparison.

(2) Figure 2 below: What happens during grokking? Why does grokking happen? Why does the model fail at OOD generalization for composition but succeed for comparison? We conduct a mechanistic analysis of the model's internals throughout training and find: 1) the gradual formation of the generalizing circuit throughout grokking, and 2) a connection between systematicity and the circuit's configuration, i.e., the way atomic knowledge and rules are stored and applied within the circuit. For the comparison task, grokking yields a "parallel circuit" in which atomic facts are stored and retrieved in the same region, which enables systematicity. For composition, the model does acquire the composition rule through grokking, but it has no incentive to store in the upper layers atomic facts that never appear as the second hop during training.

Connections with existing research:
1. We find that the speed of improvement in generalization correlates with the ratio between inferred and atomic facts in training (critical data distribution), and depends little on the absolute size of the training data. This seems to contradict the hypothesis of a critical data size in prior work such as Varma et al.
2. Our work provides a mechanistic understanding of existing findings that transformers seem to reduce compositional reasoning to linearized pattern matching (Dziri et al. @nouhadziri) and that LLMs show positive evidence of first-hop reasoning but not the second hop (Yang et al. @soheeyang_).
3. Why is implicit reasoning with parametric memory of knowledge and rules practically important? To show its potential, we demonstrate that on a complex reasoning task with a large search space, a fully grokked transformer can achieve near-perfect accuracy while GPT-4 Turbo and Gemini-1.5-Pro are close to random guessing.

😀 Fun fact about the title: we went back and forth many times and created ~10 candidate titles. Another title I personally liked very much is "Grokking of Implicit Reasoning: What Happens Inside Transformers?" But that doesn't deliver our key conclusion or our mechanistic analysis approach. Finally, Boshi came up with this title, which sounds very romantic (although perhaps less scientific) to me, but captures most aspects of our paper very well.

P.S. @wangboshi is truly intellectually stimulating to work with. If you have related internship/collaboration opportunities, feel free to reach out! Joint work with @xiangyue96 @ysu_nlp
2 · 5 · 28
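A rough sketch of how a synthetic two-hop composition dataset along these lines could be built; the entity/relation counts, sampling scheme, and the inferred-to-atomic ratio value are illustrative assumptions, not the paper's exact setup:

```python
import random

random.seed(0)

NUM_ENTITIES = 2000
NUM_RELATIONS = 200
RELS_PER_ENTITY = 20
INFERRED_TO_ATOMIC_RATIO = 7.2  # illustrative value; the actual ratio is the studied knob

entities = list(range(NUM_ENTITIES))
relations = list(range(NUM_RELATIONS))

# Each relation behaves as a function: a (head, relation) pair maps to exactly one tail
# entity (the "N-to-one" property discussed in the replies further down).
atomic = {
    (h, r): random.choice(entities)
    for h in entities
    for r in random.sample(relations, k=RELS_PER_ENTITY)
}
atomic_facts = [(h, r, t) for (h, r), t in atomic.items()]

# Two-hop (inferred) facts: if r1(h) = b and r2(b) = t, then (h, r1, r2) -> t.
inferred_facts = [
    (h, r1, r2, atomic[(b, r2)])
    for (h, r1), b in atomic.items()
    for r2 in relations
    if (b, r2) in atomic
]

# The key quantity from the thread: the ratio of inferred to atomic facts in training
# (the "critical data distribution"), rather than the absolute training-set size.
random.shuffle(inferred_facts)
n_train_inferred = int(INFERRED_TO_ATOMIC_RATIO * len(atomic_facts))
train = atomic_facts + inferred_facts[:n_train_inferred]
held_out = inferred_facts[n_train_inferred:]  # probes generalization to unseen compositions
# (the paper further splits evaluation into in-distribution vs. OOD compositions; omitted here)

print(f"{len(atomic_facts)} atomic facts, {n_train_inferred} inferred facts in training, "
      f"{len(held_out)} held out")
```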
RT @xiangyue96: ✈️Flying to #NeurIPS2024 tmr! Excited to reconnect with old friends and meet new ones. I co-authored 6 papers at NeurIPS👇.…
0 · 59 · 0
RT @AkariAsai: 🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations…
0 · 118 · 0
RT @yugu_nlp: ❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋♂️Short answer: Using your LLM…
0 · 89 · 0
RT @BotaoYu24: 🤔 Can LLMs with tools always outperform those without? Perhaps not... 🚀 In our new work, we introduce ChemAgent, an enhance…
0 · 23 · 0
RT @ysu_nlp: People into agents, let me pitch something to you: 🌟 An agent that works across every platform (web, desktop & mobile) 🌟 Visu…
0 · 93 · 0
RT @RonZiruChen: 🚀 Can language agents automate data-driven scientific discovery? Not yet. But we're making strides. Introducing **Science…
0 · 40 · 0
RT @ShijieChen98: Is generation always the best way to use LLMs? 🤔 At least not for re-ranking! Excited to share our latest work: Attenti…
0 · 33 · 0
RT @hhsun1: Our work that studies grokked transformers on reasoning and their generalization behaviors is accepted to #NeurIPS2024 @NeurIPS…
0 · 10 · 0
Thanks for the interest and comments! I think the relations don't need to be "injective"; they just need to be functions (and hence map the given subject entity to a unique object entity). In other words, they are N-to-one relations. And yes, the text isn't clear enough about this, so I'll revise it for sure. For the comparison task, there's also always a unique answer, determined by the two attribute values.
0 · 0 · 2
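For concreteness, a minimal illustration of the "N-to-one" point above; the entity and relation names are made up:

```python
# Each relation is a function from subject to object, so it need not be injective:
# several subjects may share an object, but each subject maps to exactly one object.
mother_of = {"alice": "carol", "bob": "carol"}   # N-to-one: not injective
employer_of = {"carol": "acme"}

def compose(r1, r2, subject):
    """Two-hop lookup r2(r1(subject)); unique because both relations are functions."""
    return r2[r1[subject]]

assert compose(mother_of, employer_of, "alice") == "acme"
assert compose(mother_of, employer_of, "bob") == "acme"
```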
@scychan_brains Yeah, the results other than the grokked transformer's are similar. Note also that this is final-answer accuracy; for models with CoT, if we look into the reasoning, most of it is actually wrong (discussed a bit above), so overall they just fail pretty badly.
1 · 1 · 0
Thanks! Yeah, we were also curious about why CoT sometimes hurts performance and looked into it a bit: it turns out that with CoT, there's a higher percentage of examples where the LLM ends up saying that the answer cannot be decided (which we treat as wrong, since the answer can be decided). Intuitively, what's going on is that when the model verbalizes its reasoning in context, it's better at recognizing that the logic so far doesn't really work out, and hence admits that it can't reach an answer.
0 · 0 · 5
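A rough sketch of the grading rule described above; the marker strings and the function are assumptions for illustration, not the paper's actual evaluation code:

```python
# Abstentions ("the answer cannot be decided") are scored as incorrect,
# since every query in the task has a determinable answer.
ABSTAIN_MARKERS = ("cannot be determined", "cannot be decided", "not enough information")

def grade(model_answer: str, gold: str) -> bool:
    text = model_answer.lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return False  # treat abstention as wrong
    return gold.lower() in text

print(grade("The answer cannot be determined from the given facts.", "carol"))  # False
print(grade("Therefore, the answer is Carol.", "carol"))                        # True
```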
RT @hhsun1: My students/collaborators and I will present two papers in the poster sessions #ACL2024NLP on Tue/Wed: 1. "AttributionBench:…
0 · 11 · 0