![Boshi Wang Profile](https://pbs.twimg.com/profile_images/1573913186629197825/vQEKxcyt_x96.jpg)
Boshi Wang
@BoshiWang2
Followers: 1K · Following: 228 · Statuses: 102
Fourth-year Ph.D. @OhioState. Prev intern @MSFTResearch
The Ohio State University
Joined May 2021
RT @BoyuGouNLP: 🚀 UGround accepted to #ICLR2025 [scores=10/8/8/5]! 🎉 We’re also thrilled to share some exciting updates: ✨ UGround is SOTA…
0 · 24 · 0
RT @BoyuGouNLP: With recent advancements like Claude 3.5 Computer Use and Gemini 2.0, the field of GUI Agents is rapidly evolving. 🚀 Excit…
0 · 19 · 0
Will attend #NeurIPS2024 Dec 10-14 and present our work studying Transformer's grokking in implicit reasoning. Excited to meet old and new friends!
Thanks @_akhaliq for sharing our work. Very proud to introduce my star student @BoshiWang2's new work @osunlp: Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

Can transformers reason? Are transformers fundamentally limited in compositionality and systematic generalization? I think our work's findings can contribute to meaningful debates over these questions. Key findings (with mixed results):

(1) Figure 1 below: We focus on two representative reasoning types, composition and comparison, and show that transformers can learn implicit reasoning, but only through grokking. The levels of generalization vary across reasoning types: when faced with OOD examples, transformers fail to systematically generalize for composition but succeed for comparison.

(2) Figure 2 below: What happens during grokking? Why does grokking happen? Why does the model fail at OOD generalization for composition but succeed for comparison? We conduct a mechanistic analysis of the model's internals throughout training and find: 1) the gradual formation of the generalizing circuit throughout grokking, and 2) a connection between systematicity and the circuit's configuration, i.e., the way atomic knowledge and rules are stored and applied within the circuit. For the comparison task, grokking yields a "parallel circuit" in which atomic facts are stored and retrieved in the same region, which enables systematicity. For composition, the model does acquire the composition rule through grokking, but it has no incentive to store in the upper layers atomic facts that never appear as the second hop during training.

Connections with existing research:
1. We find that the speed of improvement in generalization correlates with the ratio between inferred and atomic facts in training (critical data distribution), and depends little on the absolute size of the training data. This seems to contradict the hypothesis of a critical data size in prior work such as Varma et al.
2. Our work provides a mechanistic understanding of existing findings that transformers seem to reduce compositional reasoning to linearized pattern matching (Dziri et al. @nouhadziri) and that LLMs show positive evidence of first-hop reasoning but not the second hop (Yang et al. @soheeyang_).
3. Why is implicit reasoning with parametric memory of knowledge and rules practically important? To show its potential, we demonstrate that on a complex reasoning task with a large search space, a fully grokked transformer can achieve near-perfect accuracy while GPT-4 Turbo and Gemini-1.5-Pro are close to random guessing.

😀 Fun fact about the title: we went back and forth many times and created ~10 candidate titles. Another title I personally liked very much is "Grokking of Implicit Reasoning: What Happens Inside Transformers?" But that doesn't deliver our key conclusion or our mechanistic analysis approach. Finally, Boshi came up with this title, which sounds very romantic (although perhaps less scientific) to me, but captures most aspects of our paper very well.

P.S. @wangboshi is truly intellectually stimulating to work with. If you have related internship/collaboration opportunities, feel free to reach out! Joint work with @xiangyue96 @ysu_nlp
2 · 5 · 28
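A rough sketch of how a synthetic two-hop composition dataset along these lines could be built; the entity/relation counts, sampling scheme, and the inferred-to-atomic ratio value are illustrative assumptions, not the paper's exact setup:

```python
import random

random.seed(0)

NUM_ENTITIES = 2000
NUM_RELATIONS = 200
RELS_PER_ENTITY = 20
INFERRED_TO_ATOMIC_RATIO = 7.2  # illustrative value; the actual ratio is the studied knob

entities = list(range(NUM_ENTITIES))
relations = list(range(NUM_RELATIONS))

# Each relation behaves as a function: a (head, relation) pair maps to exactly one tail
# entity (the "N-to-one" property discussed in the replies further down).
atomic = {
    (h, r): random.choice(entities)
    for h in entities
    for r in random.sample(relations, k=RELS_PER_ENTITY)
}
atomic_facts = [(h, r, t) for (h, r), t in atomic.items()]

# Two-hop (inferred) facts: if r1(h) = b and r2(b) = t, then (h, r1, r2) -> t.
inferred_facts = [
    (h, r1, r2, atomic[(b, r2)])
    for (h, r1), b in atomic.items()
    for r2 in relations
    if (b, r2) in atomic
]

# The key quantity from the thread: the ratio of inferred to atomic facts in training
# (the "critical data distribution"), rather than the absolute training-set size.
random.shuffle(inferred_facts)
n_train_inferred = int(INFERRED_TO_ATOMIC_RATIO * len(atomic_facts))
train = atomic_facts + inferred_facts[:n_train_inferred]
held_out = inferred_facts[n_train_inferred:]  # probes generalization to unseen compositions
# (the paper further splits evaluation into in-distribution vs. OOD compositions; omitted here)

print(f"{len(atomic_facts)} atomic facts, {n_train_inferred} inferred facts in training, "
      f"{len(held_out)} held out")
```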
RT @xiangyue96: ✈️Flying to #NeurIPS2024 tmr! Excited to reconnect with old friends and meet new ones. I co-authored 6 papers at NeurIPS👇.…
0 · 59 · 0
RT @AkariAsai: 🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations…
0 · 118 · 0
RT @yugu_nlp: ❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋♂️Short answer: Using your LLM…
0 · 89 · 0
RT @BotaoYu24: 🤔 Can LLMs with tools always outperform those without? Perhaps not... 🚀 In our new work, we introduce ChemAgent, an enhance…
0 · 23 · 0
RT @ysu_nlp: People into agents, let me pitch something to you: 🌟 An agent that works across every platform (web, desktop & mobile) 🌟 Visu…
0 · 93 · 0
RT @RonZiruChen: 🚀 Can language agents automate data-driven scientific discovery? Not yet. But we're making strides. Introducing **Science…
0 · 40 · 0
RT @ShijieChen98: Is generation always the best way to use LLMs? 🤔 At least not for re-ranking! Excited to share our latest work: Attenti…
0 · 33 · 0
RT @hhsun1: Our work that studies grokked transformers on reasoning and their generalization behaviors is accepted to #NeurIPS2024 @NeurIPS…
0 · 10 · 0
Thanks for the interest and comments! I think the relations don't need to be "injective"; they just need to be functions (and hence map the given subject entity to a unique object entity). In other words, they are N-to-one relations. And yes, the text isn't clear enough about this, so I'll revise it for sure. For the comparison task, there's also always a unique answer, determined by the two attribute values.
0 · 0 · 2
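For concreteness, a minimal illustration of the "N-to-one" point above; the entity and relation names are made up:

```python
# Each relation is a function from subject to object, so it need not be injective:
# several subjects may share an object, but each subject maps to exactly one object.
mother_of = {"alice": "carol", "bob": "carol"}   # N-to-one: not injective
employer_of = {"carol": "acme"}

def compose(r1, r2, subject):
    """Two-hop lookup r2(r1(subject)); unique because both relations are functions."""
    return r2[r1[subject]]

assert compose(mother_of, employer_of, "alice") == "acme"
assert compose(mother_of, employer_of, "bob") == "acme"
```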
@scychan_brains Yeah, the results other than the grokked transformer's are similar. Note also that this is final-answer accuracy; for models with CoT, if we look into the reasoning, most of it is actually wrong (discussed a bit above), so overall they just fail pretty badly.
1 · 1 · 0
Thanks! Yeah, we were also curious about why CoT sometimes hurts performance and looked into it a bit: it turns out that with CoT, there's a higher percentage of examples where the LLM ends up saying that the answer cannot be decided (which we treat as wrong, since the answer can be decided). Intuitively, what's going on is that when the model verbalizes its reasoning in context, it's better at recognizing that the logic so far doesn't really work out, and hence admits that it can't reach an answer.
0 · 0 · 5
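A rough sketch of the grading rule described above; the marker strings and the function are assumptions for illustration, not the paper's actual evaluation code:

```python
# Abstentions ("the answer cannot be decided") are scored as incorrect,
# since every query in the task has a determinable answer.
ABSTAIN_MARKERS = ("cannot be determined", "cannot be decided", "not enough information")

def grade(model_answer: str, gold: str) -> bool:
    text = model_answer.lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return False  # treat abstention as wrong
    return gold.lower() in text

print(grade("The answer cannot be determined from the given facts.", "carol"))  # False
print(grade("Therefore, the answer is Carol.", "carol"))                        # True
```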
RT @hhsun1: My students/collaborators and I will present two papers in the poster sessions #ACL2024NLP on Tue/Wed: 1. "AttributionBench:…
0 · 11 · 0