Boshi Wang

@BoshiWang2

Followers
1K
Following
228
Statuses
102

Fourth-year Ph.D. @OhioState. Prev intern @MSFTResearch

The Ohio State University
Joined May 2021
@BoshiWang2
Boshi Wang
5 months
Can OpenAI o1 tackle hard reasoning problems? We tested it on the complex reasoning task from our Grokked Transformers paper. It turns out that o1-preview struggles a lot, much like earlier LLMs; a grokked transformer, on the other hand, nails it near-perfectly.
Tweet media one
15
78
534
@BoshiWang2
Boshi Wang
19 days
RT @BoyuGouNLP: 🚀 UGround accepted to #ICLR2025 [scores=10/8/8/5]! 🎉 We’re also thrilled to share some exciting updates: ✨ UGround is SOTA…
0
24
0
@BoshiWang2
Boshi Wang
2 months
RT @BoyuGouNLP: With recent advancements like Claude 3.5 Computer Use and Gemini 2.0, the field of GUI Agents is rapidly evolving. 🚀 Excit…
0
19
0
@BoshiWang2
Boshi Wang
2 months
Will attend #NeurIPS2024 Dec 10-14 and present our work studying Transformer's grokking in implicit reasoning. Excited to meet old and new friends!
@hhsun1
Huan Sun (OSU)
9 months
Thanks @_akhaliq for sharing our work. Very proud to introduce my star student @BoshiWang2's new work @osunlp: Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization.

Can transformers reason? Are transformers fundamentally limited in compositionality and systematic generalization? I think our work's findings can contribute to meaningful debates over these questions. Key findings (with mixed results):

(1) Figure 1 below: We focus on two representative reasoning types, composition and comparison, and show that transformers can learn implicit reasoning, but only through grokking. The level of generalization varies across reasoning types: when faced with OOD examples, transformers fail to systematically generalize for composition but succeed for comparison.

(2) Figure 2 below: What happens during grokking? Why does grokking happen? Why does the model fail at OOD generalization for composition but succeed for comparison? We conduct a mechanistic analysis of the model's internals throughout training and find: 1) the gradual formation of the generalizing circuit throughout grokking, and 2) a connection between systematicity and the circuit's configuration, i.e., the way atomic knowledge and rules are stored and applied within the circuit. The comparison task admits a "parallel circuit" that the transformer learns during grokking, which allows atomic facts to be stored and retrieved in the same region and enables systematicity. For composition, the model does acquire the composition rule through grokking, but it has no incentive to store, in the upper layers, atomic facts that never appear as the second hop during training.

Connections with existing research:
1. We find that the speed of improvement in generalization correlates with the ratio between inferred and atomic facts in training (critical data distribution) and depends little on the absolute size of the training data. This seems to contradict the critical-data-size hypothesis in prior work such as Varma et al.
2. Our work provides a mechanistic understanding of existing findings that transformers seem to reduce compositional reasoning to linearized pattern matching (Dziri et al. @nouhadziri) and that LLMs show positive evidence for the first hop of reasoning but not the second (Yang et al. @soheeyang_).
3. Why is implicit reasoning with parametric memory of knowledge and rules practically important? To show its potential, we demonstrate that on a complex reasoning task with a large search space, a fully grokked transformer can achieve near-perfect accuracy while GPT-4 Turbo and Gemini-1.5-Pro are close to random guessing.

😀 Fun fact about the title: We went back and forth many times and created ~10 candidate titles. Another title I personally liked very much is "Grokking of Implicit Reasoning: What Happens Inside Transformers?" But that does not convey our key conclusion or our mechanistic analysis approach. Finally, Boshi came up with this title, which sounds very romantic (although perhaps less scientific) to me, but captures most aspects of our paper very well.

P.S. @wangboshi is truly intellectually stimulating to work with. If you have related internship/collaboration opportunities, feel free to reach out! Joint work with @xiangyue96 @ysu_nlp
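To make the setup concrete, here is a minimal sketch of how synthetic composition data of this kind can be built. The entity/relation counts, the 90/10 split, and the RATIO value are illustrative assumptions, not the paper's exact configuration: atomic facts are (head, relation) -> tail lookups where each relation behaves as a function, inferred facts compose two lookups, and training compositions avoid the held-out (OOD) atomic facts.

```python
# Toy sketch of a composition-style dataset (illustrative assumptions, not the
# paper's exact setup). Each relation is a function: every (head, relation)
# pair has exactly one tail, so every composed query has a unique answer.
import random

random.seed(0)
entities = list(range(200))
relations = list(range(20))
RATIO = 7.2  # assumed ratio |inferred| / |atomic| in the training mix

# Atomic facts: (head, relation) -> tail, one outgoing edge per pair.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# Hold out 10% of atomic facts as OOD: training compositions never touch them,
# so two-hop queries over them probe systematic (OOD) generalization.
keys = list(atomic)
random.shuffle(keys)
cut = int(0.9 * len(keys))
id_keys, ood_set = keys[:cut], set(keys[cut:])

def two_hop(h, r1, r2):
    """Ground-truth answer for the composed query (h, r1, r2)."""
    return atomic[(atomic[(h, r1)], r2)]

train_inferred = []
while len(train_inferred) < int(RATIO * len(keys)):
    h, r1 = random.choice(id_keys)
    r2 = random.choice(relations)
    if (atomic[(h, r1)], r2) in ood_set:
        continue  # second hop must also be an in-distribution atomic fact
    train_inferred.append(((h, r1, r2), two_hop(h, r1, r2)))

print(f"{len(keys)} atomic facts, {len(train_inferred)} inferred training facts")
```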
Tweet media one
Tweet media two
2
5
28
@BoshiWang2
Boshi Wang
2 months
RT @xiangyue96: ✈️Flying to #NeurIPS2024 tmr! Excited to reconnect with old friends and meet new ones. I co-authored 6 papers at NeurIPS👇.…
0
59
0
@BoshiWang2
Boshi Wang
2 months
RT @AkariAsai: 🚨 I’m on the job market this year! 🚨 I’m completing my @uwcse Ph.D. (2025), where I identify and tackle key LLM limitations…
0
118
0
@BoshiWang2
Boshi Wang
3 months
RT @yugu_nlp: ❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋‍♂️Short answer: Using your LLM…
0
89
0
@BoshiWang2
Boshi Wang
3 months
RT @BotaoYu24: 🤔 Can LLMs with tools always outperform those without? Perhaps not... 🚀 In our new work, we introduce ChemAgent, an enhance…
0
23
0
@BoshiWang2
Boshi Wang
4 months
RT @ysu_nlp: People into agents, let me pitch something to you: 🌟 An agent that works across every platform (web, desktop & mobile) 🌟 Visu…
0
93
0
@BoshiWang2
Boshi Wang
4 months
RT @RonZiruChen: 🚀 Can language agents automate data-driven scientific discovery? Not yet. But we're making strides. Introducing **Science…
0
40
0
@BoshiWang2
Boshi Wang
4 months
RT @ShijieChen98: Is generation always the best way to use LLMs? 🤔 At least not for re-ranking! Excited to share our latest work: Attenti…
0
33
0
@BoshiWang2
Boshi Wang
5 months
RT @hhsun1: Our work that studies grokked transformers on reasoning and their generalization behaviors is accepted to #NeurIPS2024 @NeurIPS
0
10
0
@BoshiWang2
Boshi Wang
5 months
Thanks for the interest and comments! I think the relations don't need to be "injective"; they just need to be functions (and hence map the given subject entity to a unique object entity). In other words, they are N-to-one relations. And yes, the text is not clear enough around this; I'll for sure revise it. For the comparison task, there is also always a unique answer, determined by the two attribute values.
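A tiny illustration of the distinction (the relation name and entities below are made up for the example): each relation is a function, so a given subject always maps to a unique object, but two subjects may share an object (N-to-one), so injectivity is not required.

```python
# A relation as a function: each subject maps to exactly one object, so any
# query (and any composition of such lookups) has a unique answer. It need not
# be injective: two subjects may share the same object, i.e. it is N-to-one.
mother_of = {"Alice": "Carol", "Bob": "Carol", "Dana": "Eve"}  # made-up facts

def is_injective(rel):
    """Injective would require distinct subjects to map to distinct objects."""
    return len(set(rel.values())) == len(rel)

print(mother_of["Alice"])       # 'Carol': the unique object for this subject
print(is_injective(mother_of))  # False: Alice and Bob share a mother, which is fine
```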
0
0
2
@BoshiWang2
Boshi Wang
5 months
@scychan_brains Yeah, the results for everything except the grokked transformer are similar. Note also that this is final-answer accuracy; for models with CoT, if we look into the reasoning, most of it is actually wrong (discussed a bit above), so overall they just fail pretty badly.
1
1
0
@BoshiWang2
Boshi Wang
5 months
Thanks! Yeah, we were also curious about why CoT sometimes hurts performance and looked a bit into it. It turns out that with CoT, there's a higher percentage of examples where the LLM ends up saying that the answer cannot be decided (which we treat as wrong, since the answer can be decided). Intuitively, what's going on is that when the model verbalizes its reasoning in context, it is better at recognizing that the logic so far doesn't really work out, and hence more often admits that it can't reach the answer.
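A rough sketch of the scoring rule being described (the refusal patterns and answer extraction below are assumptions for illustration, not the actual evaluation code): responses claiming the answer cannot be determined are counted as wrong, since every query in the task has a decidable answer.

```python
# Illustrative scoring sketch: "cannot be decided" responses count as wrong,
# because every query in the task actually has a unique, decidable answer.
# Refusal patterns and answer extraction are assumptions for this example.
import re

REFUSAL_PATTERNS = [r"cannot be (decided|determined)", r"not enough information"]

def score_response(response: str, gold: str) -> bool:
    """Return True only if the model commits to the correct final answer."""
    text = response.lower()
    if any(re.search(p, text) for p in REFUSAL_PATTERNS):
        return False  # treated as wrong: the answer is decidable
    return gold.lower() in text.split("answer:")[-1]

# With CoT, a larger fraction of responses fall into the refusal branch.
print(score_response("Reasoning... Answer: entity_42", "entity_42"))                   # True
print(score_response("The answer cannot be determined from the facts.", "entity_42"))  # False
```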
0
0
5
@BoshiWang2
Boshi Wang
5 months
Also feel free to check out our paper on grokking of implicit reasoning in transformers:
4
4
70
@BoshiWang2
Boshi Wang
6 months
RT @hhsun1: My students/collaborators and I will present two papers in the poster sessions #ACL2024NLP on Tue/Wed: 1. "AttributionBench:…
0
11
0