![Liyuan Liu (Lucas) Profile](https://pbs.twimg.com/profile_images/1709957508947992576/QniC-E0F_x96.jpg)
Liyuan Liu (Lucas)
@LiyuanLucas
Followers
787
Following
400
Statuses
142
Researcher @MSFTResearch | prev. @dmguiuc. Working on deep learning heuristics (aka tricks). He/him
Redmond, WA
Joined October 2015
Agree 100%. I've been a great admirer of Anthropic, and when I read the blog, it was like... wow, even Anthropic loses its cool
Finally took time to go over Dario's essay on DeepSeek and export control and to be honest it was quite painful to read. And I say this as a great admirer of Anthropic and a big user of Claude* The first half of the essay reads like a lengthy attempt to justify that closed-source models are still significantly ahead of DeepSeek. However, it mostly refers to internal unpublished evals, which limits the credit you can give it, and statements like «DeepSeek-V3 is close to SOTA models and stronger on some very narrow tasks» turning into the general conclusion «DeepSeek-V3 is actually worse than those US frontier models – let's say by ~2x on the scaling curve» left me generally doubtful. The same applies to the takeaway that all the discoveries and efficiency improvements of DeepSeek were discovered long ago by closed-model companies, a statement mostly resulting from a comparison of DeepSeek's openly published $6M training number with some vague «few $10M» on the Anthropic side, without much more detail. I have no doubt the Anthropic team is extremely talented, and I've regularly shared how impressed I am with Sonnet 3.5, but this long-winded comparison of open research with vague closed research and undisclosed evals left me less convinced of their lead than I was before reading it. Even more frustrating was the second half of the essay, which dives into the US-China race scenario and totally misses the point that the DeepSeek model is open-weights, and largely open-knowledge thanks to its detailed tech report (and feel free to follow Hugging Face's open-r1 reproduction project for the remaining non-public part: the synthetic dataset).
If both the DeepSeek and Anthropic models had been closed source, yes, the arms-race interpretation could have made sense, but having one of the models freely and widely available for download, with a detailed scientific report, renders the whole «closed-source arms-race competition» argument artificial and unconvincing in my opinion. Here is the thing: open-source knows no border, both in its usage and its creation. Every company in the world, be it in Europe, Africa, South America or the USA, can now directly download and use DeepSeek without sending data to a specific country (China for instance) or depending on a specific company or server for running the core part of its technology. And just as most open-source libraries in the world are typically built by contributors from all over the world, we've already seen several hundred derivative models on the Hugging Face hub, created everywhere in the world by teams adapting the original model to their specific use cases and explorations. What's more, with the open-r1 reproduction and the DeepSeek paper, the coming months will clearly see many open-source reasoning models released by teams from all over the world. Just today, two other teams, AllenAI in Seattle and Mistral in Paris, independently released open-source models (Tülu and Small 3) which are already challenging the new state of the art (with AllenAI indicating that its Tülu model surpasses the performance of DeepSeek-V3). And the scope is even much broader than this geographical aspect. Here is the thing we don't talk nearly enough about: open-source will be more and more essential for our… safety! As AI becomes central to our lives, resiliency will increasingly become a very important element of this technology. Today we're dependent on internet access for almost everything. Without access to the internet, we lose all our social media/news feeds, can't order a taxi, book a restaurant, or reach someone on WhatsApp.
Now imagine an alternate world where all the data transiting through the internet had to go through a single company's data centers. The day this company suffers a single outage, the whole world would basically stop spinning (picture the recent CrowdStrike outage magnified a millionfold). Soon, as AI assistants and AI technology permeate our whole lives to simplify many of our online and offline tasks, we (and companies using AI) will start to depend more and more on this technology for our daily activities, and we will similarly start to find annoying, or even painful, any downtime in these AI assistants caused by outages. The best way to avoid future downtime will be to build resilience deep into our technological chain. Open-source has many advantages, like shared training costs, tunability, control, ownership and privacy, but one of its most fundamental virtues in the long term, as AI becomes deeply embedded in our world, will likely be its strong resilience. It is one of the most straightforward and cost-effective ways to distribute compute across many independent providers, and even to run models locally and on device with minimal complexity. More than national pride and competition, I think it's time to start thinking globally about the challenges and social changes that AI will bring everywhere in the world. And open-source technology is likely our most important asset for safely transitioning to a resilient digital future where AI is integrated into all aspects of society. *Claude is my default LLM for complex coding. I also love its character, with hesitations and pondering, like a prelude to the chain-of-thought of more recent reasoning models like the DeepSeek generation.
0
0
11
RT @ZeyuanAllenZhu: Totally disagree. DeepSeek has >= 4 IOI gold medalists from team China (each = multiple IOI golds in other countries) a…
0
158
0
RT @xwang_lk: It is just so sad that the #NeurIPS2024 main conference ended with such a racist remark by a faculty when talking about ethi…
0
287
0
RT @eagle_hz: I'm on the 2025 faculty job market! I've been incredibly grateful to work with inspiring advisors, mentors & peers. My re…
0
26
0
The bias and discrimination in the keynote were astonishing, in both speech and writing. The confrontation spoke my mind in a way I couldn't have myself. Combating discrimination starts with voicing ourselves, and there is more to do
Someone confronted it on the spot, and they said "Maybe there is one, maybe they are common, who knows what. I hope it was an outlier." Even this explanation is full of implicit racial bias. See the full conv:
0
3
83
RT @sunjiao123sun_: Mitigating racial bias from LLMs is a lot easier than removing it from humans! Can't believe this happened at the bes…
0
905
0
We will be hosting a coffee chat session at the Microsoft booth on Thursday, December 12, 2024, from 10:00 AM to 10:30 AM. Feel free to stop by to learn more about our internship opportunities! #NeurIPS2024
Join Microsoft Research's Deep Learning team in Redmond as a Summer 2025 intern! Apply at the link. I'll be at #NeurIPS2024 next week - let's connect and chat! Please help us share this post in your networks : ) #DeepLearning #Internship #MSR
0
5
48
Join Microsoft Research's Deep Learning team in Redmond as a Summer 2025 intern! Apply at the link. I'll be at #NeurIPS2024 next week - let's connect and chat! Please help us share this post in your networks : ) #DeepLearning #Internship #MSR
9
37
294
RT @murefil: The ML team at @MSFTResearch Montréal is hiring a Senior Researcher with a background in ML / NLP!!! Come work with us at t…
0
37
0
RT @rlbarter: Tomorrow is the official release of @bbiinnyyuu and my book, Veridical Data Science, with @mitpress. I'm so excited to for…
0
8
0
What surprises me the most: without finetuning the LLM, it recognizes and analyzes the information from non-native embeddings, AND it outperforms soft prompt tuning. This had me rethinking what ICL is and what it is capable of...
We already know language models can learn in-context from text. Now, we show that LLMs can perform in-context learning directly from continuous representations across any modality! All it takes is a lightweight projector to align these representations into the LLM's space. Introducing Vector-ICL: it works on text, numbers, time-series, graphs, and even brain signals, outperforming ICL and domain-specific models. This research pushes the limits of what LLMs can do beyond traditional token-based learning! Check it out: Huge thanks to @csinva @LiyuanLucas @shangjingbo @JianfengGao0217 More details
0
1
13
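As a rough illustration of the projector idea in the Vector-ICL post above, here is a toy NumPy sketch; the sizes, names, and random stand-ins are my own and this is not the actual Vector-ICL code:

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_model = 16, 64  # toy sizes; a real setup uses the LLM's hidden size

# Hypothetical lightweight projector: a single linear map, the only
# trainable component in this sketch (the LLM itself stays frozen).
W = rng.normal(scale=0.02, size=(d_in, d_model))
b = np.zeros(d_model)

def project(x):
    """Map a continuous representation (e.g., a time-series or graph
    embedding) into the LLM's token-embedding space."""
    return x @ W + b

# A "vector token" for one non-text input...
vec = rng.normal(size=d_in)
soft_token = project(vec)  # shape (d_model,)

# ...prepended to the embeddings of the text prompt, so the frozen LLM
# attends to it like any other token.
prompt_embeddings = rng.normal(size=(5, d_model))  # stand-in for embedded text
inputs = np.vstack([soft_token[None, :], prompt_embeddings])
print(inputs.shape)  # (6, 64)
```

The surprising part highlighted in the reply is that, per the post, only this small projector needs training; the LLM consumes the projected vector in-context without any finetuning.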
RT @yufan_zhuang: We already know language models can learn in-context from text. Now, we show that LLMs can perform in-context learning d…
0
26
0
@roydanroy Don't think NeurIPS is a CV conference... It's like, although my NeurIPS'24 paper has experiments on an MNIST VAE, I wouldn't call it a computer vision work...
0
0
2
RT @ylecun: @polynoamial @thomaspower @OpenAI I'm sorry Noam, but a blog post does not come close to meeting the standards of reproducibili…
0
36
0
@roydanroy Maybe another option is TMLR; I've only submitted there once, but the review quality is really good
0
0
0
@HealthyCode @_akhaliq Phi-3.5-MoE is also using the GRIN algorithm. If you actually read the paper, you may notice "Note a different version of mid-training and post-training, emphasizing long context and multilingual ability, has been conducted and has been released at "
0
0
0
Good question. To put it simply, 5/8 = (1 + 1/4) / ((1 + 1/4) + (3/4)). The 1 comes from the gating gradient, 1/4 is one coefficient from Heun's third-order method, and 3/4 is the other coefficient from Heun's third-order method. In Appendix E of the ReinMax paper, I added a simple derivation for Heun's second-order method [1]. For more background, you can search for RK (Runge-Kutta) methods [2]. 1. 2.
0
1
1
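The arithmetic in the reply above can be checked mechanically; this is just a sanity check of the stated fraction using exact rationals, with variable names of my own choosing:

```python
from fractions import Fraction

# Per the thread: 1 from the gating gradient, and 1/4 and 3/4 as the
# two coefficients from the Heun-method update.
gating = Fraction(1)
heun_a = Fraction(1, 4)
heun_b = Fraction(3, 4)

# (1 + 1/4) / ((1 + 1/4) + (3/4)) = (5/4) / 2 = 5/8
ratio = (gating + heun_a) / ((gating + heun_a) + heun_b)
print(ratio)  # 5/8
```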
This is the messy (but interesting) part of MoE...

a. In practice, conventional MoE usually uses `topk` for routing directly (e.g., [1]), but it's not deterministic from my perspective, given other randomness like jitter/dropout. E.g., if Gumbel noise is added (instead of jitter/dropout), then it's equivalent to softmax sampling. I will share more related observations on this in c. and d.

b. The capacity constraint, etc., mentioned in the quoted paragraph are related to token dropping. As for token-dropping strategies, it is true that one popular strategy is to use the weight to decide which token to drop (e.g., [2]). However, such a strategy also adds a strong regularization to router learning (similar to the clip in PPO, at a high level).

c. In the Switch Transformer paper, they have some comparisons on sampling (Table 11 in the appendix of [3]). We have similar observations, i.e., sampling with jitter works, but sampling from softmax directly doesn't work at all. We constructed the MaskedSoftmax to mimic the mechanism of the jitter noise as in P34 of [3].

d. If you compare the jitter noise in P34 of [3], you may find it has a similar form to the Gumbel reparameterization trick. Different from softmax / Gumbel noise, the distributions with jitter noise are different for z and z+c (z is the logits and c is a constant), which I believe is a nice property. I did some simulations on the distributions when sampling with jitter noise and observed that the distribution is more similar to uniform when the logits are small, and more similar to low-temperature softmax (near one-hot) when the logits are large.

e. The released code and the paper of Switch Transformer add jitter noise in different ways. The paper (P34 of [3]) adds the jitter after the linear, while the code [4] adds it before the linear. At small scale we tried both, together with the MaskedSoftmax, and none of the three seems to make much difference empirically.

1. 2. 3. 4.
0
0
1
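A toy NumPy simulation of the routing variants discussed above (not the actual Switch Transformer or GRIN code; the logits and noise magnitude are made up): it checks the claim that Gumbel-noise argmax is exactly sampling from the softmax, and shows that multiplicative jitter induces a different routing distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def route_gumbel(logits, n=200_000):
    """Top-1 routing with additive Gumbel noise on the logits;
    equivalent to sampling an expert from softmax(logits)."""
    g = -np.log(-np.log(rng.uniform(size=(n, logits.size))))
    counts = np.bincount((logits + g).argmax(axis=1), minlength=logits.size)
    return counts / n

def route_jitter(logits, n=200_000, eps=1e-1):
    """Top-1 routing after multiplicative jitter noise (Switch-style
    noise, applied here after the linear as in the paper variant)."""
    noise = rng.uniform(1 - eps, 1 + eps, size=(n, logits.size))
    counts = np.bincount((logits * noise).argmax(axis=1), minlength=logits.size)
    return counts / n

logits = np.array([1.0, 1.2, 0.5])

print(softmax(logits))       # target distribution
print(route_gumbel(logits))  # empirical frequencies, should match closely
print(route_jitter(logits))  # jitter routing: a different, sharper distribution
```

With these logits, the jitter distribution concentrates almost entirely on the top expert while Gumbel routing reproduces the softmax probabilities, illustrating why the two noising schemes behave so differently as exploration mechanisms.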
RT @YouJiacheng: > Note a different version of mid-training and post-training, emphasizing long context and multilingual ability, has been…
0
1
0