![Wonmin Byeon Profile](https://pbs.twimg.com/profile_images/1242623128133459969/tzlJnA57_x96.jpg)
Wonmin Byeon (@wonmin_byeon) · 971 followers · 121 following · 58 statuses
Here is our new 8B Mamba-based hybrid LLM: higher MMLU than the 8B transformer baseline, plus long-context extension up to 128K sequence length.
An 8B hybrid SSM model trained on 3.5T tokens gets better accuracy than an 8B transformer trained on the same 3.5T-token dataset:
* 7% attention layers; the rest is Mamba-2
* MMLU jumps from 50 to 53.6%
* Training efficiency is the same
* Inference cost is much lower
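For intuition, here is a minimal sketch of what a mostly-Mamba-2 stack with a ~7% attention budget could look like. `Mamba2Block` below is a stand-in for a real Mamba-2 mixer (e.g. from the `mamba_ssm` package); the layer count, width, and interleaving rule are illustrative assumptions, not the released model's config.

```python
import torch
import torch.nn as nn

class Mamba2Block(nn.Module):
    """Placeholder for a real Mamba-2 mixer layer (hypothetical stand-in)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, d_model)  # stand-in computation
    def forward(self, x):
        return x + self.proj(self.norm(x))       # pre-norm residual block

class AttentionBlock(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        h = self.norm(x)
        # Causal masking omitted for brevity.
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

def build_hybrid_stack(n_layers=54, d_model=512, attn_ratio=0.07):
    """Spread a ~7% share of attention layers evenly through a Mamba-2 stack."""
    n_attn = max(1, round(n_layers * attn_ratio))  # 54 layers -> 4 attention
    stride = n_layers // n_attn
    layers, placed = [], 0
    for i in range(n_layers):
        if i % stride == stride // 2 and placed < n_attn:
            layers.append(AttentionBlock(d_model))
            placed += 1
        else:
            layers.append(Mamba2Block(d_model))
    return nn.Sequential(*layers)

model = build_hybrid_stack()
x = torch.randn(2, 16, 512)    # (batch, seq, d_model); the real model is far wider
print(model(x).shape)          # torch.Size([2, 16, 512])
```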
RT @rupspace: Wrote a post about Highway networks, ResNets and subtleties of architecture comparisons
RT @PavloMolchanov: 🚀 Introducing Hymba-1.5B: a new hybrid architecture for efficient small language models! ✅ Outperforms Llama, Qwen, an…
Our new hybrid model is out! Hymba-1.5B even outperforms LLaMA 3.2-3B. Check out the paper for more details.
Sharing our team’s latest work on Hymba, an efficient small language model with a hybrid architecture. Tech report:

Discover the trade-off between Mamba and attention, how the two can be combined, how the attention-sink and forced-to-attend phenomena can be mitigated, and how the KV cache can be shared across layers. Learn how we built the model with an end-to-end ecosystem: data selection, architecture analysis and design, and training of Base and Instruct models, all opened to the community.

Did I mention that our Hymba-1.5B Base model outperforms LLaMA 3.2-3B while being trained on 7× fewer tokens and achieving 12× higher throughput? More details and model links coming soon!
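One concrete piece of the report is the cross-layer KV-cache sharing mentioned above. As a rough sketch of that idea, the toy code below lets two attention layers reuse a single K/V projection, so the cache is stored once per group of layers instead of once per layer. All names, dimensions, and the grouping scheme here are hypothetical; see the tech report for Hymba's actual mechanism.

```python
import torch
import torch.nn as nn

class SharedKVAttention(nn.Module):
    """Attention layer that can reuse another layer's K/V projection and cache."""
    def __init__(self, d_model, n_heads, kv_proj=None):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        # Layers in the same group are handed the same kv_proj module,
        # so their K/V tensors (and cache entries) can be shared.
        self.kv = kv_proj if kv_proj is not None else nn.Linear(d_model, 2 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, cache=None):
        B, T, D = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if cache is None:                      # compute and cache K/V once per group
            k, v = self.kv(x).chunk(2, dim=-1)
            k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            cache = (k, v)
        k, v = cache
        # Plain softmax attention; causal masking omitted for brevity.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(y), cache

# Two layers share one KV projection -> one cache entry serves both.
shared_kv = nn.Linear(512, 1024)
layer1 = SharedKVAttention(512, 8, kv_proj=shared_kv)
layer2 = SharedKVAttention(512, 8, kv_proj=shared_kv)
x = torch.randn(1, 32, 512)
y1, kv_cache = layer1(x)            # fills the shared cache
y2, _ = layer2(y1, cache=kv_cache)  # reuses it; no new K/V is stored
```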
RT @rupspace: Interested in Discrete Diffusion? I've just released a Github repo where you can learn about and play with discrete diffusion…
I will give a talk at KAIST today (July 17th) at 5pm PDT. The talk is about Mamba-based models and the findings from our recent paper. Everyone is welcome to join! The Zoom link is below.
Excited to host a Zoom talk by Dr. Wonmin Byeon on her research with NVIDIA colleagues, "An Alternative Architecture for Efficient Large Language Models (LLMs)". This will be on Zoom, July 17th 5pm PDT (July 18th 9am KST).

Abstract: Widely used Large Language Models (LLMs) are based on Transformer architectures. While Transformer-based language models are highly parallelizable and can model massive amounts of data, they introduce significant computational overhead due to the quadratic self-attention calculation, especially on longer sequences. They also have large inference-time memory requirements from the key-value cache. More recently, State Space Models (SSMs) like Mamba have been shown to offer fast, parallelizable training and inference. Studies show that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In this talk, I present the strengths and weaknesses of Mamba, Mamba-2, and Transformer models at larger scales. I also introduce a hybrid architecture consisting of Mamba-2, attention, and MLP layers. While pure SSMs match or exceed Transformers on many tasks, they lag behind on tasks that require strong copying or in-context learning abilities. In contrast, the hybrid model closely matches or exceeds the Transformer on all standard and long-context tasks and is predicted to be up to 8× faster when generating tokens at inference time.

Bio: Wonmin Byeon is a senior research scientist at NVIDIA Research in Santa Clara, US. She received her Ph.D. in Computer Science from the Technical University of Kaiserslautern, Germany. During her Ph.D., she was a visiting researcher at IDSIA, Switzerland, working with Juergen Schmidhuber. She then joined IDSIA and ETH Zurich as a post-doctoral researcher. Her research interests include recurrent neural networks, state space models, and linear RNNs for temporal and spatio-temporal domains.
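To make the KV-cache point in the abstract concrete, here is a back-of-the-envelope calculation for a hypothetical 8B-class transformer (32 layers, 32 KV heads, head dim 128, fp16; real configs vary). The cache grows linearly with context length, whereas a Mamba-2 layer keeps a fixed-size recurrent state no matter how long the context is.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per=2):
    # 2x for keys and values, stored at every layer for every token.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per

for seq in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=seq) / 2**30
    print(f"{seq:>7} tokens -> {gib:5.1f} GiB of KV cache")
# 4096 tokens -> 2.0 GiB; 131072 tokens -> 64.0 GiB, for a single sequence.
```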
w/ @RWaleffe, @DuncanARiach, @BrandonNor90881, Vijay Korthikanti, @tri_dao, @_albertgu, @ahatamiz1, Sudhakar Singh, @deepakn94, Garvit Kulshreshtha, Vartika Singh, Jared Casper, @jankautz, @MohammadShoeybi, @ctnzr
ConvSSM: State Space Models for long videos 🎉 We finally released the code and the pretrained models. Code: Paper: @NVIDIAAI @jimmysmith1919
📢 Excited to share our work at #NeurIPS2023: ConvSSM, a powerful sequence model for long videos. Poster: Tuesday at 5:15pm, Great Hall & Hall B1+B2 #705 (coming soon). Work done with @jimmysmith1919 @shalinidemello @jankautz 🧵👇
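For readers new to ConvSSM, here is a toy, sequential version of a convolutional state-space recurrence: the usual SSM state-update matrices become convolutions over a spatial state tensor. This sketch only illustrates the structure under my own simplifying assumptions; the paper's formulation includes a parallel scan that avoids this slow frame-by-frame loop during training.

```python
import torch
import torch.nn as nn

class ToyConvSSMCell(nn.Module):
    """Linear recurrence over frames, state_t = A*state_{t-1} + B*x_t,
    with A, B, C realized as convolutions instead of dense matrices."""
    def __init__(self, channels, k=3):
        super().__init__()
        pad = k // 2
        self.A = nn.Conv2d(channels, channels, k, padding=pad, bias=False)
        self.B = nn.Conv2d(channels, channels, k, padding=pad, bias=False)
        self.C = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, frames):               # frames: (T, B, C, H, W)
        state = torch.zeros_like(frames[0])
        outputs = []
        for x_t in frames:                    # sequential scan, for exposition only
            state = self.A(state) + self.B(x_t)
            outputs.append(self.C(state))
        return torch.stack(outputs)

cell = ToyConvSSMCell(channels=16)
video = torch.randn(8, 2, 16, 32, 32)         # 8 frames, batch of 2, 32x32 grid
print(cell(video).shape)                      # torch.Size([8, 2, 16, 32, 32])
```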