Neural Magic (Acquired by Red Hat)
@neuralmagic
Followers: 6K · Following: 1K · Statuses: 1K
We are on a mission to bring #opensource LLMs and vLLM to every enterprise on the planet. Join our bi-weekly vLLM office hours: https://t.co/fUVhNQ1dhs
Boston, MA
Joined May 2018
🚀 We're sharing monthly vLLM newsletters with updates, insights, and events for the community! 📖 Check out the January edition here:
Highlights this month:
📅 vLLM Office Hours: Distributed Inference (Jan 23)
📝 Blogs: Structured Decoding & XGrammar, 2024 Retrospective & 2025 Vision, Installing and Developing vLLM with Ease, and 2:4 Sparse Llama FP8
📍 Meetups: West Coast (Jan 22) & East Coast (Mar 11)
🌟 Red Hat’s acquisition of Neural Magic
Want it delivered to your inbox? 📩 Sign up at the bottom of the page!
RT @vllm_project: v0.7.2 is released! Featuring 🖼️ @Alibaba_Qwen Qwen2.5-VL, 🤗 @huggingface Transformers backend, and several @deepseek_ai…
RT @ishapuri101: [1/x] can we scale small, open LMs to o1 level? Using classical probabilistic inference methods, YES! Joint @MIT_CSAIL / @…
Today, Thursday, at 2:00PM ET, @rogerw0108 will cover how the team drove enhanced support for multimodal LLMs with vLLM v1. Join us to learn and ask questions:
RT @kernelcdub: Here's how we're achieving R1-like reasoning with small models leveraging probabilistic inference-time scaling w/out using…
We’ll share a deep dive on the vLLM production stack during our bi-weekly vLLM office hours on March 6th. Register via the link in our bio.
How do you currently deploy open LLMs? With @vllm_project, with @kubernetesio? vLLM production-stack is a new open-source, batteries-included reference implementation from the vLLM project that extends vLLM to production use. 👀
TL;DR:
🔄 Simple cluster deployment with Helm charts, including @grafana Labs, Prometheus
📊 Provides real-time insights into system health with metrics like TTFT, TBT, and throughput in Grafana
🦙 Uses vLLM to easily deploy Llama, Qwen, Gemma, Mistral
🔌 Drop-in replacement for the @OpenAI API, with a router to support multiple models (see the client sketch below)
⚡️ Up to 3-10x lower response delay and 2-5x higher throughput compared to alternatives
📈 KV cache sharing powered by LMCache
🤗 Part of the vLLM project and open source
🔜 Prefix-aware routing automatically sends queries to nodes with relevant context
🔜 Autoscaling based on vLLM-specific metrics, e.g. throughput
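For illustration, a minimal Python sketch of the OpenAI-compatible drop-in. The router endpoint, port, and model name below are assumptions for the example, not values from the announcement; any model the stack serves would work.

```python
# Minimal sketch: querying a vLLM production-stack router through the
# OpenAI-compatible API. Endpoint, port, and model name are illustrative
# assumptions, not values from the announcement.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30080/v1",  # hypothetical router address
    api_key="EMPTY",                       # vLLM-style servers typically ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model the router serves
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the router speaks the OpenAI wire protocol, existing clients can switch over by changing only the base URL.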
Join @pillar_vc in celebrating Neural Magic's acquisition by @RedHat, with a fireside chat featuring founders Nir Shavit (@nir_shavit) and Alex Matveev, CEO Brian Stevens (@addvin), CSAIL Director Daniela Rus, and Pillar VC's Jamie Goldstein (@jamieagoldstein)! The founders will share their journey from @MIT_CSAIL in 2018 to developing groundbreaking AI technology. After the discussion, attendees can network with the MIT community over food and drinks. RSVP here to attend:
New blog: Discover how DeepSeek models achieve better performance and scalability with multi-head latent attention (MLA) and FP8 optimizations in @vllm_project. Quick summary:
📈 Enhanced Performance: DeepSeek models see up to 3x throughput and 10x memory capacity improvements with MLA and FP8 kernel optimizations in vLLM v0.7.1.
🧠 Scalable Long-Context Inference: Optimized memory boosts token capacity from 54,560 to 512,000, enabling horizontal scalability with pipeline parallelism.
🛠️ New Innovations: MLA’s "matrix absorption" algorithm and other optimizations reduce memory usage while improving efficiency for complex, high-batch workloads.
Read the full story:
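As a rough sketch of trying this yourself: offline inference with vLLM's Python API, where supported DeepSeek checkpoints pick up the MLA and FP8 kernel paths in v0.7.1+. The model choice and parallelism setting below are assumptions, not from the post.

```python
# Minimal sketch: offline DeepSeek inference with vLLM >= 0.7.1, which
# applies the MLA / FP8 kernel optimizations for supported checkpoints.
# Model choice and tensor_parallel_size are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed checkpoint; adjust to your setup
    tensor_parallel_size=8,           # DeepSeek-V3 is large; size to your GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain multi-head latent attention briefly."], params)
print(outputs[0].outputs[0].text)
```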
.@RedHat AI Innovation team just dropped a new research paper on inference-time scaling! 🚨 All built on @vllm_project. Paper and code here: Cheers to paper authors @variational_i, @xukai92, @GX_NLP, Shivchander Sudalairaj, and @ishapuri101!
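For flavor only, a simplified best-of-N sketch of inference-time scaling on top of vLLM. This is not the paper's probabilistic (particle-based) method; it just shows the general pattern of sampling several candidates and keeping the best-scoring one. The model name and scoring function are placeholders.

```python
# Simplified best-of-N sketch of inference-time scaling with vLLM.
# NOT the paper's particle-based probabilistic inference method; this only
# illustrates spending extra inference compute on multiple samples.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # assumed small model
params = SamplingParams(temperature=0.8, n=8, max_tokens=256)  # 8 candidates

def score(text: str) -> float:
    # Placeholder verifier: a real setup would call a reward model here.
    return float(len(text.split()))

result = llm.generate(["Solve: what is 17 * 23?"], params)[0]
best = max(result.outputs, key=lambda o: score(o.text))
print(best.text)
```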
@NVIDIAAIDev @_philschmid we'd love your opinion on our recent findings, given that you posted a detailed take on a similar post from October 2024:
RT @_EldarKurtic: How well do quantized models handle long-context tasks? When we released the "Give Me BF16 or Give Me Death?" paper, the…
RT @mgoin_: Come learn how optimal multimodal inference is achieved in @vllm_project with an architecture deep-dive this Thursday! https://…
How does vLLM v1 enhance multimodal LLM support? Join our office hours with @rogerw0108 (Sr. ML Engineer @Roblox) to learn about architectural changes, caching improvements, benchmarks, + more! @mgoin_ will also share a v1 update! 📅 Feb 6 | 2PM ET 🔗
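As background for the session, a minimal sketch of multimodal inference with vLLM's offline API. The model name, prompt template, and image path below are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: image + text inference with vLLM's offline API, in the
# spirit of the v1 multimodal support discussed in these office hours.
# Model name, prompt template, and image path are illustrative assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")  # assumed vision-language model
image = Image.open("example.jpg")             # any local image

prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```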
RT @vllm_project: We landed the 1st batch of enhancements to the @deepseek_ai models, starting MLA and cutlass fp8 kernels. Compared to v0.…
Want the full breakdown? Check out the blog post for all the details:
Try the models on @huggingface:
Join our upcoming vLLM office hours to learn more:
🙏 @shubhrapandit, Alex Marques, @markurtz_ 🙏