Haihao Shen

@HaihaoShen

Followers: 2,944
Following: 2,710
Media: 40
Statuses: 436

Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own

Shanghai
Joined September 2021
Pinned Tweet
@HaihaoShen
Haihao Shen
1 day
🔥Great to share with you that the @HabanaLabs Gaudi accelerator is now supported in AutoGPTQ. 🎯PR: Congrats to Danny and the Habana team! Thanks to fxmarty! ⏰Intel Neural Compressor will fully support model compression for Gaudi soon. Stay tuned!
0
10
41
@HaihaoShen
Haihao Shen
8 months
🔥Excited to share our NeurIPS'23 paper on efficient LLM inference on CPUs! Compatible with GGML, yet with up to 1.5x better performance than llama.cpp! 📢Paper: 📕Code: #oneapi @intel @huggingface @_akhaliq @MosheWasserblat
7
108
659
@HaihaoShen
Haihao Shen
6 months
📢Just change the model name, and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization! 🎯Code: , supporting Mistral, Llama2, Mixtral-MoE, Phi2, Solar, and most recent LLMs. 🤗
4
58
321
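One reason low-bit checkpoints are so compact: two 4-bit quantized weights fit in a single byte. This is a minimal stdlib-only sketch of nibble packing, not Intel's actual kernel code; the function names are illustrative.

```python
def pack_int4(values):
    """Pack a list of 4-bit integers (0..15) into bytes, two per byte."""
    assert all(0 <= v <= 15 for v in values)
    if len(values) % 2:                  # pad to an even count
        values = values + [0]
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append(lo | (hi << 4))    # low nibble first
    return bytes(packed)

def unpack_int4(packed, count):
    """Recover `count` 4-bit integers from packed bytes."""
    values = []
    for b in packed:
        values.append(b & 0x0F)
        values.append(b >> 4)
    return values[:count]

weights = [3, 15, 0, 7, 9, 1, 12, 5]
packed = pack_int4(weights)
assert len(packed) == len(weights) // 2          # half the bytes of int8
assert unpack_int4(packed, len(weights)) == weights
```

Compared with FP32 storage (4 bytes per weight), this layout is 8x smaller before scales and zero-points are added.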
@HaihaoShen
Haihao Shen
1 year
🎯We released GPT-J-6B INT8 ONNX models (a first for INT8 ONNX LLMs❓) with ~4x model size reduction while preserving ~99.9% of the FP32 baseline accuracy. 🔥The GPT-J-6B INT8 models are now publicly available on the Hugging Face model hub!
6
47
280
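The ~4x size reduction follows directly from the data types: symmetric per-tensor INT8 quantization replaces each 4-byte FP32 weight with one signed byte plus a single shared scale. A minimal sketch of that idea (illustrative, not the INC implementation):

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: one FP32 scale, int8 values."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.77, -0.99]
q, scale = quantize_int8(weights)
assert all(-127 <= v <= 127 for v in q)      # each value fits in one byte
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2                  # error bounded by half a step
```

The bounded round-trip error is why well-calibrated INT8 models can stay within a fraction of a percent of the FP32 baseline.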
@HaihaoShen
Haihao Shen
8 months
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x over llama.cpp on an Intel i7-12900! 🎯Code: 📢Chatbot demo on PC: ; Hugging Face Space demo locally: #oneapi @intel @huggingface @_akhaliq @Gradio
9
56
274
@HaihaoShen
Haihao Shen
2 months
🔥Want the best low-bit LLM? Yes, we released a dedicated low-bit open LLM leaderboard for AIPC: , inspired by the @huggingface LLM leaderboard! #intelai #inc #GPTQ #AWQ #GGUF @humaabidi @lvkaokao @_akhaliq @ollama @martin_casado @jeremyphoward
11
79
258
@HaihaoShen
Haihao Shen
6 months
📢Thrilled to announce Intel Extension for Transformers v1.3 is released, featuring 1) efficient low-bit inference and fine-tuning, and 2) an improved open-source chatbot framework, Neural Chat. 👨‍💻Notes: 🤗Code: Merry X'mas and Happy New Year!
2
40
180
@HaihaoShen
Haihao Shen
6 months
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms, starting with CPUs🚀 🙌Don't hesitate to give it a try. Sample code below👇 🎯Project: #iamintel #intelai @intel @huggingface
5
41
232
@HaihaoShen
Haihao Shen
3 months
⚡️AutoRound, a new SOTA low-bit LLM quantization approach developed by the Intel Neural Compressor team () 🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details: @huggingface #IAmIntel
4
53
217
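The core intuition behind learned-rounding methods like SignRound/AutoRound: instead of always rounding each weight to the nearest grid point, choose the rounding direction per weight that best preserves the layer's *output* on calibration data. The real algorithm learns this with gradients; this toy sketch (all names and numbers are made up for illustration) brute-forces it for three weights:

```python
import itertools, math

x = [0.9, -0.4, 1.3]                   # one calibration input
w = [0.34, -1.27, 0.81]                # FP weights
scale = 0.1                            # quantization step

ref = sum(wi * xi for wi, xi in zip(w, x))   # FP layer output

def output_error(rounded):
    """Error of the quantized layer output vs the FP reference."""
    wq = [r * scale for r in rounded]
    return abs(sum(wi * xi for wi, xi in zip(wq, x)) - ref)

nearest = [round(wi / scale) for wi in w]    # plain round-to-nearest
# try every per-weight up/down rounding and keep the best for the OUTPUT
best = min(
    (tuple(math.floor(wi / scale) + up for wi, up in zip(w, ups))
     for ups in itertools.product([0, 1], repeat=len(w))),
    key=output_error,
)
assert output_error(list(best)) <= output_error(nearest)
```

Round-to-nearest is optimal per weight in isolation, but not for the layer output, which is exactly the gap these methods exploit.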
@HaihaoShen
Haihao Shen
9 months
🔥INT4 whisper family models are out! Powered by Intel Extension for Transformers and INC! @YuwenZhou167648 @mengniwa @huggingface
8
41
207
@HaihaoShen
Haihao Shen
7 months
🚀 NeuralChat-7B-v3-1 continues to rank #1 on the @huggingface open 7B LLM leaderboard! Even the INT8 model ranks #3!! 🤗Check out the leaderboard: 🥇Model: #iamintel #intelai #oneapi @intelai @lvkaokao
4
33
203
@HaihaoShen
Haihao Shen
8 months
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbots, co-pilot assistants, ...) 📷Location: Shanghai (or working remotely in PRC) 🎯Project: If you are interested, DM me your resume.😀
4
30
199
@HaihaoShen
Haihao Shen
7 months
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, and collaborators! Love you all!! 🔥We released QLoRA for CPU, to help you fine-tune LLMs on your laptop! See below👇 📢Code: #deeplearning #intelai #GenAI @intel @huggingface
3
42
194
@HaihaoShen
Haihao Shen
8 months
🔥Want to quantize a 100B+ model on your laptop with 16GB memory? Hmmm, GPTQ does not work... 🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓ 📕Blog: #oneapi @intel @huggingface
8
36
197
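The trick that makes a 100B+ model fit in 16GB is that only one layer's full-precision weights need to be in memory at a time: load a layer, quantize it, free it, move on. A conceptual stdlib-only sketch of that loop; the function names are illustrative, not INC's API:

```python
def load_layer(i):
    # stand-in for streaming one layer's FP32 weights from disk
    return [0.1 * i + 0.01 * j for j in range(4)]

def quantize_layer(weights, scale=0.05):
    return [round(w / scale) for w in weights], scale

quantized_model, live, peak = [], 0, 0
for i in range(6):                       # a toy "6-layer" model
    live += 1                            # one layer's FP weights materialized
    peak = max(peak, live)
    quantized_model.append(quantize_layer(load_layer(i)))
    live -= 1                            # FP weights freed before the next layer

assert len(quantized_model) == 6
assert peak == 1                         # never more than one FP layer in memory
```

Peak memory therefore scales with the largest layer, not the whole model, which is what makes laptop-scale quantization of huge models feasible.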
@HaihaoShen
Haihao Shen
3 months
🚀Sharing a nice blog, "llama.cpp + Intel GPUs". Congrats to the awesome team, especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to @ggerganov for the great support. 📢Check out the blog: 🎯WIP with ollama now #iamintel #llama @ollama
2
47
190
@HaihaoShen
Haihao Shen
8 months
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs! 🔥Code: 📕Doc: #oneapi @intel @huggingface @Guangxuan_Xiao
1
39
184
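What makes StreamingLLM-style generation "infinite" is its cache policy: keep the first few "attention sink" tokens plus a sliding window of recent tokens, so the KV cache stays bounded however long generation runs. A minimal sketch of just that eviction rule (the real implementation manages per-layer KV tensors and RoPE positions, not token ids):

```python
def streaming_cache(tokens, n_sink=4, window=8):
    """Keep the first n_sink tokens plus the last `window` tokens."""
    if len(tokens) <= n_sink + window:
        return tokens
    return tokens[:n_sink] + tokens[-window:]

stream = list(range(100))              # 100 generated token ids
cache = streaming_cache(stream)
assert len(cache) == 12                # bounded: 4 sinks + 8 recent
assert cache[:4] == [0, 1, 2, 3]       # sink tokens always retained
assert cache[-1] == 99                 # newest token retained
```

The "shift-RoPE-K" part of the tweet addresses the follow-on problem: after eviction, cached keys must be re-positioned so rotary embeddings stay consistent within the window.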
@HaihaoShen
Haihao Shen
5 months
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc A770! PR: Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks @ggerganov for the review! Transformer-like API coming soon @RajaXg
5
39
180
@HaihaoShen
Haihao Shen
11 days
🔥Want to quantize LLMs with the best accuracy & smallest size? Intel Neural Compressor is your choice. We just released v2.6, featuring a SOTA LLM quantizer that outperforms GPTQ/AWQ on typical LLMs. 🎯Quantized LLM leaderboard: Github:
1
38
184
@HaihaoShen
Haihao Shen
6 months
🤗Intel Extension for Transformers runs microsoft/phi-2 smoothly on a laptop (faster than human reading speed🚀). Sample code👇 🎯Code: . Try it and have fun! 🎁DM me your favorite LLM. Next will be Solar :) #iamintel #intelai @intel @huggingface @murilocurti
4
26
176
@HaihaoShen
Haihao Shen
1 month
🎯Intel Extension for Transformers (powered by Neural Speed) now supports GGUF with an API compatible with Hugging Face Transformers, yet blazing fast (up to 50x?) for LLM inference on AIPC (even on CPU cores). 🔥Repo: (PS: some friends call it the "open-source Groq")
8
42
164
@HaihaoShen
Haihao Shen
7 months
🔥Excited to share a nice blog from @andysingal about the top-performing 7B LLM from Intel, NeuralChat-v3-1: . Check out the blog and give the model a try! ⚡️ #IAmIntel #intelai @intel @huggingface
5
25
168
@HaihaoShen
Haihao Shen
7 months
📢Just created an open-source project dedicated to speeding up LLMs 🌟Project: 🤗Looking forward to your suggestions; let me know which topics you are interested in and want to see. #LLM @intel @huggingface
4
26
168
@HaihaoShen
Haihao Shen
7 months
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB~10TB (though not open-sourced yet), even on consumer HW! 📕Instructions: 🌟Project: #oneapi @intel @huggingface #LLM
1
24
165
@HaihaoShen
Haihao Shen
5 months
🚀Intel Extension for Transformers now accelerates GGUF models! GGUF is the new format introduced by llama.cpp🎆 🤗Project: #intelai #itrex #inc #gguf @intel @huggingface
1
34
163
@HaihaoShen
Haihao Shen
3 months
🚀Thrilled to announce that Neural Speed v1.0 alpha is released, with highly optimized INT4 kernels and blazing-fast LLM inference on CPUs! 🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ @casper_hansen_ and AutoGPTQ 📔 Blog: 🔥
6
32
159
@HaihaoShen
Haihao Shen
7 months
🎯Excited to share another NeurIPS'23 paper titled "Effective Quantization for Diffusion Models on CPUs"! Congrats to all the collaborators! 🚀Code: 📜Paper: #iamintel #intelai @intel @huggingface @_akhaliq
1
35
151
@HaihaoShen
Haihao Shen
6 months
🎁Thrilled to share that Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for LLM quantization/compression, helping to bring AI everywhere. 👨‍💻Release notes: 🚀Code: #iamintel #intelai #oneapi
1
30
149
@HaihaoShen
Haihao Shen
5 months
🎯 #1 INT4 LLM algorithm: AutoRound, invented by @intel , showing SOTA accuracy on Mixtral-8x7B, Phi2, NeuralChat ... 🚀 #1 INT4 LLM inference: Intel Extension for Transformers, running efficiently on Intel devices 🌟 🤗
6
26
147
@HaihaoShen
Haihao Shen
6 months
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day of 2023! 📔Release notes: 🎯Code: 🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
1
22
138
@HaihaoShen
Haihao Shen
4 months
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME). 📔Paper: 🎯Sample code: 💣The technology behind it looks interesting and useful, and should work together with SFT and RAG to reduce hallucination!
3
26
126
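The "rank-one" in ROME refers to the shape of the edit: a linear map W is updated as W' = W + Δ⊗k/(k·k), so the edited map sends a chosen key vector k to a new target value while leaving W x unchanged for any x orthogonal to k. A tiny self-contained sketch of that algebra (illustrative only; ROME additionally locates which MLP layer and key to edit):

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rank_one_edit(W, k, target):
    """Return W' = W + delta * k^T / (k . k) so that W' k == target."""
    kk = sum(ki * ki for ki in k)
    delta = [t - o for t, o in zip(target, matvec(W, k))]
    return [[wij + (di * kj) / kk for wij, kj in zip(row, k)]
            for row, di in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]                         # "key" direction encoding the fact
target = [2.0, 3.0]                    # new association for that key
W2 = rank_one_edit(W, k, target)
assert matvec(W2, k) == [2.0, 3.0]             # edited fact now holds
assert matvec(W2, [0.0, 1.0]) == [0.0, 1.0]    # orthogonal inputs unchanged
```

That orthogonality property is why a rank-one edit can rewrite one association without retraining the rest of the model.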
@HaihaoShen
Haihao Shen
5 months
🎯Quantization + speculative decoding shows significant speedups, up to 7.3x on Xeon, using Intel AI SW: 📢IPEX: ITREX: 🤗Blog: Congrats to the @IntelAI and @huggingface teams! @MosheWasserblat @humaabidi
2
26
128
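Speculative decoding composes naturally with quantization: a cheap (e.g., quantized) draft model proposes several tokens at once, and the large target model verifies them in a single pass, keeping the longest agreeing prefix. This toy sketch replaces both models with deterministic stand-in functions just to show the propose/verify loop; none of it is the IPEX/ITREX API:

```python
def draft_model(prefix):               # fast but imperfect stand-in
    return (len(prefix) * 7) % 10

def target_model(prefix):              # slow but authoritative stand-in
    return (len(prefix) * 7 + (len(prefix) % 5 == 4)) % 10

def speculative_step(prefix, k=4):
    """Draft k tokens, then let the target accept the agreeing prefix."""
    proposal, p = [], list(prefix)
    for _ in range(k):
        t = draft_model(p)
        proposal.append(t)
        p.append(t)
    accepted, p = [], list(prefix)
    for t in proposal:                 # one target "pass" verifies all drafts
        if target_model(p) != t:
            accepted.append(target_model(p))   # correct the mismatch and stop
            break
        accepted.append(t)
        p.append(t)
    return prefix + accepted

out = speculative_step([1, 2, 3], k=4)
assert len(out) > len([1, 2, 3])       # always emits at least one token
```

When the draft agrees often, each expensive target pass yields several tokens instead of one, which is where the speedup comes from.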
@HaihaoShen
Haihao Shen
3 months
🤗NeuralChat beats GPT-4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by @vectara . 📢RL/DPO is becoming so important for improving model quality, particularly for responsible AI. 🎯Code to fine-tune NeuralChat:
5
23
123
@HaihaoShen
Haihao Shen
22 days
🔥Zero accuracy loss for the INT4 model, even compared with the FP16 model. @hunkims See the recipes published in the model card, or reach out to us for the quantized models👇 🎯We will be releasing an improved version of the low-bit quantized LLM leaderboard with new models on 6/6. Stay tuned!
3
18
123
@HaihaoShen
Haihao Shen
2 months
💕We love open source and contributed Intel Neural Compressor () to the ONNX community. It is now available as a quantization tool for ONNX models. 🎯Give it a try and share your feedback with us! @NoWayYesWei @humaabidi @melisevers @arungupta
0
25
119
@HaihaoShen
Haihao Shen
6 months
🤗Neural Speed now supports GGUF (used in llama.cpp)! 📢Neural Speed is an innovation library, a sibling project of Intel Neural Compressor. 🎯Neural Compressor🔚Algorithm + Accuracy 🚀Neural Speed 🔚 Kernel + Performance 🌟
3
21
119
@HaihaoShen
Haihao Shen
8 months
🔥Want Intel-enhanced llama.cpp? Yes, up to 15x on first-token generation and 1.5x on subsequent tokens on Intel's latest Xeon Scalable Processor (SPR). 📕Blog: Code: #oneapi @intel @huggingface @_akhaliq @llama @llama_index
3
29
113
@HaihaoShen
Haihao Shen
3 months
🔥All you need for INT4 LLMs is Intel Neural Compressor (INC). INC v2.5 is released with SOTA INT4 LLM quantization (AutoRound) across platforms incl. Intel Gaudi2, Xeon, and GPU. 🎯Models: Llama2, Mistral, Mixtral-MoE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
2
17
116
@HaihaoShen
Haihao Shen
5 months
🎯The embedding model is super important for a RAG system. Here is a tutorial showing how to tune BAAI/bge-base for high performance. 📔 💣 We extended LangChain to load the optimized embedding model and improved inference on Intel platforms.
1
17
114
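Why the embedding model matters so much: RAG retrieval is usually just nearest-neighbor search over embedding vectors, so retrieval quality is only as good as the vectors. A minimal cosine-similarity retriever over toy hand-made vectors; in practice the vectors would come from a model like BAAI/bge-base, and the document names here are invented:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

corpus = {                              # toy "document embeddings"
    "quantization doc": [0.9, 0.1, 0.0],
    "fine-tuning doc":  [0.1, 0.9, 0.1],
    "hardware doc":     [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=1):
    """Return the k corpus documents most similar to the query vector."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]),
                    reverse=True)
    return ranked[:k]

assert retrieve([0.8, 0.2, 0.1]) == ["quantization doc"]
```

A better-tuned embedding model shifts these vectors so that semantically related query/document pairs score higher, which directly improves what the chatbot gets to read.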
@HaihaoShen
Haihao Shen
4 months
㊗️Our paper on FP8 recipes has been accepted to MLSys'24. Congrats to all the collaborators @navikm Xin, Qun, Chang, and Mengni! 🤗Paper: 🎯Code:
4
17
111
@HaihaoShen
Haihao Shen
6 months
📢More Intel NeuralChat-v3 7B LLMs are released, with more technical details published in the blog👇 🎯Blog: 🙌Welcome to use @intel NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SW. #iamintel #intelai @huggingface
7
16
108
@HaihaoShen
Haihao Shen
4 months
🎯A high-performance INT4 Mistral-7B model is available on @huggingface , quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and run efficiently by Intel Extension for Transformers! 🤗 Model: 🌟,
6
25
103
@HaihaoShen
Haihao Shen
28 days
🔥Want to run GGUF models faster on Intel platforms? Here you go: all you need is Intel Extension for Transformers, with up to a 7x performance boost and 7x smaller model size! Looking forward to your early feedback! 🎯
3
27
107
@HaihaoShen
Haihao Shen
2 months
🎯Meta launched Llama3. See how well it works across Gaudi, Xeon, GPU, and AIPC! Check out the blog: 🔥Happy to share that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
3
28
107
@HaihaoShen
Haihao Shen
3 months
🔥MLPerf Inference v4.0 results are out! 1⃣The only CPU able to achieve 99.9% accuracy 2⃣1.8x perf speedup over the last submission 3⃣Summarizes a news article per second in real time 📘Blog: 🎯Code for MLPerf GPT-J: #MLPerf #IAmIntel
0
17
100
@HaihaoShen
Haihao Shen
1 year
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel. 🧨5,000+ Transformer models quantized automatically 🔥Neural Coder demo on Hugging Face Spaces: . ⭐️Check it out and give it a try! @ellacharlaix @jeffboudier @_akhaliq
1
25
101
@HaihaoShen
Haihao Shen
8 months
❓Fine-tuning or RAG? Not sure how to choose? 🎯Fine-tuning is not the only way to make your LLM smarter; you can also try RAG. Here are the recommendations and examples: 📢Reproducible through Intel Extension for Transformers: 🚀
4
15
101
@HaihaoShen
Haihao Shen
2 years
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
3
24
96
@HaihaoShen
Haihao Shen
6 months
📢Slimmed BGE embedding models are coming, shortly after the quantized ones. More importantly, slimming and quantization can be combined! 🎁 Private RAG-based chatbots on client devices are more accessible! 👨‍💻 🎯 #intelai #NeuralChat
0
18
96
@HaihaoShen
Haihao Shen
2 years
🎯Happy to announce that the source code and examples of "Fast DistilBERT on CPU" (accepted as a NeurIPS'22 paper) were released: 🧨Included in Top NLP Papers Nov'22 by @CohereAI and highlighted as "Fast Transformers on CPUs with SOTA performance" by @Synced_Global !
0
9
93
@HaihaoShen
Haihao Shen
9 months
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators! 🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
1
22
93
@HaihaoShen
Haihao Shen
3 months
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! A lot of great partners👍 🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, and LangChain, and by technology from Intel Extension for Transformers.
1
19
90
@HaihaoShen
Haihao Shen
6 months
👨‍💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video. 🎯Features: 1) runs on your PC for copilot chat, so it's 100% free and safe; 2) runs on a server for code generation, so it may generate better code; 3) smart model switching. VS plugin is coming🚀 #intelai @intel
2
16
85
@HaihaoShen
Haihao Shen
4 months
🩷A memorable day: Intel Neural Compressor and Intel Extension for Transformers crossed! A baby Neural Speed is on board!!🌟
0
6
81
@HaihaoShen
Haihao Shen
9 months
🔥Happy to publish the code of SignRound (a leading INT4 quantization method): 📕Paper: 👉Code: 📢Leave a star if you find it useful.
0
22
78
@HaihaoShen
Haihao Shen
7 months
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see @intelai -created LLMs & datasets being welcomed, used, discussed, and improved. Go, Intel LLMs!
@IntelAI
Intel AI
7 months
Congrats to Intel team members Haihao Shen and Kaokao Lv for their fine-tuned version of Mistral 7B having hit the top of the list on the @huggingface LLM leaderboard last week: Fine-tuned on 8x Intel Gaudi2 Accelerators.
2
12
119
3
14
73
@HaihaoShen
Haihao Shen
3 months
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice: 🎯Sharing our MLSys'24 camera-ready paper: Efficient Post-Training Quantization with FP8 Formats 🤗 @_akhaliq @navikm @huggingface #IAmIntel
0
18
75
@HaihaoShen
Haihao Shen
4 months
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor, with the HW architecture and data types proposed by MSFT and defined by OCP. 📢Here is a tutorial: with source code publicly available in
0
11
69
@HaihaoShen
Haihao Shen
3 months
🌟Happy to announce Intel Extension for Transformers v1.4 is released, with a lot of improvements for building GenAI applications on Intel platforms! 🎯Check out the release notes: 🤗 @intel + @huggingface = one of the best GenAI platforms
0
10
74
@HaihaoShen
Haihao Shen
6 months
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers! @upstageai @hunkims . INT4 inference is available with a one-parameter change from "load_in_8bit" to "load_in_4bit". 📢Next up will be Zephyr🙌 👇Check out the sample code and give it a try!
0
13
72
@HaihaoShen
Haihao Shen
5 months
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details: 🤗Code: . Star the project if you find it useful. 🌟Happy Chinese New Year! 🎇
0
14
67
@HaihaoShen
Haihao Shen
3 months
👨‍💻2023 was the year of open LLMs. Is it time to make predictions for 2024? DM me your thoughts. 📢Re-sharing the blog from @clefourrier : , incl. Intel NeuralChat-7B and the DPO dataset😀 🤗We hope to contribute more to the open-source LLM community in 2024! #iamintel @huggingface
4
12
66
@HaihaoShen
Haihao Shen
4 months
🔥Happy to announce Intel Extension for Transformers v1.3.2 is released 📔Release notes: 🎯Highlights: enables popular serving frameworks, e.g., @huggingface TGI, vLLM, and Triton, to build highly efficient chatbots on Intel platforms such as Gaudi2 with a few lines of code
0
9
65
@HaihaoShen
Haihao Shen
16 days
🆕Thrilled to share that highly efficient 4-bit kernels have been integrated into AutoAWQ. Congrats Penghui, and thanks @casper_hansen_ for the review! Now you can run AutoAWQ super fast on CPU with PR: 🔥Base code from
1
15
64
@HaihaoShen
Haihao Shen
6 months
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!! 🎯The copilot is built on top of . The code/extension will be released soon. Stay tuned!🚀
0
8
64
@HaihaoShen
Haihao Shen
26 days
🔥Thrilled to announce OPEA v0.6 is released, featuring 10 micro-service components (e.g., embedding, LLMs), 4 GenAI examples, and 1-click K8s deployment. 🎯Github: Give it a try and create a GenAI app with your private data! @NoWayYesWei @humaabidi @melisevers
2
11
62
@HaihaoShen
Haihao Shen
4 months
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and further accelerate LLM inference🚀
@StabilityAI
Stability AI
4 months
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of @Intel Gaudi 2 accelerators versus @Nvidia 's A100 and H100 for two of our models. (1/3)
17
62
305
1
12
60
@HaihaoShen
Haihao Shen
2 months
🔥If you want the best 4-bit models, e.g., Phi-3, Mistral, or Solar, Intel Neural Compressor with AutoRound is your choice!🌟
@HaihaoShen
Haihao Shen
2 months
⚡️Leaderboard snapshot on 2024/5/10. INC-quantized models using AutoRound () outperform the models (publicly available on Hugging Face) quantized by other popular approaches such as GPTQ, AWQ, GGUF, etc. Pick the best INT4 model for your use!
1
5
39
1
15
55
@HaihaoShen
Haihao Shen
8 months
📢We are hiring full-time interns for efficient LLM inference. 🔥Group: Intel/DCAI/AISE 🎯Location: Shanghai, Zizhu 😀Projects: * INC: * ITREX: If you are interested in LLM compression and inference, DM me your resume.😀
3
11
52