Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own
🔥Great to share with you that
@HabanaLabs
Gaudi accelerators are now supported in AutoGPTQ.
🎯PR: Congrats Danny and Habana team! Thanks to fxmarty!
⏰Intel Neural Compressor will fully support model compression for Gaudi soon. Stay tuned!
🧩No GPU but wanna create your own LLM on your laptop?
🎁Here is a gift for you: QLoRA on CPU, making LLM fine-tuning on client CPUs possible! Just give it a try.
📔Blog: Kudos to ITREX team!
🎯Code:
#IAmIntel
#intelai
@intel
@huggingface
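The idea behind QLoRA can be sketched in a few lines of NumPy (a toy illustration of the concept, not the ITREX API; all names below are made up for the sketch):

```python
import numpy as np

# Toy QLoRA forward pass (illustrative sketch, not the actual ITREX API).
# The base weight W is frozen and stored in 4-bit; only the small
# low-rank adapters A and B are trained in full precision.

def quantize_4bit(w):
    """Absmax quantization to 16 uniform levels (symmetric INT4)."""
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
d, r = 8, 2                                    # hidden size, LoRA rank
w = rng.normal(size=(d, d)).astype(np.float32)
q, s = quantize_4bit(w)                        # frozen 4-bit base weight

a = 0.01 * rng.normal(size=(r, d)).astype(np.float32)  # trainable
b = np.zeros((d, r), dtype=np.float32)                 # trainable, init to 0

x = rng.normal(size=(d,)).astype(np.float32)
y = dequantize(q, s) @ x + b @ (a @ x)         # adapter delta starts at zero
```

Since the base weight never receives gradients, optimizer state only exists for the tiny `a` and `b` matrices; that is what makes fine-tuning feasible on a client CPU.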
📢Just change the model name and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization!
🎯Code: , supporting Mistral, Llama2, Mixtral-MOE, Phi2, Solar, and the most recent LLMs.
🤗
🎯We released GPT-J-6B INT8 ONNX models (first time for an INT8 ONNX LLM❓) with ~4x model size reduction while preserving ~99.9% accuracy of the FP32 baseline.
🔥GPT-J-6B INT8 models are now publicly available at Hugging Face model hub!
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x speedup over llama.cpp on Intel i7-12900!
🎯Code:
📢Chatbot demo on PC: ; Hugging Face space demo locally:
#oneapi
@intel
@huggingface
@_akhaliq
@Gradio
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms! Starting with CPUs🚀
🙌Don't hesitate to give it a try. Sample code below👇
🎯Project:
#iamintel
#intelai
@intel
@huggingface
⚡️AutoRound, a new SOTA LLM low-bit quantization approach developed by the Intel Neural Compressor team ()
🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details:
@huggingface
#IAmIntel
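As a rough intuition (a toy sketch only, not the actual AutoRound algorithm): learned-rounding methods optimize a small per-weight rounding offset with signed gradient descent instead of always rounding to the nearest level.

```python
import numpy as np

# Toy sketch of learned rounding (illustrative only, not the real
# AutoRound implementation): learn a per-weight offset in [-0.5, 0.5]
# that nudges round() toward lower layer-output error.

def quantize(w, scale, offset):
    return np.clip(np.round(w / scale + offset), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
x = rng.normal(size=(4,)).astype(np.float32)
scale = np.abs(w).max() / 7
offset = np.zeros_like(w)

ref = w @ x                                    # FP32 layer output
for _ in range(50):
    err = quantize(w, scale, offset) @ x - ref
    # straight-through estimate of d(err^2)/d(offset), sign only
    grad_sign = np.sign(np.outer(err, x))
    offset = np.clip(offset - 0.01 * grad_sign, -0.5, 0.5)
```

The real method tunes the offsets (and scales) against calibration data per layer; the point of the sketch is only that rounding becomes a learnable decision rather than a fixed nearest-value rule.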
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbot, co-pilot assistant, ...)
📷Location: Shanghai (or working remote in PRC)
🎯Project:
If you're interested, DM me your resume.😀
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, collaborators! Love you all!!
🔥We released QLoRA for CPU to help you fine-tune LLMs on your laptop! See below👇
📢Code:
#deeplearning
#intelai
#GenAI
@intel
@huggingface
🔥Want to quantize a 100B+ model on a laptop with 16GB memory? Hmmm, GPTQ does not work...
🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓
📕Blog:
#oneapi
@intel
@huggingface
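The trick that makes this fit in 16GB can be sketched as follows (a conceptual NumPy illustration, not the Neural Compressor API): only one layer's FP32 weights live in memory at a time.

```python
import numpy as np

# Minimal sketch of the layer-wise idea (illustrative, not the INC API):
# instead of materializing the whole model, load, quantize, and free
# one layer at a time, so peak memory is roughly one layer's weights.

def quantize_layer(w, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, np.float32(scale)

def layer_wise_quantize(load_layer, num_layers):
    quantized = []
    for i in range(num_layers):
        w = load_layer(i)        # load a single layer from disk
        quantized.append(quantize_layer(w))
        del w                    # free the FP32 weights before the next layer
    return quantized

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)).astype(np.float32) for _ in range(3)]
result = layer_wise_quantize(lambda i: layers[i], num_layers=3)
```

Because peak memory scales with one layer instead of the whole model, even a 100B+ parameter model becomes quantizable on a laptop.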
🚀Share with you a nice blog "llama.cpp + Intel GPUs". Congrats to the awesome team especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to
@ggerganov
for your great support.
📢Check out the blog:
🎯WIP with ollama now
#iamintel
#llama
@ollama
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs!
🔥Code:
📕Doc:
#oneapi
@intel
@huggingface
@Guangxuan_Xiao
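The core cache policy behind StreamingLLM can be sketched in plain Python (an illustration of the paper's idea, not the ITREX implementation): keep a few initial "attention sink" tokens plus a sliding window of recent ones.

```python
# Conceptual sketch of the StreamingLLM cache policy (illustrative only):
# keep the first few "attention sink" tokens plus a sliding window of
# recent tokens, evicting everything in between to bound cache size.

def evict_kv_cache(cache, num_sink=4, window=8):
    """Return the cached token positions kept after eviction."""
    if len(cache) <= num_sink + window:
        return cache
    return cache[:num_sink] + cache[-window:]

positions = list(range(20))      # 20 cached token positions
kept = evict_kv_cache(positions)
# sink tokens 0-3 survive, plus the 8 most recent tokens
```

Since the cache never grows past `num_sink + window` entries, generation length is effectively unbounded; the re-evaluation and shift-RoPE-K tricks mentioned above keep positional encodings consistent after eviction.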
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc A770!
PR:
Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks
@ggerganov
for the review! Transformer-like API soon in
@RajaXg
🔥Want to quantize LLMs with the best accuracy & smallest size? Intel Neural Compressor is your choice. We just released v2.6 featuring a SOTA LLM quantizer, outperforming GPTQ/AWQ on typical LLMs.
🎯Quantized LLM leaderboard:
Github:
🤗Intel Extension for Transformers enables running microsoft/phi-2 smoothly on a laptop (faster than human speed🚀). Sample code👇
🎯Code: . Try it and have fun!
🎁DM your favorite LLM. Next will be Solar :)
#iamintel
#intelai
@intel
@huggingface
@murilocurti
🎯Intel Extension for Transformers (powered by Neural Speed) now supports GGUF with an API compatible with Hugging Face Transformers, yet blazing fast (up to 50x?) LLM inference on AI PCs (even on CPU cores)
🔥Repo: (PS: some friends call it the "open-source Groq")
📢Just created an open-source project dedicated to speeding up LLMs
🌟Project:
🤗Looking forward to your suggestions; let me know which topics you're interested in and want to see.
#LLM
@intel
@huggingface
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB ~ 10TB (though not open-sourced yet), even on consumer HW!
📕Instruction:
🌟Project:
#oneapi
@intel
@huggingface
#LLM
🚀Thrilled to announce that NeuralSpeed v1.0 alpha is released! Highly optimized INT4 kernels and blazing fast LLM inference on CPUs!
🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ
@casper_hansen_
and AutoGPTQ
📔 Blog:
🔥
🚀Highly-efficient x86 INT4 kernels are now available in ONNX Runtime. Use Intel Neural Compressor to quantize LLMs and run efficiently with ONNX Runtime on Intel CPUs!
📔PR:
🎯Source of INT4 kernels:
#intelai
@intelai
@huggingface
🎁Thrilled to share Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for model quantization/compression for LLMs, helping to bring AI everywhere.
👨💻Release notes:
🚀Code:
#iamintel
#intelai
#oneapi
🚀Embedding is super fast on SPR! Just ~500 seconds for 1M samples (512 seq len/sample) using the Intel-optimized BGE model with INC and ITREX, making RAG more accessible!
📷Quick guide:
🎯
#iamintel
#intelai
@intelai
@huggingface
🔥Excited to share new BGE-base-v1.5 INT8 models with <1% accuracy loss from the FP32 baseline on the STS dataset (previously SST2)! BGE for RAG!!
🤗Model-1:
🤗Model-2:
🚀Code:
#oneapi
@IntelSoftware
@huggingface
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day in 2023!
📔Release notes:
🎯Code:
🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME).
📔Paper:
🎯Sample code:
💣The technology behind it looks interesting and useful, and it should work together with SFT and RAG to reduce hallucination!
📢Happy to share Intel Extension for Transformers v1.0 released:
🎯 NeuralChat, a custom chatbot on domain knowledge through Hugging Face PEFT. Now, you can create your own chatbot within 1 hour on CPUs.
@humaabidi
@MosheWasserblat
@jeffboudier
🎯When DeepSpeed meets Intel AI SWs, the performance magic happens!
🚀Accelerate Llama 2 inference on Xeon SPR by up to ~1.7x!
📔Blog:
🎁Intel AI SWs:
IPEX:
INC:
and
#oneapi
@intelai
@AIatMeta
@MSFTDeepSpeed
🤗NeuralChat beats GPT4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by
@vectara
.
📢RL/DPO is getting so important to improve the model quality, particularly for responsible AI.
🎯Code to fine-tune NeuralChat:
🔥Zero accuracy loss for the INT4 model, even compared with the FP16 model.
@hunkims
See the recipes published in the model card or reach us for the quantized models👇
🎯We will be releasing an improved version of low-bit quantized LLM leaderboard with new models on 6/6. Stay tuned!
📢Happy to share INT4 inference on
@intel
GPUs (e.g., PVC & Arc) is available in Intel Ext. for Transformers as an experimental feature (powered by IPEX)! More are coming!!
🎯Release notes:
🚀Code:
#intelai
#intelgpu
@huggingface
📢NeuralChat, an open chat framework created by
@intel
, now supports the
@huggingface
assisted generation to make chatbots more efficient on Intel platforms!
🎯Guide to deploy a chatbot:
🚀Code:
#iamintel
#intelai
Go, ITREX!
💕We love open source and contributed Intel Neural Compressor () to the ONNX community. Now it's available as a quantization tool for ONNX models.
🎯Give it a try and share your feedback!
@NoWayYesWei
@humaabidi
@melisevers
@arungupta
🔥All you need for INT4 LLMs is Intel Neural Compressor (INC). INC v2.5 released with SOTA INT4 LLM quantization (AutoRound) across platforms incl. Intel Gaudi2, Xeon, and GPU.
🎯Models: Llama2, Mistral, Mixtral-MOE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
🎯The embedding model is super important for RAG systems. Here is a tutorial showing how to tune BAAI/bge-base for high performance.
📔
💣 We extended LangChain to load the optimized embedding model and improve inference on Intel platforms.
📢More Intel NeuralChat-v3 7B LLMs are released, and more technical details are published in the blog👇
🎯Blog:
🙌Welcome to use
@intel
NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SWs.
#iamintel
#intelai
@huggingface
🎯High performance INT4 Mistral-7B model available on
@huggingface
, quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and served efficiently by Intel Extension for Transformers!
🤗 Model:
🌟,
🔥Want to run GGUF models faster on Intel platforms? Here you go - all you need is Intel Extension for Transformers: with up to 7x performance boost and 7x smaller model size! Looking forward to your early feedback! 🎯
🎯Meta launched Llama3. See how it works well across Gaudi, Xeon, GPU, and AIPC! Check out the blog:
🔥Happy to share with you that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
🔥MLPerf Inference v4.0 results are out!
1⃣The only CPU able to achieve 99.9% accuracy
2⃣1.8x perf speedup over last submission
3⃣Summarizes a news article per second in real time
📘Blog:
🎯Code for MLPerf GPT-J:
#MLPerf
#IAmIntel
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel.
🧨5,000+ Transformer models quantized automatically
🔥Neural Coder demo on Hugging Face Spaces: .
⭐️Check it out and give it a try!
@ellacharlaix
@jeffboudier
@_akhaliq
❓Fine-tuning or RAG? Not sure how to choose?
🎯Fine-tuning is not the only way to make your LLM smarter! You can also try RAG. Here are the recommendations and examples:
📢Reproducible through Intel Extension for Transformers: 🚀
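To make the comparison concrete, here is what the RAG side boils down to (a toy sketch with a bag-of-words stand-in for a real embedding model such as BGE; not the ITREX pipeline):

```python
import numpy as np

# Toy RAG retrieval sketch (illustrative only; a real system would use an
# embedding model such as BGE instead of this bag-of-words "embedder").

def tokenize(text):
    return [t.strip(".,?!").lower() for t in text.split()]

def embed(text, vocab):
    v = np.array([tokenize(text).count(w) for w in vocab], dtype=np.float32)
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query, docs, k=1):
    # cosine similarity over a shared vocabulary, highest first
    vocab = sorted({t for d in docs + [query] for t in tokenize(d)})
    q = embed(query, vocab)
    scores = [float(q @ embed(d, vocab)) for d in docs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

docs = ["Gaudi2 is an AI accelerator.", "Xeon is a server CPU."]
question = "Which chip is the AI accelerator?"
context = retrieve(question, docs)[0]
prompt = f"Context: {context}\nQuestion: {question}"   # fed to the LLM
```

Fine-tuning bakes knowledge into the weights; RAG keeps it in a retrievable store, which is cheaper to update and helps ground answers on private data.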
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
📢Slimmed BGE embedding models are coming, shortly after quantized ones. More importantly, slimming and quantization can be combined together!
🎁 Private RAG-based chatbots on client devices are now more accessible!
👨💻
🎯
#intelai
#NeuralChat
🎯Happy to announce the source code and examples of "Fast DistilBERT on CPU" (accepted as a NeurIPS'22 paper) were released:
🧨Included in Top NLP Papers Nov'22 by
@CohereAI
and highlighted as "Fast Transformers on CPUs with SOTA performance" by
@Synced_Global
!
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators!
🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! A lot of great partners👍
🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, LangChain and the technology from Intel Extension for Transformers.
👨💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video.
🎯Features: 1) run on your PC for copilot chat, so it's 100% free and safe; 2) run on server for code generation, so it may generate better code; 3) smart model switch. VS plugin is coming🚀
#intelai
@intel
💣Happy to announce INT4 NeuralChat-7B models available on
@huggingface
, powered by SOTA INT4 algorithm developed by Intel, yet compatible with AutoGPTQ!
🤗
🤗
📔Paper:
🎯Sample code:
📢INT4 GPTQ and RTN landed in ONNX Runtime through Intel Neural Compressor. AI on PC is coming!
📔PR: Thanks to Yuwen, Mengni, and Yufeng!
🌟Code:
#intelai
#onnxruntime
#neuralcompressor
🎁Happy to announce Intel Extension for Transformers supports INT8 quantization for MSFT Phi, making Phi inference more efficient and accessible than ever!
📔Quick guide:
🎯Code available:
#iamintel
#intelai
@intel
@huggingface
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see
@intelai
created LLMs & datasets being welcomed, used, discussed, and improved. Go, Intel LLMs!
Congrats to Intel team members Haihao Shen and Kaokao Lv for their fine-tuned version of Mistral 7B having hit the top of the list on the
@huggingface
LLM leaderboard last week:
Fine-tuned on 8x Intel Gaudi2 Accelerators.
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice:
🎯Shared with you our MLSys'24 camera-ready paper: Efficient Post-Training Quantization with FP8 Formats
🤗
@_akhaliq
@navikm
@huggingface
#IAmIntel
🎯Want to enable audio in your chatbot? It takes just a few minutes.
📕Here is a guide for you, including ASR, TTS, audio processing, audio streaming, multi-lang EN & CN:
📢Optimized code: with🤗models
#iamintel
#intelai
@intel
@huggingface
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor; the HW architecture and data types were proposed by MSFT and defined by OCP
📢Here is a tutorial: with source code publicly available in
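In a nutshell (a simplified NumPy sketch of the OCP microscaling idea, not the Neural Compressor recipes): each block of values shares one power-of-two scale, and elements are stored in a narrow format; int8 mantissas here stand in for the FP4/FP6/FP8 element types.

```python
import numpy as np

# Toy sketch of microscaling (MX) quantization (illustrative only):
# every block of `block` values shares a single power-of-two scale,
# while the per-element storage is a narrow format (int8 here).

def mx_quantize(x, block=8):
    x = x.reshape(-1, block)
    # shared power-of-two scale per block, covering the block's max
    scales = 2.0 ** np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    q = np.round(x / scales * 127).astype(np.int8)
    return q, scales

def mx_dequantize(q, scales):
    return q.astype(np.float32) / 127 * scales

x = np.linspace(-1.0, 1.0, 32, dtype=np.float32)
q, s = mx_quantize(x)
x_hat = mx_dequantize(q, s).reshape(-1)
```

The power-of-two scale keeps dequantization cheap (a shift, not a multiply, in hardware), which is part of why the format maps well onto accelerators.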
🌟Happy to announce Intel Extension for Transformers v1.4 released with a lot of improvements in building GenAI applications on Intel platforms!
🎯Check out the release notes:
🤗
@intel
+
@huggingface
= one of the best GenAI platforms
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers!
@upstageai
@hunkims
. INT4 inference is available with one parameter change from "load_in_8bit" to "load_in_4bit".
📢Next one will be Zephyr🙌
👇Check out the sample code and give it a try!
📢Intel Extension for Transformers () supports INT4 and low-bit inference on both CPUs and GPUs!
📔Simple usage guide:
🔥All you need is an Intel GPU to run LLMs
@huggingface
🤗
📢When AI meets cybersecurity, see how Intel NeuralChat LLM helps here. Happy to share a nice blog "Harnessing the Intel NeuralChat 7B Model for Advanced Fraud Detection". Congrats
@Saminusalisu
!
🎯Check out the details:
#intelai
#iamintel
@humaabidi
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details:
🤗Code: . Star the project if you find it useful.
🌟Happy Chinese New Year! 🎇
👨💻2023 was the year of open LLMs. Time to make predictions for 2024? DM your thoughts.
📢Re-share the blog from
@clefourrier
: , incl. Intel NeuralChat-7B and DPO dataset😀
🤗We hope to contribute more to open-source LLM community in 2024!
#iamintel
@huggingface
🔥Happy to announce Intel Extension for Transformers v1.3.2 released
📔Release notes:
🎯Highlights: enable popular serving e.g.,
@huggingface
TGI, vLLM, and Triton to build highly efficient chatbot on Intel platforms such as Gaudi2 with a few lines of code
🥳Happy to share with you the Intel optimizations
for Diffusers textual inversion and the fine-tuning demo of Stable Diffusion on Spaces!
👉 Intel optimizations:
🎯Spaces:
🤗Thanks to Patrick,
@anton_lozhkov
@_akhaliq
from HF!
🆕Thrilled to share with you that highly-efficient 4-bit kernels have been integrated into AutoAWQ. Congrats Penghui and thanks
@casper_hansen_
for the review! Now, you can run AutoAWQ super fast on CPU with PR:
🔥Base code from
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!!
🎯The copilot is built on top of . The code/ext will be released soon. Stay tuned!🚀
🔥Thrilled to announce OPEA v0.6 released, featuring 10 micro-service components (e.g., embedding, LLMs), 4 GenAI examples, and 1-click K8S deployment.
🎯Github: Give it a try and create a GenAI app with your private data!
@NoWayYesWei
@humaabidi
@melisevers
🤗Want to build an enterprise-grade RAG system? Efficient embedding is what you want. Here is a nice blog from Intel and
@huggingface
friends on "Intel Fast Embedding" with and
#IAmIntel
@MosheWasserblat
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and accelerate LLM inference further🚀
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of
@Intel
Gaudi 2 accelerators versus
@Nvidia
's A100 and H100 for two of our models. (1/3)
⚡️Leaderboard snapshot on 2024/5/10. INC-quantized models using AutoRound () outperform the models (publicly available on HuggingFace) quantized by other popular approaches such as GPTQ, AWQ, GGUF, etc. Pick the best INT4 model for your use!
📢We are hiring full-time interns for efficient LLM inference.
🔥Group: Intel/DCAI/AISE
🎯Location: Shanghai, Zizhu
😀Working projects:
* INC:
* ITREX:
If you are interested in LLM compression and inference, DM me your resume.😀