Creator of Intel Neural Compressor/Speed/Coder, Intel Ext. for Transformers, AutoRound; HF Optimum-Intel Maintainer; Founding member of OPEA; Opinions my own
🔥Great to share with you that
@HabanaLabs
Gaudi accelerators are now supported in AutoGPTQ.
🎯PR: Congrats Danny and Habana team! Thanks to fxmarty!
⏰Intel Neural Compressor will fully support model compression for Gaudi soon. Stay tuned!
🧩No GPU but wanna create your own LLM on your laptop?
🎁Here is a gift for you: QLoRA on CPU, making LLM fine-tuning on client CPUs possible! Just give it a try.
📔Blog: Kudos to ITREX team!
🎯Code:
#IAmIntel
#intelai
@intel
@huggingface
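The idea behind QLoRA can be sketched in a few lines of NumPy (a toy illustration of the concept, not the ITREX API; all names below are made up for the sketch):

```python
import numpy as np

# Toy QLoRA forward pass (illustrative sketch, not the actual ITREX API).
# The base weight W is frozen and stored in 4-bit; only the small
# low-rank adapters A and B are trained in full precision.

def quantize_4bit(w):
    """Absmax quantization to 16 uniform levels (symmetric INT4)."""
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
d, r = 8, 2                                    # hidden size, LoRA rank
w = rng.normal(size=(d, d)).astype(np.float32)
q, s = quantize_4bit(w)                        # frozen 4-bit base weight

a = 0.01 * rng.normal(size=(r, d)).astype(np.float32)  # trainable
b = np.zeros((d, r), dtype=np.float32)                 # trainable, init to 0

x = rng.normal(size=(d,)).astype(np.float32)
y = dequantize(q, s) @ x + b @ (a @ x)         # adapter delta starts at zero
```

Since the base weight never receives gradients, optimizer state only exists for the tiny `a` and `b` matrices; that is what makes fine-tuning feasible on a client CPU.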
📢Just change the model name and you can run LLMs blazingly fast on your PC using Intel Extension for Transformers, powered by SOTA low-bit quantization!
🎯Code: , supporting Mistral, Llama2, Mixtral-MOE, Phi2, Solar, and the most recent LLMs.
🤗
🎯We released GPT-J-6B INT8 ONNX models (first time for an INT8 ONNX LLM❓) with ~4x model size reduction while preserving ~99.9% accuracy of the FP32 baseline.
🔥GPT-J-6B INT8 models are now publicly available at Hugging Face model hub!
🚀Accelerate LLM inference on your laptop, again on CPU! Up to 4x speedup over llama.cpp on Intel i7-12900!
🎯Code:
📢Chatbot demo on PC: ; Hugging Face space demo locally:
#oneapi
@intel
@huggingface
@_akhaliq
@Gradio
🤗Intel Extension for Transformers supports Mixtral-8x7B with 8-bit and 4-bit inference optimizations on Intel platforms! Starting with CPUs🚀
🙌Don't hesitate to give it a try. Sample code below👇
🎯Project:
#iamintel
#intelai
@intel
@huggingface
⚡️AutoRound, a new SOTA LLM low-bit quantization approach developed by the Intel Neural Compressor team ()
🎯Lots of interesting comparisons with GPTQ, AWQ, HQQ, etc. Check out the blog for more details:
@huggingface
#IAmIntel
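As a rough intuition (a toy sketch only, not the actual AutoRound algorithm): learned-rounding methods optimize a small per-weight rounding offset with signed gradient descent instead of always rounding to the nearest level.

```python
import numpy as np

# Toy sketch of learned rounding (illustrative only, not the real
# AutoRound implementation): learn a per-weight offset in [-0.5, 0.5]
# that nudges round() toward lower layer-output error.

def quantize(w, scale, offset):
    return np.clip(np.round(w / scale + offset), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
x = rng.normal(size=(4,)).astype(np.float32)
scale = np.abs(w).max() / 7
offset = np.zeros_like(w)

ref = w @ x                                    # FP32 layer output
for _ in range(50):
    err = quantize(w, scale, offset) @ x - ref
    # straight-through estimate of d(err^2)/d(offset), sign only
    grad_sign = np.sign(np.outer(err, x))
    offset = np.clip(offset - 0.01 * grad_sign, -0.5, 0.5)
```

The real method tunes the offsets (and scales) against calibration data per layer; the point of the sketch is only that rounding becomes a learnable decision rather than a fixed nearest-value rule.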
📢We are hiring full-time interns for LLM-based workflow development (e.g., retrieval-augmented generation for domain chatbot, co-pilot assistant, ...)
📷Location: Shanghai (or working remote in PRC)
🎯Project:
If you're interested, DM me your resume.😀
♥️ Happy Thanksgiving! Thanks to my family, friends, colleagues, partners, collaborators! Love you all!!
🔥We released QLoRA for CPU to help you fine-tune LLMs on your laptop! See below👇
📢Code:
#deeplearning
#intelai
#GenAI
@intel
@huggingface
🔥Want to quantize a 100B+ model on a laptop with 16GB memory? Hmmm, GPTQ does not work...
🎯Intel Neural Compressor supports layer-wise quantization, unlocking LLM quantization on your laptop! Up to a 1000B model❓
📕Blog:
#oneapi
@intel
@huggingface
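The trick that makes this fit in 16GB can be sketched as follows (a conceptual NumPy illustration, not the Neural Compressor API): only one layer's FP32 weights live in memory at a time.

```python
import numpy as np

# Minimal sketch of the layer-wise idea (illustrative, not the INC API):
# instead of materializing the whole model, load, quantize, and free
# one layer at a time, so peak memory is roughly one layer's weights.

def quantize_layer(w, bits=4):
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    q = np.round(w / scale).astype(np.int8)
    return q, np.float32(scale)

def layer_wise_quantize(load_layer, num_layers):
    quantized = []
    for i in range(num_layers):
        w = load_layer(i)        # load a single layer from disk
        quantized.append(quantize_layer(w))
        del w                    # free the FP32 weights before the next layer
    return quantized

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)).astype(np.float32) for _ in range(3)]
result = layer_wise_quantize(lambda i: layers[i], num_layers=3)
```

Because peak memory scales with one layer instead of the whole model, even a 100B+ parameter model becomes quantizable on a laptop.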
🚀Share with you a nice blog "llama.cpp + Intel GPUs". Congrats to the awesome team especially Jianyu, Hengyu, Yu, and Abhilash, and thanks to
@ggerganov
for your great support.
📢Check out the blog:
🎯WIP with ollama now
#iamintel
#llama
@ollama
📢Do you want to make your LLM inference fast, accurate, and infinite (up to M tokens)? Here is the improved StreamingLLM with re-evaluation and shift-RoPE-K support on CPUs!
🔥Code:
📕Doc:
#oneapi
@intel
@huggingface
@Guangxuan_Xiao
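The core cache policy behind StreamingLLM can be sketched in plain Python (an illustration of the paper's idea, not the ITREX implementation): keep a few initial "attention sink" tokens plus a sliding window of recent ones.

```python
# Conceptual sketch of the StreamingLLM cache policy (illustrative only):
# keep the first few "attention sink" tokens plus a sliding window of
# recent tokens, evicting everything in between to bound cache size.

def evict_kv_cache(cache, num_sink=4, window=8):
    """Return the cached token positions kept after eviction."""
    if len(cache) <= num_sink + window:
        return cache
    return cache[:num_sink] + cache[-window:]

positions = list(range(20))      # 20 cached token positions
kept = evict_kv_cache(positions)
# sink tokens 0-3 survive, plus the 8 most recent tokens
```

Since the cache never grows past `num_sink + window` entries, generation length is effectively unbounded; the re-evaluation and shift-RoPE-K tricks mentioned above keep positional encodings consistent after eviction.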
🔥llama.cpp officially supports SYCL, showing promising perf gains over OpenCL. Give it a shot on Intel GPUs, e.g., Arc A770!
PR:
Congrats Abhilash/Jianyu/Hengyu/Yu! Thanks
@ggerganov
for the review! Transformer-like API soon in
@RajaXg
🔥Want to quantize LLMs with the best accuracy & smallest size? Intel Neural Compressor is your choice. We just released v2.6 featuring a SOTA LLM quantizer, outperforming GPTQ/AWQ on typical LLMs.
🎯Quantized LLM leaderboard:
Github:
🤗Intel Extension for Transformers enables running microsoft/phi-2 smoothly on a laptop (faster than human speed🚀). Sample code👇
🎯Code: . Try it and have fun!
🎁DM your favorite LLM. Next will be Solar :)
#iamintel
#intelai
@intel
@huggingface
@murilocurti
🎯Intel Extension for Transformers (powered by Neural Speed) now supports GGUF with an API compatible with Hugging Face Transformers, yet blazing fast (up to 50x?) LLM inference on AI PCs (even on CPU cores)
🔥Repo: (PS: some friends call it the "open-source Groq")
📢Just created an open-source project dedicated to speeding up LLMs
🌟Project:
🤗Looking forward to your suggestions; let me know which topics you're interested in and want to see.
#LLM
@intel
@huggingface
📢Continuing to make LLMs more accessible! Neural Compressor supports layer-wise GPTQ for INT4 quantization of models up to 1TB ~ 10TB (though not open-sourced yet), even on consumer HW!
📕Instruction:
🌟Project:
#oneapi
@intel
@huggingface
#LLM
🚀Thrilled to announce that NeuralSpeed v1.0 alpha is released! Highly optimized INT4 kernels and blazing fast LLM inference on CPUs!
🎯Integrated into ONNX Runtime; WIP: contributing to AutoAWQ
@casper_hansen_
and AutoGPTQ
📔 Blog:
🔥
🚀Highly-efficient x86 INT4 kernels are now available in ONNX Runtime. Use Intel Neural Compressor to quantize LLMs and run efficiently with ONNX Runtime on Intel CPUs!
📔PR:
🎯Source of INT4 kernels:
#intelai
@intelai
@huggingface
🎁Thrilled to share Intel Neural Compressor v2.4 is out on a nice snowy day in SH, a special release for model quantization/compression for LLMs, helping to bring AI everywhere.
👨💻Release notes:
🚀Code:
#iamintel
#intelai
#oneapi
🚀Embedding is super fast on SPR! Just ~500 seconds for 1M samples (512 seq len/sample) using the Intel-optimized BGE model with INC and ITREX, making RAG more accessible!
📷Quick guide:
🎯
#iamintel
#intelai
@intelai
@huggingface
🔥Excited to share new BGE-base-v1.5 INT8 models with <1% accuracy loss from the FP32 baseline on the STS dataset (previously SST2)! BGE for RAG!!
🤗Model-1:
🤗Model-2:
🚀Code:
#oneapi
@IntelSoftware
@huggingface
🎁Happy New Year! We released Intel Neural Compressor v2.4.1 on the last working day in 2023!
📔Release notes:
🎯Code:
🩷Thanks to everyone who has provided support & help to INC. We are committed to making it better in 2024! 🤗
📽️Editing LLM knowledge is possible, e.g., Rank-One Model Editing (ROME).
📔Paper:
🎯Sample code:
💣The technology behind it looks interesting and useful, and it should work together with SFT and RAG to reduce hallucination!
📢Happy to share Intel Extension for Transformers v1.0 released:
🎯 NeuralChat, a custom chatbot on domain knowledge through Hugging Face PEFT. Now, you can create your own chatbot within 1 hour on CPUs.
@humaabidi
@MosheWasserblat
@jeffboudier
🎯When DeepSpeed meets Intel AI SWs, the performance magic happens!
🚀Accelerate Llama 2 inference on Xeon SPR by up to ~1.7x!
📔Blog:
🎁Intel AI SWs:
IPEX:
INC:
and
#oneapi
@intelai
@AIatMeta
@MSFTDeepSpeed
🤗NeuralChat beats GPT4 and Claude on hallucination and factual consistency rate in a new leaderboard👇 initiated by
@vectara
.
📢RL/DPO is getting so important to improve the model quality, particularly for responsible AI.
🎯Code to fine-tune NeuralChat:
🔥Zero accuracy loss for the INT4 model, even compared with the FP16 model.
@hunkims
See the recipes published in the model card or reach us for the quantized models👇
🎯We will be releasing an improved version of low-bit quantized LLM leaderboard with new models on 6/6. Stay tuned!
📢Happy to share INT4 inference on
@intel
GPUs (e.g., PVC & Arc) is available in Intel Ext. for Transformers as an experimental feature (powered by IPEX)! More are coming!!
🎯Release notes:
🚀Code:
#intelai
#intelgpu
@huggingface
📢NeuralChat, an open chat framework created by
@intel
, now supports the
@huggingface
assisted generation to make chatbots more efficient on Intel platforms!
🎯Guide to deploy a chatbot:
🚀Code:
#iamintel
#intelai
Go, ITREX!
💕We love open source and contributed Intel Neural Compressor () to the ONNX community. Now it's available as a quantization tool for ONNX models.
🎯Give it a try and share your feedback!
@NoWayYesWei
@humaabidi
@melisevers
@arungupta
🔥All you need for INT4 LLMs is Intel Neural Compressor (INC). INC v2.5 released with SOTA INT4 LLM quantization (AutoRound) across platforms incl. Intel Gaudi2, Xeon, and GPU.
🎯Models: Llama2, Mistral, Mixtral-MOE, Gemma, Mistral-v0.2, Phi2, Qwen, ...🤗
🎯The embedding model is super important for RAG systems. Here is a tutorial showing how to tune BAAI/bge-base for high performance.
📔
💣 We extended LangChain to load the optimized embedding model and improve inference on Intel platforms.
📢More Intel NeuralChat-v3 7B LLMs are released, and more technical details are published in the blog👇
🎯Blog:
🙌Welcome to use
@intel
NeuralChat-v3🤗, which runs highly efficiently on Intel platforms using Intel AI SWs.
#iamintel
#intelai
@huggingface
🎯High performance INT4 Mistral-7B model available on
@huggingface
, quantized by Intel Neural Compressor (outperforming GPTQ & AWQ) and served efficiently by Intel Extension for Transformers!
🤗 Model:
🌟,
🔥Want to run GGUF models faster on Intel platforms? Here you go - all you need is Intel Extension for Transformers: with up to 7x performance boost and 7x smaller model size! Looking forward to your early feedback! 🎯
🎯Meta launched Llama3. See how it works well across Gaudi, Xeon, GPU, and AIPC! Check out the blog:
🔥Happy to share with you that AutoRound in Intel Neural Compressor was used to quantize the Llama3 INT4 model with SOTA accuracy!
🔥MLPerf Inference v4.0 results are out!
1⃣The only CPU able to achieve 99.9% accuracy
2⃣1.8x perf speedup over last submission
3⃣Summarizes a news article per second in real time
📘Blog:
🎯Code for MLPerf GPT-J:
#MLPerf
#IAmIntel
🎯Want to quantize a Transformer model without coding? Yes, use Neural Coder + Optimum-Intel.
🧨5,000+ Transformer models quantized automatically
🔥Neural Coder demo on Hugging Face Spaces: .
⭐️Check it out and give it a try!
@ellacharlaix
@jeffboudier
@_akhaliq
❓Fine-tuning or RAG? Not sure how to choose?
🎯Fine-tuning is not the only way to make your LLM smarter! You can also try RAG. Here are the recommendations and examples:
📢Reproducible through Intel Extension for Transformers: 🚀
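To make the comparison concrete, here is what the RAG side boils down to (a toy sketch with a bag-of-words stand-in for a real embedding model such as BGE; not the ITREX pipeline):

```python
import numpy as np

# Toy RAG retrieval sketch (illustrative only; a real system would use an
# embedding model such as BGE instead of this bag-of-words "embedder").

def tokenize(text):
    return [t.strip(".,?!").lower() for t in text.split()]

def embed(text, vocab):
    v = np.array([tokenize(text).count(w) for w in vocab], dtype=np.float32)
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query, docs, k=1):
    # cosine similarity over a shared vocabulary, highest first
    vocab = sorted({t for d in docs + [query] for t in tokenize(d)})
    q = embed(query, vocab)
    scores = [float(q @ embed(d, vocab)) for d in docs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

docs = ["Gaudi2 is an AI accelerator.", "Xeon is a server CPU."]
question = "Which chip is the AI accelerator?"
context = retrieve(question, docs)[0]
prompt = f"Context: {context}\nQuestion: {question}"   # fed to the LLM
```

Fine-tuning bakes knowledge into the weights; RAG keeps it in a retrievable store, which is cheaper to update and helps ground answers on private data.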
🎯We are hosting our personalized Stable Diffusion model with a newly-added object "dicoo" on Hugging Face Spaces: . 🤗Try it out! If you want to replicate the fine-tuning, please visit our previous blog:
📢Slimmed BGE embedding models are coming, shortly after quantized ones. More importantly, slimming and quantization can be combined together!
🎁 Private RAG-based chatbots on client devices are now more accessible!
👨💻
🎯
#intelai
#NeuralChat
🎯Happy to announce the source code and examples of "Fast DistilBERT on CPU" (accepted as a NeurIPS'22 paper) were released:
🧨Included in Top NLP Papers Nov'22 by
@CohereAI
and highlighted as "Fast Transformers on CPUs with SOTA performance" by
@Synced_Global
!
📢"Efficient Post-training Quantization with FP8 Formats" is published! Thanks to the great collaborators!
🎯We released all the FP8 recipes in Intel Neural Compressor: . Check it out!
⚡️Breaking news: Open Platform for Enterprise AI (OPEA) is announced by Pat! A lot of great partners👍
🎯The base code is here: , powered by ecosystem projects such as Transformers, TGI, LangChain and the technology from Intel Extension for Transformers.
👨💻If you missed the CES 2024 Intel copilot demo, no worries, here is the video.
🎯Features: 1) run on your PC for copilot chat, so it's 100% free and safe; 2) run on server for code generation, so it may generate better code; 3) smart model switch. VS plugin is coming🚀
#intelai
@intel
💣Happy to announce INT4 NeuralChat-7B models available on
@huggingface
, powered by SOTA INT4 algorithm developed by Intel, yet compatible with AutoGPTQ!
🤗
🤗
📔Paper:
🎯Sample code:
📢INT4 GPTQ and RTN landed in ONNX Runtime through Intel Neural Compressor. AI on PC is coming!
📔PR: Thanks to Yuwen, Mengni, and Yufeng!
🌟Code:
#intelai
#onnxruntime
#neuralcompressor
🎁Happy to announce Intel Extension for Transformers supports INT8 quantization for MSFT Phi, making Phi inference more efficient and accessible than ever!
📔Quick guide:
🎯Code available:
#iamintel
#intelai
@intel
@huggingface
It has been a great experience to see the rapid growth of LLMs in the open-source community. We are proud to see
@intelai
created LLMs & datasets being welcomed, used, discussed, and improved. Go, Intel LLMs!
Congrats to Intel team members Haihao Shen and Kaokao Lv for their fine-tuned version of Mistral 7B having hit the top of the list on the
@huggingface
LLM leaderboard last week:
Fine-tuned on 8x Intel Gaudi2 Accelerators.
🔥Want to use FP8 inference easily? Intel Neural Compressor is your best choice:
🎯Shared with you our MLSys'24 camera-ready paper: Efficient Post-Training Quantization with FP8 Formats
🤗
@_akhaliq
@navikm
@huggingface
#IAmIntel
🎯Want to enable audio in your chatbot? It takes just a few minutes.
📕Here is a guide for you, including ASR, TTS, audio processing, audio streaming, multi-lang EN & CN:
📢Optimized code: with🤗models
#iamintel
#intelai
@intel
@huggingface
🎯How do MX data types work for LLMs? New quantization recipes validated by Intel using Neural Compressor; the HW architecture and data types were proposed by MSFT and defined by OCP
📢Here is a tutorial: with source code publicly available in
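In a nutshell (a simplified NumPy sketch of the OCP microscaling idea, not the Neural Compressor recipes): each block of values shares one power-of-two scale, and elements are stored in a narrow format; int8 mantissas here stand in for the FP4/FP6/FP8 element types.

```python
import numpy as np

# Toy sketch of microscaling (MX) quantization (illustrative only):
# every block of `block` values shares a single power-of-two scale,
# while the per-element storage is a narrow format (int8 here).

def mx_quantize(x, block=8):
    x = x.reshape(-1, block)
    # shared power-of-two scale per block, covering the block's max
    scales = 2.0 ** np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    q = np.round(x / scales * 127).astype(np.int8)
    return q, scales

def mx_dequantize(q, scales):
    return q.astype(np.float32) / 127 * scales

x = np.linspace(-1.0, 1.0, 32, dtype=np.float32)
q, s = mx_quantize(x)
x_hat = mx_dequantize(q, s).reshape(-1)
```

The power-of-two scale keeps dequantization cheap (a shift, not a multiply, in hardware), which is part of why the format maps well onto accelerators.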
🌟Happy to announce Intel Extension for Transformers v1.4 released with a lot of improvements in building GenAI applications on Intel platforms!
🎯Check out the release notes:
🤗
@intel
+
@huggingface
= one of the best GenAI platforms
🚀Happy to support "upstage/SOLAR-10.7B-Instruct-v1.0" in Intel Extension for Transformers!
@upstageai
@hunkims
. INT4 inference is available with one parameter change from "load_in_8bit" to "load_in_4bit".
📢Next one will be Zephyr🙌
👇Check out the sample code and give it a try!
📢Intel Extension for Transformers () supports INT4 and low-bit inference on both CPUs and GPUs!
📔Simple usage guide:
🔥All you need is an Intel GPU to run LLMs
@huggingface
🤗
📢When AI meets cybersecurity, see how Intel NeuralChat LLM helps here. Happy to share a nice blog "Harnessing the Intel NeuralChat 7B Model for Advanced Fraud Detection". Congrats
@Saminusalisu
!
🎯Check out the details:
#intelai
#iamintel
@humaabidi
🎁Here is a tutorial on how to optimize a natural language embedding model and extend LangChain to enable the optimizations. Check out more details:
🤗Code: . Star the project if you find it useful.
🌟Happy Chinese New Year! 🎇
👨💻2023 was the year of open LLMs. Time to make predictions for 2024? DM your thoughts.
📢Re-share the blog from
@clefourrier
: , incl. Intel NeuralChat-7B and DPO dataset😀
🤗We hope to contribute more to open-source LLM community in 2024!
#iamintel
@huggingface
🔥Happy to announce Intel Extension for Transformers v1.3.2 released
📔Release notes:
🎯Highlights: enable popular serving e.g.,
@huggingface
TGI, vLLM, and Triton to build highly efficient chatbot on Intel platforms such as Gaudi2 with a few lines of code
🥳Happy to share with you the Intel optimizations
for Diffusers textual inversion and the fine-tuning demo of Stable Diffusion on Spaces!
👉 Intel optimizations:
🎯Spaces:
🤗Thanks to Patrick,
@anton_lozhkov
@_akhaliq
from HF!
🆕Thrilled to share with you that highly-efficient 4-bit kernels have been integrated into AutoAWQ. Congrats Penghui and thanks
@casper_hansen_
for the review! Now, you can run AutoAWQ super fast on CPU with PR:
🔥Base code from
📢The Intel Copilot at CES 2024 automatically created a chatbot for the event! Watch the video of the Great Minds keynote: delivered by Intel leaders!!
🎯The copilot is built on top of . The code/ext will be released soon. Stay tuned!🚀
🔥Thrilled to announce OPEA v0.6 released, featuring 10 micro-service components (e.g., embedding, LLMs), 4 GenAI examples, and 1-click K8S deployment.
🎯Github: Give it a try and create a GenAI app with your private data!
@NoWayYesWei
@humaabidi
@melisevers
🤗Want to build an enterprise-grade RAG system? Efficient embedding is what you want. Here is a nice blog from Intel and
@huggingface
friends on "Intel Fast Embedding" with and
#IAmIntel
@MosheWasserblat
📢Exciting news! Stable Diffusion on Gaudi!! We released Intel Extension for Transformers to simplify LLM fine-tuning and accelerate LLM inference further🚀
In this installment of "Behind the Compute", a series dedicated to offering insights for others to harness the power of generative AI, we compared the training speed of
@Intel
Gaudi 2 accelerators versus
@Nvidia
's A100 and H100 for two of our models. (1/3)
⚡️Leaderboard snapshot on 2024/5/10. INC-quantized models using AutoRound () outperform the models (publicly available on HuggingFace) quantized by other popular approaches such as GPTQ, AWQ, GGUF, etc. Pick the best INT4 model for your use!
📢We are hiring full-time interns for efficient LLM inference.
🔥Group: Intel/DCAI/AISE
🎯Location: Shanghai, Zizhu
😀Working projects:
* INC:
* ITREX:
If you are interested in LLM compression and inference, DM me your resume.😀