Georgi Gerganov
@ggerganov
Followers
44K
Following
3K
Media
251
Statuses
1K
24th at the Electrica puzzle challenge | https://t.co/baTQS2bdia
Joined May 2015
sam.cpp 👀. Inference of Meta's Segment Anything Model on the CPU. Project by @YavorGI - powered by
35
277
2K
The future of on-device inference is ggml + Apple Silicon. You heard it here first!
Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations @ggerganov! This is a triumph.
38
180
2K
ggml will soon run on billions of devices. @apple, don't sleep on it 🙃
61
125
1K
Just added support for all LLaMA models. I'm out of disk space, so if someone can give this a try for 33B and 65B, that would be great 😄. See updated instructions in the README. Here is LLaMA-13B at ~10 tokens/s
I think I can make 4-bit LLaMA-65B inference run on a 64 GB M1 Pro 🤔. Speed should be somewhere around 2 tokens/sec. Is this useful for anything?
26
139
1K
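A rough sketch of the arithmetic behind this estimate (the bandwidth figure and the bandwidth-bound assumption are mine, not from the tweet):

```python
# Napkin math for 4-bit LLaMA-65B on a 64 GB M1 Pro.
# Assumptions (not from the tweet): ~200 GB/s memory bandwidth, and that
# text generation is memory-bandwidth bound, i.e. every generated token
# requires streaming the full set of weights once.
params = 65e9                  # LLaMA-65B parameter count
bytes_per_param = 0.5          # 4-bit quantization ≈ 0.5 bytes/param (ignoring scales)
weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.1f} GB -> fits in 64 GB with room for the KV cache")

bandwidth_gb_s = 200           # assumed M1 Pro memory bandwidth
ceiling_tps = bandwidth_gb_s / weights_gb
print(f"bandwidth-bound ceiling: ~{ceiling_tps:.1f} tok/s; "
      "the ~2 tok/s estimate sits well below this ceiling")
```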
llama.cpp just got access to the new Copilot for Pull Requests technical preview by @github. Just add tags like "copilot:all" / "copilot:summary" / "copilot:walkthrough" to your PR comment and the magic happens 🪄
15
95
980
llama2.c running in a web-page. Compiled with Emscripten and modified the code to predict one token per render pass. The page auto-loads 50MB of model data - sorry about that 😄.
My fun weekend hack: llama2.c 🦙🤠. Lets you train a baby Llama 2 model in PyTorch, then inference it with one 500-line file with no dependencies, in pure C. My pretrained model (on TinyStories) samples stories in fp32 at 18 tok/s on my MacBook Air M1 CPU.
16
144
886
llama.cpp now supports distributed inference across multiple devices via MPI. This is possible thanks to @EvMill's work. Looking for people to give this a try and attempt to run a 65B LLaMA on a cluster of Raspberry Pis 🙃
19
137
855
llama.cpp releases now ship with pre-built macOS binaries. This should reduce the entry barrier for llama.cpp on Apple devices. Thanks to @huggingface for the friendly support 🙏
16
67
724
llama.cpp is standing ground against the behemoths. The CUDA backend is contained in a single C++ file so it allows for very easy deployment and custom modifications. (pp - prefill, tg - text gen)
Trying out the new TensorRT-LLM framework and getting some pretty good performance out of the box with 3090s. 107 tokens/sec int8 and 54 tok/sec bf16 for llama-2 7B models (not much work to set up either). Get 160+ tokens/sec on 2x3090s (these are just batch_size=1)
12
47
570
The GGUF file format is a great example of the cool things that an open-source community can achieve. Props to @philpax_ and everyone else involved in the design and implementation of the format. I'm thankful and happy to see that it finds adoption in ML.
At @huggingface, we are adding more support for GGUF (model format by @ggerganov). The number of GGUF models on the hub has been exploding and doesn't look like it is gonna slow down 🔥. See more at:
11
65
502
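For context on what the format looks like on disk, a minimal sketch of reading a GGUF header (field layout per the GGUF spec for version 2 and later; the file name in the usage comment is a placeholder):

```python
# Minimal GGUF header reader (sketch). Layout assumed from the GGUF spec
# for version >= 2: magic "GGUF", uint32 version, uint64 tensor count,
# uint64 metadata key/value count, all little-endian.
import struct

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Example (hypothetical file name):
# print(read_gguf_header("llama-2-7b.Q4_K_M.gguf"))
```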
ggml inference tech making its way into this week’s @apple M4 announcements is a great testament to this. IMO, Apple Silicon continues to be the best consumer-grade hardware for local AI applications. For next year, they should move copilot on-device.
15
47
547
The ggml roadmap is progressing as expected with a lot of infrastructural development already completed. We now enter the more interesting phase of the project - applying the framework to practical problems and doing cool stuff on the edge.
Took the time to prepare a ggml development roadmap in the form of a GitHub Project. This sets the priorities for the short/mid term and will offer a good way for everyone to keep track of the progress that is being made across related projects.
7
41
522
whisper.cpp now supports @akashmjn's tinydiarize models. These fine-tuned models offer experimental support for speaker segmentation by introducing special tokens for marking speaker changes.
16
64
501
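Roughly, the model emits a dedicated token at each speaker change, and downstream code can split the transcript on it. A sketch with a hypothetical marker string (the actual special token used by the tinydiarize models may differ):

```python
# Splitting a transcript on a speaker-change marker (sketch).
# "[SPEAKER_TURN]" is a hypothetical placeholder for the special token a
# tinydiarize-style model emits at speaker changes; it is not the real token.
SPEAKER_TURN = "[SPEAKER_TURN]"

def split_speakers(transcript):
    segments = [s.strip() for s in transcript.split(SPEAKER_TURN) if s.strip()]
    # Alternate A/B labels for a two-speaker conversation.
    return [(f"Speaker {'AB'[i % 2]}", seg) for i, seg in enumerate(segments)]

print(split_speakers("Hello there. [SPEAKER_TURN] Hi, how are you? [SPEAKER_TURN] Great."))
```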
Here is what a properly built llama.cpp looks like. Running 7B on a 2-year-old Pixel 5 at 1 token/sec. Would be interesting to see what an interactive session feels like.
10
67
448
GGUF My Repo by @huggingface. Create quantized GGUF models fully online - quickly and securely. Thanks to @reach_vb, @pcuenq and team for creating this HF Space! In the video below I give it a try to create a quantized 8-bit model of Gemma 2B - it took about
24
89
456
Very cool experiment by @chillgates_. Distributed MPI inference using llama.cpp with 6 Raspberry Pis - each one with 8GB RAM "sees" 1/6 of the entire 65B model. Inference starts around ~1:10. Follow the progress here:
Yeah. I have ChatGPT at home. Not a silly 7b model. A full-on 65B model that runs on my pi cluster, watch how the model gets loaded across the cluster with mmap and does round-robin inferencing 🫡 (10 seconds/token) (sped up 16x)
11
73
431
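A quick sanity check on why six 8 GB Pis are enough (the 4-bit size is an approximation and mmap/runtime overhead is ignored):

```python
# Why 1/6 of a 4-bit 65B model fits on an 8 GB Raspberry Pi (sketch).
params = 65e9
bytes_per_param = 0.5                        # ~4-bit quantization
model_gb = params * bytes_per_param / 1e9    # ~32.5 GB of weights in total
nodes = 6
per_node_gb = model_gb / nodes               # ~5.4 GB per Pi
print(f"each Pi maps ~{per_node_gb:.1f} GB of the ~{model_gb:.1f} GB model "
      "-> fits under 8 GB of RAM")
```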
napkin math ahead:
- buy 8 Mac Minis (200GB/s, ~$1.2k each)
- run LLAMA_METAL=1 LLAMA_MPI=1 for interleaved pipeline inference
- deploy on-premise, serve up to 8 clients in parallel at 25 t/s / 4-bit / 7B
Is this cost efficient? Energy wise? Thanks to @stanimirovb for the idea.
24
26
408
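The napkin math, spelled out (the hardware figures are from the tweet; treating generation as purely bandwidth-bound is an assumption):

```python
# 8 Mac Minis serving 8 clients with pipelined 4-bit 7B inference (sketch).
bandwidth_gb_s = 200                      # per Mac Mini, from the tweet
model_gb = 7e9 * 0.5 / 1e9                # 4-bit 7B ≈ 3.5 GB of weights
ceiling_tps = bandwidth_gb_s / model_gb   # ~57 tok/s bandwidth-bound ceiling per node
target_tps = 25                           # per-client target from the tweet
cost_usd = 8 * 1200                       # ~$9.6k for the cluster
print(f"ceiling ~{ceiling_tps:.0f} tok/s per node, target {target_tps} tok/s "
      f"(~{target_tps / ceiling_tps:.0%} of ceiling), hardware ~${cost_usd:,}")
```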
Powered by: ggml / whisper.cpp / llama.cpp / Core ML
STT: Whisper Small
LLM: 13B LLaMA
TTS: @elevenlabsio
The Whisper Encoder is running on the Apple Neural Engine. Everything else is optimized via ARM NEON and Apple Accelerate.
10
18
363
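In outline, the demo is a three-stage local pipeline. A sketch with hypothetical wrapper functions (the real demo drives whisper.cpp, llama.cpp and the ElevenLabs API directly; these callables are stand-ins, not real APIs):

```python
# Local voice-assistant turn (sketch). transcribe/generate/synthesize are
# hypothetical wrappers around whisper.cpp (Whisper Small), llama.cpp
# (13B LLaMA) and a TTS backend.
def assistant_turn(audio_samples, transcribe, generate, synthesize):
    text = transcribe(audio_samples)   # STT: Whisper, encoder on the Neural Engine
    reply = generate(text)             # LLM: 13B LLaMA via llama.cpp
    return synthesize(reply)           # TTS: e.g. ElevenLabs
```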
Playing some chess using voice. WASM whisper.cpp with a quantized tiny model + grammar sampling (by @ejones). Runs locally in the browser. Not perfect, but I think pretty good overall! Try it here:
8
43
356
To run the released model with the latest llama.cpp, use the "convert-unversioned-ggml-to-ggml" Python script and apply the following patch to llama.cpp. The latest llama.cpp offers significant performance and accuracy improvements in the inference computation.
I'm excited to announce the release of GPT4All, a 7B param language model finetuned from a curated set of 400k GPT-3.5-Turbo assistant-style generations. We release 💰800k data samples💰 for anyone to build upon and a model you can run on your laptop! Real-time Sampling on M1 Mac
6
47
344
Great write-up. The CoreML branch speeds up just the Encoder. At the same time, the master branch already has an additional ~2-3x speedup in the Decoder thanks to recent work on llama.cpp. When we merge these two together, the performance will be mind-blowing.
Hello Transcribe 2.2 with CoreML is out, now 3x-7x faster 🚀🥳. Blog post: App Store: #OpenAI #AI #Whisper #CoreML
9
29
319
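A rough way to see why combining the two branches pays off (the 40/60 encoder/decoder time split and the exact speedup factors are assumptions for illustration, not measured figures):

```python
# Combined speedup when encoder and decoder are accelerated separately.
enc_frac, dec_frac = 0.4, 0.6   # assumed baseline time split (illustrative)
enc_speedup = 3.0               # CoreML branch: encoder on the Neural Engine
dec_speedup = 2.5               # master branch: ~2-3x faster decoder, midpoint

combined = 1.0 / (enc_frac / enc_speedup + dec_frac / dec_speedup)
print(f"combined speedup ≈ {combined:.1f}x over the original baseline")
```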
Let's bring llama.cpp to the clouds! You can now run llama.cpp-powered inference endpoints through Hugging Face with just a few clicks. Simply select a GGUF model, pick your cloud provider (AWS, Azure, GCP) and a suitable GPU/CPU node, and you are good to go. For more info, check:
Wanna see something cool? You can now deploy GGUF models directly onto Hugging Face Inference Endpoints! Powered by llama.cpp @ggerganov. Try it now -->
9
45
313
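A sketch of querying such an endpoint once it is deployed (the URL and token are placeholders; the request shape assumes the OpenAI-compatible /v1/chat/completions route that llama.cpp's server exposes):

```python
# Querying a deployed llama.cpp-powered endpoint (sketch).
import requests

ENDPOINT = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
TOKEN = "hf_..."                                                   # placeholder

resp = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Hello from a GGUF endpoint!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```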