
George Grigorev
@iamgrigorev
2K Followers · 2K Following · 2K Media · 8K Statuses
now: exploring open source; prev: training @togethercompute, chatbots & diffusion @Snap. rare specialty coffee lover
London
Joined June 2012
@Grad62304977 reached out and suggested the value residuals trick from @cloneofsimo -- you keep the values from the first block and do a weighted sum in every other layer:

# v1 is the value vector from the first block
# alpha1, alpha2 - learnable params
v = alpha1 * v + alpha2 * v1

seems to be visibly …
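For concreteness, a minimal PyTorch sketch of the trick as described above -- the module structure and shapes are my assumptions, not reference code:

import torch
import torch.nn as nn

class ValueResidual(nn.Module):
    # value-residual sketch: v1 is cached from the first block, and every
    # later layer mixes it back in with two learnable scalars
    def __init__(self):
        super().__init__()
        self.alpha1 = nn.Parameter(torch.tensor(1.0))  # weight on this layer's values
        self.alpha2 = nn.Parameter(torch.tensor(0.0))  # weight on the block-0 values

    def forward(self, v: torch.Tensor, v1: torch.Tensor) -> torch.Tensor:
        # v, v1: (batch, heads, seq, head_dim); v1 comes from the first block
        return self.alpha1 * v + self.alpha2 * v1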
MTP (multi-token prediction) is becoming very popular -- who has a reference implementation?
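Until someone shares one, here is a minimal sketch of the usual formulation -- k extra linear heads, where head i is trained to predict the token i+1 positions ahead; the class and head design are my assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    # k linear heads over the trunk's hidden states; head i predicts token t+1+i
    def __init__(self, d_model: int, vocab: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(k))

    def loss(self, h: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) hidden states; tokens: (batch, seq) input ids
        total = torch.zeros((), device=h.device)
        for i, head in enumerate(self.heads):
            offset = i + 1                       # head i looks offset tokens ahead
            logits = head(h[:, :-offset])        # (batch, seq - offset, vocab)
            target = tokens[:, offset:]          # targets shifted by offset
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1))
        return total / len(self.heads)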
⚡️Ling-flash-2.0⚡️ is now open source.
100B MoE LLM • only 6.1B active params
--> 3x faster than 36B dense (200+ tok/s on H20)
--> Beats ~40B dense LLMs on complex reasoning
--> Powerful coding and frontend development
Small activation. Big performance.
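The "small activation, big performance" arithmetic comes from top-k expert routing: each token only runs a handful of experts, so the active parameter count is a small fraction of the total. A toy sketch with made-up sizes, not Ling's actual config:

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # toy top-k mixture-of-experts layer: per token, only k of n_experts run
    def __init__(self, d: int = 512, n_experts: int = 64, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d); naive per-token loop for clarity, not speed
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out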
i need to seriously consider vscode instead of cursor now. the whole ai coding stack became free*: no need to pay for cursor, use vscode (-$20); no need to pay for claude code, you already have a chatgpt plus / pro subscription, just use codex. the only thing (that many people described) is …
friendship ended with triton now gluon is my new best friend
similar to the type of work i've been exploring, hope to get some more results by testing out different combinations!
swiglu-style gates working so well for attention (and not just in the ffn layers) is a beautiful thing. as it turns out, the "divine benevolence" might just be caused by better inductive biases for controlling where information goes.
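One plausible form of such a gate -- an elementwise SiLU gate on the attention output before the output projection. Placement varies between papers, so treat this as a sketch rather than any specific model's code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionOutput(nn.Module):
    # out = W_o(attn_out * silu(W_g x)): the gate decides, per channel,
    # how much of the attention output flows into the residual stream
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: residual-stream input, attn_out: raw attention output, both (b, s, d)
        return self.out(attn_out * F.silu(self.gate(x)))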
We need to follow @yorambac for updates on when ml experimentation agents will be capable of reproducing the results of every paper listed here
and the contents of the repos are incoming! i use ruff / biome to format files before uploading. Total amount of tokens: ~100B https://t.co/a8q91OVuf7
The original The Stack V2, which is widely used in open-source research, is outdated and too noisy. Today I am sharing an updated The Stack V2 dataset (a filtered subset of smol-ids from the original dataset). All repos from the original dataset were parsed with the GitHub API and files …
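A dataset like this would typically be streamed from the Hub with the datasets library; the repo id below is a placeholder, since the actual link is shortened above:

from datasets import load_dataset

# "user/the-stack-v2-filtered" is a placeholder, not the real repo id;
# streaming avoids downloading ~100B tokens of files up front
ds = load_dataset("user/the-stack-v2-filtered", split="train", streaming=True)
for example in ds.take(3):
    print(example.keys())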
I’m so high on non-alcoholic drinks that we don’t need alcohol anymore:
- very good rare specialty coffee (yes, caffeine is still a thing)
- chinese/japanese tea or some special rare herbal tea extracts (buckwheat tea, anyone?)
- non-alcoholic craft beer (no lager, but sours or interesting …
The new AirPods Pro live translation demo, when both people are wearing AirPods, is absolutely amazing
🚀 Excited to announce QuTLASS v0.1.0 🎉 QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]
1. i don't quite understand where to put constexpr and where it expects plain ints
2. i'm not quite sure how ppl write softmax in streaming fashion (not materializing the full row). also don't understand how to put values into a tensor properly (this tl.where with take and updates thing seems …
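On point 2, the usual answer is the online softmax recurrence: keep a running max and a running rescaled sum of exponentials. A plain numpy sketch of the math -- a Triton kernel streams blocks instead of scalars but updates the same two accumulators:

import numpy as np

def online_softmax(x: np.ndarray) -> np.ndarray:
    # softmax over a row without materializing exp(x) for the full row at once
    m = -np.inf   # running max
    s = 0.0       # running sum of exp(x_i - m)
    for v in x:
        m_new = max(m, v)
        s = s * np.exp(m - m_new) + np.exp(v - m_new)  # rescale old sum to new max
        m = m_new
    return np.exp(x - m) / s

x = np.random.randn(16)
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())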
Implemented SuperBPE in my codebase for training a tokenizer from scratch. Notable differences:
- In order to enable multi-word merges, I first stop training at 80% of total steps and re-tokenize the train set with the existing merges.
- Unlike normal BPE, SuperBPE no longer splits by …
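The two-stage schedule described above as a sketch, with hypothetical helpers train_bpe / apply_merges standing in for the actual trainer:

# train_bpe(corpus, steps, merges, split_by_space) and apply_merges(corpus, merges)
# are hypothetical stand-ins for a real BPE trainer
def train_superbpe(corpus, total_steps):
    stage1 = int(0.8 * total_steps)
    # stage 1: ordinary BPE with whitespace pretokenization for 80% of steps
    merges = train_bpe(corpus, steps=stage1, merges=[], split_by_space=True)
    # re-tokenize the train set with the merges learned so far
    corpus = apply_merges(corpus, merges)
    # stage 2: continue without the whitespace split, enabling multi-word merges
    merges = train_bpe(corpus, steps=total_steps - stage1,
                       merges=merges, split_by_space=False)
    return merges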
wondering how to improve training speed further... i'm sure there's a way to 'restart' learning during pre-training so the model is in a state of 'chaos' but not diverging, then finds good directions again, learns, then we restart again, instead of simply following one …
@vikhyatk do you really observe weight norms plateauing and a loss of 'plasticity'? I wonder if we should really track this for most efficient learning, e.g. gradient norm shrinking → signals the model has little left to learn (flat loss landscape). maybe there's a very simple way to check that
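Tracking both signals is a few lines in PyTorch -- a sketch, assuming it is called right after loss.backward():

import torch

def norm_stats(model: torch.nn.Module) -> dict:
    # global weight norm and gradient norm; a plateauing weight norm plus a
    # shrinking grad norm is the "little left to learn" signal from above
    w_sq, g_sq = 0.0, 0.0
    for p in model.parameters():
        w_sq += p.detach().float().pow(2).sum().item()
        if p.grad is not None:
            g_sq += p.grad.detach().float().pow(2).sum().item()
    return {"weight_norm": w_sq ** 0.5, "grad_norm": g_sq ** 0.5}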
i was stumbling with the triton install and could not understand why, under `uv run python`, my `import triton` works fine and even `import triton.language as trl; trl.cdiv(10,3)` works, but it shows an error that it needs to be called inside a kernel. but when running a kernel i spent …
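For reference, the split that resolves the confusion: triton.cdiv is a host-side helper for computing grid sizes, while tl.* ops are meant to run inside a @triton.jit kernel. A minimal vector-add showing both sides:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # BLOCK is a compile-time constant, hence tl.constexpr
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)  # host-side cdiv, outside the kernel
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)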