George Grigorev

@iamgrigorev

Followers
1,798
Following
659
Media
1,506
Statuses
6,992

fine-tuning service @ together ai

London
Joined June 2012
Pinned Tweet
@iamgrigorev
George Grigorev
2 months
I have first results to share from my re-implementation of Apple's work on training task-specific LoRAs on top of a small LM, for example for summarization (sketch below). Github: Huggingface:
1
1
9
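A minimal sketch of that setup, assuming the Hugging Face transformers/peft stack; the base model, target modules, and hyperparameters below are illustrative placeholders, not necessarily the ones used in the linked repo:

```python
# Hedged sketch: attach a task-specific (e.g. summarization) LoRA to a small LM.
# Base model and LoRA hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2-1.5B"  # assumed small base model (mentioned in later tweets)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```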
@iamgrigorev
George Grigorev
4 months
GPT-4 Turbo costs $30/1M tokens; Mixtral 8x22B costs $1.2/1M tokens. Quality is basically the same. The level of democratization is unbelievable.
24
71
849
@iamgrigorev
George Grigorev
5 months
I have implemented Mixture-of-Depths, and it shows a significant memory reduction during training and a 10% speed increase (routing sketch below). I will verify whether it achieves the same quality with 12.5% active tokens. Thanks @haeggee for the initial code
Tweet media one
6
52
364
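For context, a rough sketch of the routing idea; module names are illustrative and not taken from the actual code. A small router scores every token, only the top capacity_factor fraction passes through the expensive layer, and the rest skip it entirely via the residual stream:

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Sketch of Mixture-of-Depths routing: only the top-k scored tokens pay
    for the layer's compute; unselected tokens pass through unchanged."""
    def __init__(self, layer: nn.Module, d_model: int, capacity_factor: float = 0.125):
        super().__init__()
        self.layer = layer                       # attention + MLP update, without the residual add
        self.router = nn.Linear(d_model, 1)      # per-token scalar routing score
        self.capacity_factor = capacity_factor

    def forward(self, x):                        # x: (batch, seq, d_model)
        b, s, d = x.shape
        k = max(1, int(s * self.capacity_factor))
        scores = self.router(x).squeeze(-1)      # (batch, seq)
        topk = scores.topk(k, dim=-1).indices    # tokens that get full compute
        idx = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = x.gather(1, idx)              # (batch, k, d_model)
        out = x.clone()                          # everyone else skips the layer
        out.scatter_(1, idx, selected + self.layer(selected))
        return out
```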
@iamgrigorev
George Grigorev
4 months
I have tried the schedule_free optimizer from FAIR: the quality looks similar, but look at the variance of the metric! It's very smooth, so training is also very stable. Same learning rate. I believe this is a big deal for LLM pre-training.
Tweet media one
4
9
109
@iamgrigorev
George Grigorev
4 months
Fireworks even offers it for $0.9/1M, crazy
0
0
32
@iamgrigorev
George Grigorev
4 months
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. Very interesting!!! Train a multi-scale VQ-VAE, then predict the whole residual to the VQ-VAE token matrix in one decoding step (conceptual sketch below).
Tweet media one
0
0
24
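A very rough conceptual sketch of the next-scale decoding loop; `transformer` and `vqvae` here are placeholder modules with assumed interfaces, not the authors' API:

```python
import torch

def generate_next_scale(transformer, vqvae, scales=(1, 2, 4, 8, 16), batch=1):
    """Conceptual loop only: at each scale the model predicts the entire s*s
    token map in one forward pass, conditioned on all coarser maps,
    instead of decoding one token at a time."""
    token_maps = []
    for s in scales:
        context = (torch.cat([m.flatten(1) for m in token_maps], dim=1)
                   if token_maps else None)
        logits = transformer(context, target_hw=(s, s))   # (batch, s*s, vocab), all at once
        next_map = logits.argmax(-1).view(batch, s, s)    # whole next-scale token map
        token_maps.append(next_map)
    return vqvae.decode_from_token_maps(token_maps)        # image from the multi-scale tokens
```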
@iamgrigorev
George Grigorev
4 months
@LevanKvirkvelia On coding and math tasks, benchmarks show it's better than Cmd-R+. For other tasks there's no instruct version yet. So GPT-4-level quality
2
0
16
@iamgrigorev
George Grigorev
5 months
Seems that it's working. The pink line is the MoD run.
Tweet media one
1
0
14
@iamgrigorev
George Grigorev
4 years
@norpadon Your opinion is correct, but you're already out of touch)) Over 7 years a lot has changed in the mindset and in the EGE process itself. To prepare, it's not enough to just read the textbook; you read it in class and after class anyway) But tutors are an optional part of exam prep.
2
0
9
@iamgrigorev
George Grigorev
4 months
To use it, you have to add optimizer.eval() and optimizer.train() in the train/eval sections of your code and disable your regular scheduler. It supports warmup steps as well. I have a usage example in my research repo (rough sketch below):
1
0
9
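Roughly what the usage looks like with the schedulefree package; the learning rate and warmup_steps are placeholders, and model/train_loader stand in for the existing training loop:

```python
import schedulefree

# No LR scheduler: the optimizer itself handles warmup and averaging,
# but it must be switched between train/eval modes alongside the model.
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=3e-4, warmup_steps=1000)

for batch in train_loader:
    model.train()
    optimizer.train()              # optimizer in training mode during updates
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
optimizer.eval()                   # switch before evaluation / checkpointing
```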
@iamgrigorev
George Grigorev
6 months
@levelsio @ryan_huang_1 try modal with cached zero booting, works like a charm with <1s start time.
3
1
8
@iamgrigorev
George Grigorev
4 months
Ordered a knob keyboard, now my productivity will skyrocket... in Q3...
Tweet media one
2
0
9
@iamgrigorev
George Grigorev
4 months
I published the newly released CodeQwen1.5 7B () to Ollama! ollama run thepowerfuldeez/code_qwen:7b "Write python script to compute fibonacci numbers"
0
1
9
@iamgrigorev
George Grigorev
4 months
Cooking llama3 x open_hermes. This could potentially remove censorship. Will try ORPO afterwards. axolotl ❤️
Tweet media one
0
0
8
@iamgrigorev
George Grigorev
5 months
UPD: now applying a sigmoid to the router logits before multiplying the top-k processed tokens by them, as stated in the paper (snippet below).
1
1
8
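As a small illustration of that change, a hypothetical helper with assumed shapes, not the exact code:

```python
import torch

def gate_block_output(selected, block_out, router_logits):
    """Sketch: scale each selected token's block output by the sigmoid of its
    router logit before adding it back to the residual stream, so the router
    receives a gradient. Shapes: selected/block_out (batch, k, d), router_logits (batch, k)."""
    gate = torch.sigmoid(router_logits).unsqueeze(-1)    # (batch, k, 1)
    return selected + gate * block_out
```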
@iamgrigorev
George Grigorev
4 months
@yar_vol I tried it the moment it arrived. Vibe check passed! However, I can't compare on the same tasks without an instruct version :)
1
0
7
@iamgrigorev
George Grigorev
4 months
There's a nice concept in performance optimization that reminded me of @UnslothAI's recent gradient accumulation trick with async offloading. It's called recursive doubling (toy example below) 🧵
1
0
6
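A toy illustration of recursive doubling, here as an allreduce over p ranks simulated with plain Python lists instead of real communication; it only shows the log2(p) exchange pattern:

```python
def allreduce_recursive_doubling(values):
    """Each round, every rank exchanges its partial sum with a partner at
    distance `step`, and the distance doubles each round. After log2(p)
    rounds, every rank holds the full reduction. p must be a power of two."""
    p = len(values)
    acc = list(values)                 # acc[r] is rank r's current partial sum
    step = 1
    while step < p:
        new_acc = list(acc)
        for rank in range(p):
            partner = rank ^ step      # pair up ranks at distance `step`
            new_acc[rank] = acc[rank] + acc[partner]
        acc = new_acc
        step *= 2
    return acc

print(allreduce_recursive_doubling([1, 2, 3, 4]))  # [10, 10, 10, 10]
```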
@iamgrigorev
George Grigorev
1 year
@raycastapp lmao let's gooo!!!!!!
1
0
6
@iamgrigorev
George Grigorev
3 months
Finally the day I can mute all AI influencers on Twitter, as they pop up in my feed off the latest big release :))
0
0
6
@iamgrigorev
George Grigorev
4 months
Mixture-of-experts with asymmetric experts. Out of 9 experts, 4 are full size, 4 are 0.25x size, and 1 is an identity function (mixture-of-depths). This can easily be configured with mlp_width (sketch below).
3
0
5
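A sketch of how such an asymmetric expert list could be built; the `make_expert` factory and the mlp_width convention here are illustrative assumptions, not the actual config code:

```python
import torch.nn as nn

def make_expert(d_model, mlp_width):
    """Illustrative expert factory: mlp_width scales the MLP hidden size.
    mlp_width == 0 gives the identity 'expert' (mixture-of-depths style skip)."""
    if mlp_width == 0:
        return nn.Identity()
    hidden = int(4 * d_model * mlp_width)
    return nn.Sequential(nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model))

d_model = 768
# 4 full-size experts, 4 quarter-size experts, 1 identity expert = 9 experts total
widths = [1.0] * 4 + [0.25] * 4 + [0]
experts = nn.ModuleList(make_expert(d_model, w) for w in widths)
```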
@iamgrigorev
George Grigorev
4 years
@arkadiygershman Keychains)
0
0
5
@iamgrigorev
George Grigorev
1 year
what makes me, me: - compassion for technologies - architecture - making coffee - high tech food industry - vegan - exploring everywhere - deep thinker - finding new ways to old problems - high achiever - digital detox - love playful discomfort - yoga - vipassana - scooter riding in asia
@caffeinum
caffeinum.eth @ Toronto
1 year
what makes me, me: - Ukraine - software - lsd - alan watts - 10 hour streams of uninterrupted train frontcam footage - LED RGB lights - making coffee - "women power" songs - blockchain and tech - sofas are great - amateur rapping and singing - English language as a lifelong
0
0
10
1
0
5
@iamgrigorev
George Grigorev
5 months
@Teknium1 @haeggee In theory we should get a shorter seq_len in every other decoder layer, so with a 12.5% capacity_factor the overall attention complexity is 75% lower, but most of the complexity lies in the MLPs :(
0
0
5
@iamgrigorev
George Grigorev
9 years
@wylsacom Your own signature at the end)
0
0
4
@iamgrigorev
George Grigorev
5 months
I have made all the runs for my experiments with LLMs public:
1
0
5
@iamgrigorev
George Grigorev
1 year
@toyxyz3 this is so consistent, probably one of the most consistent so far
0
0
4
@iamgrigorev
George Grigorev
4 years
Tweet media one
1
0
5
@iamgrigorev
George Grigorev
4 months
Wanna build a usable RAG application using the Cmd-R+ model. Currently downloading the whole arXiv; then I'll process all documents, store vectors, and build a retriever. After that, use LLM providers to run the model with the retrieved documents and perform grounding (pipeline sketch below).
1
0
5
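Rough shape of that pipeline; the embedding model, FAISS index, and helper names are my assumptions, and the final Cmd-R+ grounded-generation call through a provider API is left out:

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model

def build_index(chunks):
    """Embed processed arXiv chunks and store them in a flat inner-product index."""
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])           # cosine similarity via inner product
    index.add(vecs)
    return index

def retrieve(query, index, chunks, k=5):
    """Return the top-k chunks for a query; these get passed to the LLM for grounding."""
    qvec = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(qvec, k)
    return [chunks[i] for i in ids[0]]
```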
@iamgrigorev
George Grigorev
4 months
Update on Mixture-of-Depths performance. Time to reach 10B tokens:
- With MoD: 47.3h
- Without MoD: 55.3h
Speed boost: 17%
As you can see on the plots, quality degrades on average compared to the baseline, although on the average of PIQA/ARC-Easy/SciQ there is no difference.
Tweet media one
3
0
4
@iamgrigorev
George Grigorev
7 months
@skalskip92 This can be huge for data labeling from videos
1
0
4
@iamgrigorev
George Grigorev
4 months
@juniorro16 Yep, it's not made for chat; it's just like a base version that does completion. You can prompt it GPT-3 style.
1
0
4
@iamgrigorev
George Grigorev
4 months
@m_elhoushi @AkshatS07 @bilgeacun @bwasti @Ahhegazy77 @BeidiChen @CarolejeanWu very clever idea to use self-speculative decoding! and great engineering effort to make it all work
0
0
4
@iamgrigorev
George Grigorev
4 months
Okay, converted and pushed the GGUF version to the HF hub here: And pushed that to Ollama: ollama run thepowerfuldeez/code_qwen:7b-base '<PRE> def compute_gcd(x, y): <SUF>return result <MID>'
2
1
4
@iamgrigorev
George Grigorev
2 months
When I develop a project locally & remotely I have 2 problems:
- For some reason I don't get the same GitHub key on the 2 machines, so when I commit I show up as 2 different people
- If I push from local and also work on the remote, I need to sync and override changes
Why hasn't this been fixed?
1
0
4
@iamgrigorev
George Grigorev
8 years
Tweet media one
0
0
3
@iamgrigorev
George Grigorev
5 months
Another relevant paper: using the LOMO optimizer, which is based on SGD. The authors say 1) SGD is enough for LLM pretraining because the loss surface doesn't have large curvature (a problem for SGD in general) and 2) a local optimum is good enough (this is fine-tuning related)
0
0
4
@iamgrigorev
George Grigorev
5 years
Tested out MEISAI app, looks cool
1
0
4
@iamgrigorev
George Grigorev
7 months
Since I'm unemployed now, I will start releasing some open-source work related to Multimodal LLMs!
0
0
4
@iamgrigorev
George Grigorev
2 months
DPO loss modified with an SFT objective on the chosen prompts to increase the probability of those responses (sketch below). Next, the newly proposed RPO (reward-aware preference optimization): it helps mitigate overfitting and undertraining on high-quality rejected responses; the difference is scaled by the reward difference. 300k data
1
0
3
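A hedged sketch of the first idea only, a DPO loss plus an SFT/NLL term on the chosen responses; the weights are illustrative and the reward-scaled RPO variant is not reproduced here:

```python
import torch
import torch.nn.functional as F

def dpo_plus_sft_loss(pi_chosen_logps, pi_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, sft_weight=1.0):
    """Inputs are summed per-sequence log-probs under the policy and the
    reference model. The extra NLL term directly pushes up p(chosen)."""
    logits = beta * ((pi_chosen_logps - ref_chosen_logps)
                     - (pi_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(logits).mean()       # standard DPO preference term
    sft = -pi_chosen_logps.mean()            # SFT term on the chosen responses
    return dpo + sft_weight * sft
```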
@iamgrigorev
George Grigorev
2 months
Here's my take on how Apple Intelligence works in iOS 18: the semantic index is used as a RAG provider for a small LLM with function-calling behavior. 1. Apple bought several companies that do semantic indexing, providing Siri with realtime data; now this is integrated across all personal
1
0
5
@iamgrigorev
George Grigorev
1 year
@Yampeleg the whole time I was reading this tweet I had a strong feeling that the text is generated
0
0
3
@iamgrigorev
George Grigorev
6 months
So the new small LLaVA-like model has trained for 45% of the total steps on 1 GPU (RTX 4090) and I am already seeing good progress in terms of MMMU score. For comparison: 1.8B: 27.9%, 7B: 33.1%, 34B: 44.7%
2
0
3
@iamgrigorev
George Grigorev
4 months
Stocks were crazy yesterday, glad I sold all of NVDA and SMCI 1 month ago.
Tweet media one
1
0
3
@iamgrigorev
George Grigorev
7 months
Yesterday I benchmarked LLaVA-1.6-Mistral: I got 31.6% on MMMU without beam search and 32.8% with num_beams=5. That's noticeably higher, but inference time & memory requirements are 5x higher too...
1
0
3
@iamgrigorev
George Grigorev
4 months
@reach_vb wow! really wanna test it locally instead of github copilot
1
0
3
@iamgrigorev
George Grigorev
2 months
@reach_vb Currently training a reproduction of the Apple summarization LoRA for a small LLM (using Qwen2-1.5B). Trained a couple of variations on my synthetic preference dataset; will update my profile with the dataset and models soon!
1
0
3
@iamgrigorev
George Grigorev
4 months
Okay, so I've tested the Twinny extension for VS Code with CodeQwen1.5, but this LLM doesn't support code completion :(
2
0
3
@iamgrigorev
George Grigorev
4 years
You really can almost stop buying some things and just keep growing them at home. Onions, for example. This is everything I'm sprouting at my place right now
2
0
3
@iamgrigorev
George Grigorev
4 months
@TheZachMueller great! I've put 2x 4090s into an ASUS ROG Maximus Z690 for $500, with a 1500W Corsair HX1500i and a Fractal Design Torrent case. Have you been able to run fp8 on your setup?
1
0
3
@iamgrigorev
George Grigorev
4 years
And soon I also want to build a hydroponic system for growing plants, like the one described here (type 3). I'll keep you posted on everything, ofc!
0
0
3
@iamgrigorev
George Grigorev
7 years
Got a Tinkoff Black card in dollars, met up with Seva, everything's great
Tweet media one
0
0
3
@iamgrigorev
George Grigorev
7 months
I thought the latest LLaVA repository lacks easy-to-start code, so I made a small one-file utility library with instructions on how to run the latest LLaVA-1.6 models! You can access it here:
1
0
3
@iamgrigorev
George Grigorev
5 months
Found this library, which could speed up training a lot. It uses Mixture-of-Experts and Mixture-of-Attention heads. Combined with the recent Mixture-of-Depths, I think I could get really good performance
0
0
3
@iamgrigorev
George Grigorev
6 months
@stablequan Nice! I could compare with this work soon.
0
0
3
@iamgrigorev
George Grigorev
1 year
- minecraft with mods - trying something new only because it's "new" - easily attached to other ppl - no headphones in the city - raves and hard electronic music - dominant dogs
0
0
3
@iamgrigorev
George Grigorev
7 months
It's very cool that I can still run 34B multimodal LLMs with num_beams=3 on a 24G RTX 4090 in 4-bit without loss of quality (loading sketch below). And I guess I could do fine-tuning with QLoRA and DeepSpeed stage 3 as well
0
0
3
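For reference, a sketch of 4-bit loading plus beam search with transformers/bitsandbytes; the model id is a placeholder, and a LLaVA-style checkpoint would use its own processor and loading path, but the quantization config is used the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model_id = "some-org/some-34b-model"        # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")

inputs = tok("Describe the image:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, num_beams=3, max_new_tokens=128)   # beam search
print(tok.decode(out[0], skip_special_tokens=True))
```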
@iamgrigorev
George Grigorev
6 months
The released dataset from LLaVA only has 665k entries. The ShareGPT4V paper has an enhanced subset. The authors also added DocVQA, ChartQA, DVQA, AI2D and laion/gpt4v-dataset but didn't share the processing code. This was a long day of wrangling data; increased the instruct dataset from 665k to 944k
1
0
3
@iamgrigorev
George Grigorev
2 months
Update on the summarization LoRA for Qwen2-1.5B: SFT works better than DPO (and consumes a lot less memory during training; I might actually be able to run 4096-8192 context on 2x 4090). I just took the 'chosen' key from the DPO dataset and re-ran the experiment. BERTScore 0.389 vs 0.269
0
0
3
@iamgrigorev
George Grigorev
6 months
There is a new library for extreme quantization in town! It allows 2-bit quantization with almost the same quality as 4-bit. It's quite easy to try, as all code and instructions are here, and the interface is the same as for HF models. I will try it
1
0
3
@iamgrigorev
George Grigorev
10 years
Was sipping tea in the English classroom today http://t.co/vFmUrdP9hp
0
0
0
@iamgrigorev
George Grigorev
8 years
Lol
Tweet media one
0
0
2
@iamgrigorev
George Grigorev
4 months
new era
Tweet media one
0
0
3
@iamgrigorev
George Grigorev
2 months
OK, I plan to re-implement Apple's work on making a small LLM perform various tasks, such as summarization. I heard people tried phi3-small and it performs poorly after fine-tuning. What about Qwen2?
0
0
3
@iamgrigorev
George Grigorev
2 months
@hiddentao So sorry this happened to you. Which area do you live in? If you don't want to specify, that's ok; just north or southeast would be fine
0
0
1
@iamgrigorev
George Grigorev
8 years
I was killing it
Tweet media one
0
0
1
@iamgrigorev
George Grigorev
4 years
I have a lot of thoughts piling up; I want to start sharing them on Twitter
0
0
2
@iamgrigorev
George Grigorev
9 years
0
0
2
@iamgrigorev
George Grigorev
4 years
Man, making spring rolls is really not easy. Who said it's a simple, quick snack)))
Tweet media one
0
0
2
@iamgrigorev
George Grigorev
4 months
@how_uhh Actually, when I tried to increase the lr I started getting NaNs. So this is still underexplored; however, with exp_sq and momentum I doubt it makes much difference
0
0
2
@iamgrigorev
George Grigorev
4 years
@silyutinaolga @Grah0x_ The ruler is to show it's 2 cm from the edge of the table; it looks nice to put it on the table to show etiquette norms and scale)
0
0
2
@iamgrigorev
George Grigorev
6 years
btw I'll cut up the stream later and there will be content for the channel after 2 (3?) years of inactivity ) So there will be 3 more announcements, to at least somewhat stir up my 3 subscribers )
1
0
2
@iamgrigorev
George Grigorev
3 years
Look, Oxxxymiron's line count grew with each new Versus battle (2013-2017)
Tweet media one
0
0
2
@iamgrigorev
George Grigorev
4 years
And all of this just to play Minecraft with RTX
1
0
2
@iamgrigorev
George Grigorev
10 years
@DendiBoss KICK FUNN1K TIME
0
0
2
@iamgrigorev
George Grigorev
9 years
My precious! C++, Python, C#, Android, Java, HTML/CSS
Tweet media one
0
1
1
@iamgrigorev
George Grigorev
3 months
How do you debug systems 2x faster when they involve distributed training, remote code, and working with GPUs?
3
0
2
@iamgrigorev
George Grigorev
6 months
@vikhyatk you should name your runs :D
0
0
2
@iamgrigorev
George Grigorev
10 months
@yaminfouzi @levelsio Looks great! Do you have a link on Airbnb?
0
0
2
@iamgrigorev
George Grigorev
4 years
@silyutinaolga Fresh ones
Tweet media one
0
0
2
@iamgrigorev
George Grigorev
6 months
Hugging Face has been down for almost an hour now... feels like all of AWS is down in the ML world...
0
0
2
@iamgrigorev
George Grigorev
8 years
Yandex Transport is an absolutely top-tier app. Caught the bus at the last second after spotting it on the map.
0
0
2
@iamgrigorev
George Grigorev
11 years
This is a win :) http://t.co/wl1vLoesEO
Tweet media one
1
0
2
@iamgrigorev
George Grigorev
4 months
@ram_chandalada looks great! To me it seems that plain BitLinear is a more promising approach than MoD+BitLinear!
0
0
1
@iamgrigorev
George Grigorev
4 years
Waiting for containers from IKEA, and then I'll go to the forest 5 minutes from home to collect worms from the fertile soil; I'm going to build a composter!
1
0
2
@iamgrigorev
George Grigorev
9 months
Smth is baking
Tweet media one
1
0
2
@iamgrigorev
George Grigorev
2 months
Siri doesn’t have world knowledge and reasoning abilities, because it’s still a small LLM trained to perform actions and respond to short queries. This will improve in the future
1
0
2
@iamgrigorev
George Grigorev
1 year
@rachelclif no, I usually express love in the most direct and fastest way possible, because otherwise I would feel fatigued by a long haul of undisclosed communication
0
0
2
@iamgrigorev
George Grigorev
7 months
@vikhyatk but ReLU dies out quickly, no? More research needed. IMO a fused GELU implementation works as fast as ReLU. I generally think the next big thing is finding a way to use any function in each layer through NAS
1
0
2
@iamgrigorev
George Grigorev
7 months
Also, I plan to implement a LLaVA-MoE approach, as I've seen one work in this domain. I believe an approach with many small image agents could achieve the performance of larger image LLMs. Here's the paper: Mixture-of-Experts for Large Vision-Language Models
0
0
2
@iamgrigorev
George Grigorev
4 months
Ollama is so smooth, so easy to set up to start running local LLMs. It's like git v3 (v2 is Hugging Face, haha). I really like the Modelfile concept and how easily GGUF models get integrated
0
0
2
@iamgrigorev
George Grigorev
10 years
And VK is a dumb, laggy piece of crap, damn it. A curse on you, mail.ru!!!
0
1
2
@iamgrigorev
George Grigorev
10 years
Happy Victory Day, guys!
0
1
2
@iamgrigorev
George Grigorev
4 months
I will now test a 50% capacity factor. Probably for my small LM this reduction of context actually makes a noticeable difference
0
0
2
@iamgrigorev
George Grigorev
11 years
I recommend the film "Now You See Me" to everyone
0
0
2
@iamgrigorev
George Grigorev
2 months
Basically, Apple lets you add agentic behavior to your apps by adding new App Intents and taking actions. It's a very smart choice, because it's not the LLM that decides which actions to take, but the developer! Super excited about the new Siri!
0
0
2
@iamgrigorev
George Grigorev
12 years
Good night, and a The Noise morning c: http://t.co/Risq2T5D
0
1
2
@iamgrigorev
George Grigorev
8 years
Buying a microwave, a fridge, and a router in one day - Mission Accomplished
0
0
2
@iamgrigorev
George Grigorev
8 years
@just_NS @4funProd @v1lat Everything there is implemented great. It's professional software. Maybe you're just not a professional?) For amateurs there's Vegas
0
0
2