kourosh hakhamaneshi Profile
kourosh hakhamaneshi

@CyrusHakha

Followers
889
Following
2K
Statuses
707

ML engineer @anyscalecompute 💻 prev PhD, EECS, @UCBerkeley 👨‍🎓

California, USA
Joined September 2010
@CyrusHakha
kourosh hakhamaneshi
2 years
🚀 Exploring Llama-2’s Quality: Can we replace generalist GPT-4 endpoints with specialized OSS models? Dive deep into our technical blog post to understand the nuances and insights of fine-tuning OSS models. 🔗 🧵 Thread 1/N👇
16
117
528
@CyrusHakha
kourosh hakhamaneshi
3 days
RT @askalphaxiv: We used Gemini 2 Flash to build Cursor for arXiv papers. Highlight any section of a paper to ask questions and “@” other p…
0
169
0
@CyrusHakha
kourosh hakhamaneshi
3 days
Cursor basically taught Microsoft the true potential of their original Copilot concept. The evolution of Copilot before and after the emergence of Cursor is like night and day.
@ashtom
Thomas Dohmke
4 days
Today, we are infusing the power of agentic AI into the GitHub Copilot experience, elevating Copilot from pair to peer programmer 🤖 (1/4)
0
0
5
@CyrusHakha
kourosh hakhamaneshi
4 days
RT @robertnishihara: Join our @raydistributed meetup next Thursday at the @BytedanceTalk Bay Area headquarters along with @nvidia. We'll be…
0
5
0
@CyrusHakha
kourosh hakhamaneshi
10 days
We are going global :-)
@anyscalecompute
Anyscale
11 days
Anyscale is expanding to India! We're opening our first international office. Come work with us to get this office off the ground (DM @jaikumarharikoa).
0
0
1
@CyrusHakha
kourosh hakhamaneshi
10 days
RT @robertnishihara: We're expanding to India and building a small elite team. If you want to be part of the founding team here, DM me.
0
6
0
@CyrusHakha
kourosh hakhamaneshi
13 days
Some of my earlier attempts on llama-1B-instruct did not show similar behaviors, i.e., they didn’t improve eval metrics and there was no outstanding emergent behavior that I noticed. Many ablations are needed to understand the root cause of these behaviors, and the open-source community is already investigating the impact of these design choices on capability and generalization: the particular choice of RL algorithm, the size of the initial model, whether it should be instruct-tuned or not, the mixture of prompts used during RL, reward engineering, etc. (a rough ablation-grid sketch is below). It’s fascinating to see the power of open source once again.
0
0
0
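For illustration, a rough sketch of the ablation grid these open questions imply; the axis names and candidate values below are placeholders, not from any specific codebase or the experiments mentioned above.

```python
from itertools import product

# Hypothetical ablation axes mirroring the design choices listed above;
# the concrete values are illustrative placeholders.
ablation_axes = {
    "rl_algorithm": ["ppo", "grpo", "reinforce"],
    "initial_model_size": ["1B", "8B", "70B"],
    "init_checkpoint": ["base", "instruct"],
    "prompt_mixture": ["math_only", "math_and_code", "mixed_general"],
    "reward": ["exact_match", "verifier_based", "length_penalized"],
}

# Enumerate every combination to see how quickly the grid grows.
runs = [dict(zip(ablation_axes, combo)) for combo in product(*ablation_axes.values())]
print(f"{len(runs)} runs in a full grid")  # 162 for the placeholder values above
print(runs[0])
```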
@CyrusHakha
kourosh hakhamaneshi
13 days
RT @Alibaba_Qwen: The burst of DeepSeek V3 has attracted attention from the whole AI community to large-scale MoE models. Concurrently, we…
0
2K
0
@CyrusHakha
kourosh hakhamaneshi
13 days
RT @vllm_project: 🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade w…
0
98
0
@CyrusHakha
kourosh hakhamaneshi
14 days
I still cannot wrap my head around the fact that pure RL can cause emergent behaviors like self-reflection with the use of words such as "hmm, wait..." and "umm". There must be a better explanation in the prior of the base model that RL is applied to (a rough counting sketch is below).
@johnschulman2
John Schulman
16 days
There are some intriguing similarities between the r1 chains of thought and the o1-preview CoTs shared in papers and blog posts (eg . In particular, note the heavy use of the words "wait" and "alternatively" as transition words for error correction and double-checking.
0
0
0
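As a hypothetical illustration of the "wait" / "alternatively" observation in the quoted tweet, one could simply count those transition words across sampled chains of thought; `sampled_cots` below is a stand-in for one's own generations.

```python
import re
from collections import Counter

# Stand-in for chain-of-thought samples generated from the model under study.
sampled_cots = [
    "Let me compute 17 * 23. Hmm, wait, I should double-check the carry digit...",
    "So the answer is 4. Alternatively, substitute back into the equation to verify...",
]

# Markers called out in the quoted tweet, plus the "hmm"/"umm" variants above.
reflection_markers = ["wait", "alternatively", "hmm", "umm"]

counts = Counter()
for cot in sampled_cots:
    tokens = re.findall(r"[a-z']+", cot.lower())
    for marker in reflection_markers:
        counts[marker] += tokens.count(marker)

# Average occurrences of each marker per sampled trace.
per_trace = {marker: counts[marker] / len(sampled_cots) for marker in reflection_markers}
print(per_trace)
```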
@CyrusHakha
kourosh hakhamaneshi
18 days
Reproduction of key ideas for reducing overthinking in reasoning models. The key enablers: a contrastive preference-tuning algorithm like SimPO, a small amount of data (10k samples), and a fairly simple pair-construction trick (rough sketch below).
@NovaSkyAI
NovaSky
18 days
1/5 ⚡️Presenting Sky-T1-32B-Flash⚡️, our open reasoning model that tackles "overthinking" to cut generation lengths (and inference cost!) by 50% without sacrificing accuracy – tuned with only $275! 📊Blog: 🏋️‍♀️Weights:
0
0
5
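A minimal sketch of the recipe described above, under two assumptions rather than the exact NovaSky implementation: (1) the loss is a SimPO-style length-normalized margin on policy log-probs (no reference model), and (2) the pair-construction trick prefers a shorter correct solution over a longer correct one for the same prompt. The hyperparameters and helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen_tokens, logp_rejected_tokens, beta=2.0, gamma=0.5):
    """SimPO-style loss for one preference pair.

    Inputs are 1-D tensors of per-token log-probs of each response under the
    policy being tuned; averaging gives the length-normalized sequence score.
    beta/gamma here are illustrative, not tuned values.
    """
    margin = beta * (logp_chosen_tokens.mean() - logp_rejected_tokens.mean()) - gamma
    return -F.logsigmoid(margin)

def build_pair(solutions):
    """Assumed pair construction: shortest correct solution vs. longest correct one.

    `solutions` is a list of dicts like {"text": str, "is_correct": bool} for a
    single prompt; assumes at least two correct solutions exist.
    """
    correct = sorted((s for s in solutions if s["is_correct"]), key=lambda s: len(s["text"]))
    return correct[0], correct[-1]  # (chosen, rejected)

# Toy usage with made-up per-token log-probs.
chosen = torch.tensor([-0.2, -0.3, -0.1])
rejected = torch.tensor([-0.4, -0.5, -0.6, -0.7, -0.8])
print(simpo_loss(chosen, rejected))
```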
@CyrusHakha
kourosh hakhamaneshi
18 days
RT @NovaSkyAI: 1/5 ⚡️Presenting Sky-T1-32B-Flash⚡️, our open reasoning model that tackles "overthinking" to cut generation lengths (and inf…
0
18
0
@CyrusHakha
kourosh hakhamaneshi
29 days
The big takeaway from this work is that if you have high-quality tree-of-thought traces, simple SFT can lift reasoning capabilities very effectively. These traces do not just include happy paths that jump straight to the chain of thought leading to the answer; they also include self-reflection, backtracking, etc., all in-context, effectively teaching the model how to course-correct if it makes a mistake (a formatting sketch is below). The next step in open-source research is how to generate these tree-of-thought traces independently of a teacher model (QwQ in this case). This is the step that requires the RL-with-search combo. Looking forward to working with the @NovaSkyAI team on answering some of these questions.
@NovaSkyAI
NovaSky
1 month
1/6 🚀 Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning and coding benchmarks — trained under $450! 📊Blog: 🏋️‍♀️Model weights:
0
0
4
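As a hypothetical illustration of what such a trace looks like when packed into plain SFT data: the reasoning keeps the wrong turn and the correction in-context instead of only the happy path. The field names, think-tag format, and system prompt are assumptions, not the NovaSky pipeline.

```python
# Stand-in for one trace distilled from a teacher model (e.g. QwQ), including a
# deliberate mistake followed by self-reflection and backtracking.
trace = {
    "question": "What is 23 * 47?",
    "reasoning": (
        "23 * 47 = 23 * 40 + 23 * 7 = 920 + 161 = 1071. "
        "Wait, let me re-add that: 920 + 161 = 1081, so I slipped on the addition. "
        "Backtracking and using the corrected sum, the product is 1081."
    ),
    "answer": "1081",
}

def to_sft_example(trace, system="You are a careful reasoner."):
    """Pack a reflective trace into a chat-style SFT example (format illustrative)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": trace["question"]},
            {"role": "assistant",
             "content": f"<think>{trace['reasoning']}</think>\n{trace['answer']}"},
        ]
    }

print(to_sft_example(trace))
```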
@CyrusHakha
kourosh hakhamaneshi
1 month
RT @NovaSkyAI: 1/6 🚀 Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning an…
0
251
0
@CyrusHakha
kourosh hakhamaneshi
1 month
In my opinion, figuring out how to scale the process reward model and how to augment it with human experts carries the most weight in figuring out o1's RL training / inference (a rough scoring sketch is below).
@natolambert
Nathan Lambert
1 month
There's a lot of confusion about o1's RL training and the emergence of RL as a popular post-training loss function. Yes, these are the same loss functions and similar data. BUT, the amount of compute used for o1's RL training is much more in line with pretraining. The words we use to describe training are strained already, but o1 may be better viewed as next-token pretraining, RL pretraining, and then some normal post-training.
0
0
1
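For illustration only, a sketch of one common way a process reward model is used at inference time: score every intermediate step of each candidate solution and keep the candidate whose weakest step scores best. `prm_score_step` is a placeholder for whatever step-scoring model one has; none of this is claimed to be o1's actual mechanism.

```python
from typing import Callable, List

# A step scorer maps (previous steps, next step) -> probability the step is sound.
StepScorer = Callable[[List[str], str], float]

def score_solution(steps: List[str], prm_score_step: StepScorer) -> float:
    """Aggregate per-step PRM scores; the min over steps is one common choice
    (product or mean are also used in the literature)."""
    scores = [prm_score_step(steps[:i], step) for i, step in enumerate(steps)]
    return min(scores) if scores else 0.0

def best_of_n(candidates: List[List[str]], prm_score_step: StepScorer) -> List[str]:
    """Pick the candidate whose weakest step looks best to the PRM."""
    return max(candidates, key=lambda steps: score_solution(steps, prm_score_step))

# Toy usage with a dummy scorer that happens to prefer shorter steps.
dummy_scorer: StepScorer = lambda prefix, step: 1.0 / (1 + len(step))
print(best_of_n([["step a", "step bb"], ["one long step"]], dummy_scorer))
```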
@CyrusHakha
kourosh hakhamaneshi
2 months
@Yuchenj_UW This is absolutely insane!!
0
0
2
@CyrusHakha
kourosh hakhamaneshi
2 months
Love nice, practical, noiseless work.
@jeremyphoward
Jeremy Howard
2 months
I'll get straight to the point. We trained 2 new models. Like BERT, but modern. ModernBERT. Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff. It's much faster, more accurate, longer context, and more useful. 🧵
0
0
3
@CyrusHakha
kourosh hakhamaneshi
3 months
@AravSrinivas I think I am personally responsible for three of those 😅😅
0
0
2
@CyrusHakha
kourosh hakhamaneshi
4 months
@ajayj_ Congrats @ajayj_
0
0
2
@CyrusHakha
kourosh hakhamaneshi
4 months
Effectively it’s free now 🙃🙃
@OfficialLoganK
Logan Kilpatrick
4 months
Say hello to Gemini 1.5 Flash-8B ⚡️, now available for production usage with:
- 50% lower price (vs 1.5 Flash)
- 2x higher rate limits (vs 1.5 Flash)
- lower latency on small prompts (vs 1.5 Flash)
0
0
3