![kourosh hakhamaneshi Profile](https://pbs.twimg.com/profile_images/1591885372664913920/uSAe3nOG_x96.jpg)
kourosh hakhamaneshi
@CyrusHakha
Followers
889
Following
2K
Statuses
707
ML engineer @anyscalecompute 💻 prev PhD, EECS, @UCBerkeley 👨🎓
California, USA
Joined September 2010
RT @askalphaxiv: We used Gemini 2 Flash to build Cursor for arXiv papers. Highlight any section of a paper to ask questions and “@” other p…
Cursor basically taught Microsoft the true potential of their original Copilot concept. Copilot before and after the emergence of Cursor is like night and day.
Today, we are infusing the power of agentic AI into the GitHub Copilot experience, elevating Copilot from pair to peer programmer 🤖 (1/4)
RT @robertnishihara: Join our @raydistributed meetup next Thursday at the @BytedanceTalk Bay Area headquarters along with @nvidia. We'll be…
We are going global :-)
Anyscale is expanding to India! We're opening our first international office. Come work with us to get this office off the ground (DM @jaikumarharikoa).
RT @robertnishihara: We're expanding to India and building a small elite team. If you want to be part of the founding team here, DM me.
Some of my earlier attempts on llama-1B-instruct did not show similar behaviors, i.e., they didn’t improve eval metrics, and there was no outstanding emergent behavior that I noticed. Many ablations are needed to understand the root cause of these behaviors, and the open-source community is already investigating the impact of these design choices on capability and generalization: the particular choice of RL algorithm, the size of the initial model, whether it should be instruct-tuned or not, the mixture of prompts used during RL, reward engineering, etc. It’s fascinating to see the power of open source once again.
RT @Alibaba_Qwen: The burst of DeepSeek V3 has attracted attention from the whole AI community to large-scale MoE models. Concurrently, we…
RT @vllm_project: 🚀 With the v0.7.0 release today, we are excited to announce the alpha release of vLLM V1: A major architectural upgrade w…
I still cannot wrap my head around the fact that pure RL can cause emergent behaviors like self-reflection, with words such as "hmm, wait..." and "umm". There must be a better explanation in the prior of the base model that RL is applied to.
There are some intriguing similarities between the r1 chains of thought and the o1-preview CoTs shared in papers and blog posts. In particular, note the heavy use of the words "wait" and "alternatively" as transition words for error correction and double-checking.
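The surface statistic being pointed at here is easy to check yourself. Below is a small, hypothetical helper that counts reflection "transition words" in a chain-of-thought trace; the marker list and the toy trace are illustrative assumptions, not drawn from r1 or o1 outputs.

```python
import re
from collections import Counter

# Illustrative set of self-reflection / error-correction markers.
MARKERS = ["wait", "hmm", "alternatively", "let me double-check"]

def count_reflection_markers(cot: str) -> Counter:
    """Count occurrences of each reflection marker in a CoT trace."""
    text = cot.lower()
    return Counter(
        {m: len(re.findall(r"\b" + re.escape(m) + r"\b", text)) for m in MARKERS}
    )

# Toy trace with the kind of backtracking language described above.
trace = (
    "So x = 12. Wait, I dropped a factor of 2. "
    "Hmm, let me redo this. Alternatively, substitute y first. "
    "Wait, that gives the same result."
)
counts = count_reflection_markers(trace)
```

Running this over real r1 vs. o1-preview traces (rather than the toy string) is the kind of quick comparison the tweet is gesturing at.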
Reproduction of key ideas for reducing overthinking in reasoning models. The key enablers are a contrastive preference-tuning algorithm like SimPO, a small amount of data (10k samples), and a fairly simple pair-construction trick.
1/5 ⚡️Presenting Sky-T1-32B-Flash⚡️, our open reasoning model that tackles "overthinking" to cut generation lengths (and inference cost!) by 50% without sacrificing accuracy – tuned with only $275! 📊Blog: 🏋️♀️Weights:
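For context, SimPO is a reference-free preference loss on length-normalized log-probabilities. Here is a minimal numpy sketch, assuming per-token log-probs for a chosen (shorter, correct) and a rejected (overthinking) response are already computed; the `beta`/`gamma` values and the toy pair are illustrative, not the values used for Sky-T1-32B-Flash.

```python
import numpy as np

def simpo_loss(logp_chosen, logp_rejected, beta=2.0, gamma=0.5):
    """SimPO: sigmoid loss on the length-normalized log-prob margin."""
    lp_w = np.mean(logp_chosen)    # avg per-token log-prob, chosen
    lp_l = np.mean(logp_rejected)  # avg per-token log-prob, rejected
    margin = beta * (lp_w - lp_l) - gamma
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Toy pair: a short confident answer vs. a longer, lower-confidence one,
# mimicking a "concise vs. overthinking" preference pair.
short_ans = np.array([-0.2, -0.3, -0.25])
long_ans = np.array([-0.9, -1.1, -0.8, -1.0, -0.95])
loss = simpo_loss(short_ans, long_ans)
```

The length normalization (averaging per-token log-probs) is what makes this style of loss a natural fit for penalizing needlessly long generations.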
RT @NovaSkyAI: 1/5 ⚡️Presenting Sky-T1-32B-Flash⚡️, our open reasoning model that tackles "overthinking" to cut generation lengths (and inf…
The big takeaway from this work is that if you have high-quality tree-of-thought traces, simple SFT can lift reasoning capabilities very effectively. These traces don’t just include happy paths that jump straight to the chain of thought leading to the answer; they also include self-reflection, backtracking, etc., all in-context, effectively teaching the model how to course-correct if it makes a mistake. The next step in open-source research is generating these tree-of-thought traces independently of a teacher model (QwQ in this case). This is the step that requires the RL-with-search combo. Looking forward to working with the @NovaSkyAI team on answering some of these questions.
1/6 🚀 Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning and coding benchmarks — trained under $450! 📊Blog: 🏋️♀️Model weights:
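The SFT recipe described above amounts to flattening a trace, failed branches and backtracks included, into one training target. The sketch below is a hypothetical illustration of that idea; the format markers and the toy algebra trace are assumptions, not the Sky-T1 data format.

```python
def linearize_trace(question, steps, answer):
    """Join reasoning steps (happy path + corrections) into one SFT target."""
    body = "\n".join(steps)
    return f"Question: {question}\n{body}\nFinal answer: {answer}"

# A trace that includes a wrong branch, self-reflection, and a backtrack,
# so SFT sees course-correction in-context, not just the final path.
steps = [
    "Try factoring: x^2 - 5x + 6 = (x - 1)(x - 6)?",
    "Wait, (x - 1)(x - 6) = x^2 - 7x + 6, that's wrong.",
    "Backtrack: try (x - 2)(x - 3) = x^2 - 5x + 6. Correct.",
]
example = linearize_trace("Factor x^2 - 5x + 6.", steps, "(x - 2)(x - 3)")
```

Keeping the wrong branch in the target string is the point: the model is trained on the recovery, not just the answer.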
RT @NovaSkyAI: 1/6 🚀 Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning an…
In my opinion, figuring out how to scale the process reward model, and how to augment it with human experts, carries the most weight in figuring out o1's RL training / inference.
There's a lot of confusion about o1's RL training and the emergence of RL as a popular post-training loss function. Yes, these are the same loss functions and similar data. BUT, the amount of compute used for o1's RL training is much more in line with pretraining. The words we use to describe training are strained already, but o1 may be better viewed as next-token pretraining, rl pretraining, and then some normal post-training.
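To make the process-reward-model idea concrete: at inference time a PRM can rank sampled reasoning chains by scoring each step. This is a hedged sketch under assumptions: the per-step scores are stubbed in rather than produced by a real model, and aggregating by the minimum step score is one common convention, not necessarily o1's.

```python
def chain_score(step_scores):
    """Aggregate per-step PRM scores; a chain is as good as its worst step."""
    return min(step_scores)

def best_of_n(chains_with_scores):
    """Pick the candidate chain with the highest aggregated PRM score."""
    return max(chains_with_scores, key=lambda cs: chain_score(cs[1]))

# Stubbed PRM scores in [0, 1] for three sampled reasoning chains.
candidates = [
    ("chain A", [0.9, 0.8, 0.4]),   # one weak step
    ("chain B", [0.7, 0.7, 0.7]),   # uniformly solid
    ("chain C", [0.95, 0.2, 0.9]),  # early mistake
]
best = best_of_n(candidates)
```

The min-aggregation choice is what lets a single bad step sink an otherwise strong-looking chain, which is exactly where step-level (process) rewards differ from outcome-only rewards.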
Love nice, practical, noiseless work.
I'll get straight to the point. We trained 2 new models. Like BERT, but modern. ModernBERT. Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff. It's much faster, more accurate, longer context, and more useful. 🧵