Vinija Jain Profile
Vinija Jain

@VinijaJain

Followers
415
Following
182
Statuses
106

ML at Meta • I write about Machine Learning, NLP, and Recommender Systems • Stanford AI • ex Amazon, Oracle, PANW

Cupertino
Joined July 2023
@VinijaJain
Vinija Jain
8 months
Thank you @Analyticsindiam for the feature!
0
1
5
@VinijaJain
Vinija Jain
23 minutes
🧐 Demystifying Long Chain-of-Thought (CoT) Reasoning in LLMs
Authors: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue 🔗

🔹 This paper caught my attention, especially after all the buzz around DeepSeek AI's R1. It takes a deep dive into how large language models (LLMs) develop long Chain-of-Thought (CoT) reasoning—the kind of structured problem-solving that allows for backtracking, self-correction, and exploring multiple solutions. The authors systematically analyze what drives models to generate longer, more sophisticated reasoning chains and provide key insights into training strategies.

🔹 One of the biggest takeaways is that Supervised Fine-Tuning (SFT) on long CoT data significantly improves RL training. While not strictly necessary, it makes reinforcement learning (RL) much more effective, allowing models to scale their reasoning abilities. Short CoTs tend to plateau early, while long CoTs continue improving with more data.

🔹 However, simply applying RL to extend CoT length doesn't always work. The authors show that RL training for long CoT is often unstable, with models either failing to extend reasoning length or artificially inflating responses without meaningful reasoning. To address this, they introduce a Cosine Length-Scaling Reward, which stabilizes length growth while preserving reasoning quality. They also find that reward hacking is a major issue: models start repeating phrases to maximize length-based rewards rather than improving their reasoning. This is mitigated using an n-gram repetition penalty, ensuring that added steps actually contribute to problem-solving (a rough sketch of this reward shaping follows below).

🔹 Another key challenge is the need for high-quality verifiable reward signals. Traditional RL rewards rely on ground-truth answers, but such data is scarce. The authors explore using noisy, web-extracted ("silver") supervision data and find that, with proper filtering, it significantly improves out-of-distribution generalization.

🔹 One of the most interesting findings is that long CoT reasoning isn't entirely emergent—many core abilities, like error correction and branching, are already present in base models. RL doesn't create these from scratch but guides the model toward using them more effectively.

Written in collab with @i_amanchadha
0
0
0
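To make the reward shaping concrete, here is a minimal Python sketch of a cosine length-scaled reward combined with an n-gram repetition penalty. It is not code from the paper: the reward endpoints, length cap, n-gram size, and penalty weight below are illustrative assumptions.

```python
import math

# Illustrative constants (assumptions, not the paper's values).
R_CORRECT_SHORT = 2.0   # correct answer, very short CoT
R_CORRECT_LONG = 1.0    # correct answer at the length cap
R_WRONG_SHORT = -10.0   # wrong answer, very short CoT
R_WRONG_LONG = 0.0      # wrong answer near the cap: softer penalty, keep exploring
R_EXCEED = -10.0        # hard penalty for blowing past the cap
MAX_LEN = 4096          # length cap in tokens
REP_WEIGHT = 1.0        # weight of the repetition penalty


def cosine_interp(length: int, max_len: int, r_start: float, r_end: float) -> float:
    """Move smoothly from r_start (length 0) to r_end (length max_len) on a cosine curve."""
    t = min(length, max_len) / max_len
    return r_end + 0.5 * (r_start - r_end) * (1.0 + math.cos(math.pi * t))


def ngram_repetition(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that are repeats; 0.0 means no repetition."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def cot_reward(tokens: list[str], is_correct: bool) -> float:
    """Length-aware reward: correct answers earn more when concise, wrong answers
    are penalized less when the model keeps reasoning, and repeated n-grams are
    penalized so padding cannot buy length-based reward."""
    length = len(tokens)
    if length > MAX_LEN:
        return R_EXCEED
    if is_correct:
        base = cosine_interp(length, MAX_LEN, R_CORRECT_SHORT, R_CORRECT_LONG)
    else:
        base = cosine_interp(length, MAX_LEN, R_WRONG_SHORT, R_WRONG_LONG)
    return base - REP_WEIGHT * ngram_repetition(tokens)
```

The cosine schedule keeps the reward changing smoothly with length instead of jumping at a threshold, and the repetition term removes the incentive to pad the chain with recycled phrases.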
@VinijaJain
Vinija Jain
1 day
🐋 @deepseek_ai 's R1 Primer - Version 2 🔬

Over the past few days, @i_amanchadha and I have taken our DeepSeek primer ( and bolstered it with even more detail. It covers the architectural foundations behind R1, including:
- Mixture of Experts (MoE) framework for efficient parameter usage
- Multi-head latent attention mechanisms
- Advanced quantization techniques
- The entire training pipeline, from pre-training up to reasoning
- Details on GRPO and how R1 leveraged it (a simplified sketch follows below)
- Reasoning datasets currently available
- Multi-token prediction for enhanced performance
and more!

Tell us what you think and what you'd like to see next!
0
2
11
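Since the primer highlights GRPO, here is a simplified Python sketch of the group-relative advantage that gives the method its name. This is not code from the primer; it omits the clipped policy-gradient and KL-regularization terms of the full objective, and the function and variable names are my own.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: sample a group of completions for one prompt,
    score each, and normalize by the group's own mean and std instead of
    training a separate value network as a baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: four completions for the same prompt, scored 1.0 if a verifier
# accepts the final answer and 0.0 otherwise.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
# correct completions get positive advantage, incorrect ones negative
```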
@VinijaJain
Vinija Jain
3 days
RT @VinijaJain: @karpathy You know it’s going to be a good day when @karpathy releases a new video. They’re goldmines, all of them! 😍
0
2
0
@VinijaJain
Vinija Jain
3 days
@karpathy You know it’s going to be a good day when @karpathy releases a new video. They’re goldmines, all of them! 😍
1
2
10
@VinijaJain
Vinija Jain
6 days
@snsf The applications for Deep Research seem endless—medicine, education, tech, and more. Can't wait to try it out!
0
0
2