![Rosinality Profile](https://pbs.twimg.com/profile_images/1322166709608804352/HbzdNpgn_x96.jpg)
Rosinality
@rosinality
Followers 2K · Following 21K · Statuses 32K
Very interesting analysis. It may be hard to find thought-process logs from individuals in web text. However, the process of many people collaborating and communicating can itself become a log of collective thought processes, and models might be able to learn ways of thinking from it.
Takeaway 9: Commonly used pre-training data contain content that shares properties similar to long CoT (e.g., branching and error validation). The base model might already acquire such skills during pre-training. We use MinHash to search over OpenWebMath and Perplexity to search over the internet. We found many examples containing such patterns. Here is one example (
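A rough sketch of how such a MinHash-based search over a corpus like OpenWebMath might work; this is an illustrative reconstruction rather than the authors' actual pipeline, and the corpus contents, shingle size, and similarity threshold are all assumptions:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations; more permutations = better Jaccard estimates

def minhash_of(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over word 5-gram shingles of a document."""
    words = text.lower().split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - ngram + 1, 1)):
        shingle = " ".join(words[i : i + ngram])
        m.update(shingle.encode("utf8"))
    return m

# Index corpus documents into an LSH structure so near-duplicate lookup
# is sublinear instead of all-pairs comparison.
lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)  # threshold = approx. Jaccard cutoff (assumed)
corpus = {"doc0": "wait that cannot be right let me check the previous step again ..."}
for doc_id, text in corpus.items():
    lsh.insert(doc_id, minhash_of(text))

# Query with a snippet exhibiting a long-CoT-like pattern (self-correction,
# branching) and retrieve corpus documents with similar shingle sets.
query = "wait, that can't be right, let me check the previous step again"
print(lsh.query(minhash_of(query)))
```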
@TianzheC Thank you for the great research! Looking at Fig. 10, is it possible to say that most of the improvement on OOD came from iterative revision and verification? (Of course, improvement on ID without affecting OOD performance is itself a nice thing to have.)
RT @aichupanda: People who think the 52-hour workweek is why Korea is falling behind the US in the AI race are seriously underestimating engineering.. it's just not that simple.
@giffmana Since these are FP8 and MoE FLOPs, the MFU will be lower than a BF16 baseline. I also suspect that fine-grained quantization and other modifications further reduce efficiency. I guess DeepSeek adopted FP8 primarily to reduce communication volume.
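To make the MFU point concrete, here is a back-of-the-envelope sketch; the peak-throughput figures are illustrative assumptions for an H800-class GPU, not measured values:

```python
# Model FLOPs Utilization (MFU) = achieved model FLOP/s / peak hardware FLOP/s.
# Peak FP8 throughput is roughly 2x peak BF16 on recent NVIDIA GPUs, so the
# same achieved training throughput scores half the MFU when counted against
# the FP8 peak. All numbers below are illustrative, not measured.

PEAK_BF16_TFLOPS = 990    # assumed dense BF16 peak, H800-class GPU
PEAK_FP8_TFLOPS = 1980    # assumed dense FP8 peak (~2x BF16)

achieved_tflops = 400     # hypothetical achieved model FLOP/s per GPU

mfu_bf16 = achieved_tflops / PEAK_BF16_TFLOPS
mfu_fp8 = achieved_tflops / PEAK_FP8_TFLOPS

print(f"MFU vs BF16 peak: {mfu_bf16:.1%}")  # ~40.4%
print(f"MFU vs FP8 peak:  {mfu_fp8:.1%}")   # ~20.2% -- same work, half the MFU
```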
@nameEO That's why it's said that even when Query/Key embeddings and RoPE vectors of the same size are added together, the model actually learns to separate the dimensions that do semantic modeling from the dimensions that do positional modeling.
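A minimal sketch of rotary position embeddings (RoPE) to illustrate the per-dimension structure being discussed; shapes and the frequency base follow the common convention, and the claim about learned semantic/positional separation is the tweet's observation, not something this code demonstrates:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to one query/key vector x of even dim d.

    Dimensions are grouped into d/2 pairs; pair i is rotated by angle
    pos * base**(-2i/d). Low-i pairs rotate fast (fine-grained position),
    high-i pairs rotate slowly (almost position-independent), which is one
    reason different dimensions can specialize toward position vs. semantics.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    i = np.arange(d // 2)
    theta = pos * base ** (-2.0 * i / d)  # rotation angle for each pair
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]             # paired coordinates
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin       # 2D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.random.randn(64), np.random.randn(64)
# Dot products of rotated vectors depend only on relative position:
# rope(q, 5) @ rope(k, 3) == rope(q, 2) @ rope(k, 0) up to float error.
print(rope(q, 5) @ rope(k, 3), rope(q, 2) @ rope(k, 0))
```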
RT @SKR_Economist: I originally studied causal inference before drifting into macroeconomics, so I keep straining not to lose the thread by periodically reading the hot causal-inference papers. Watching the few brilliant people who rack their brains trying to prove even a handful of causal relationships, and then on my timeline…