
Vincent
@vvvincent_c
Followers 488 · Following 5K · Media 91 · Statuses 1K
research @METR_Evals undergrad @Cornell | prev @veritasium @atlasfellow
Joined April 2019
in cursor, when i right click + copy an image, it doesn't save to my clipboard. i have to spam it 2-3 times before it's actually saved. how...
(i do not condone reckless driving, this is very dangerous, and don't do this)
f1's version of building in public https://t.co/TREb309TEK
absolutely insane end to cji 2... some commentary:
- day 1 was pretty disappointing. i don't want to be a hater, but submissions make the event. most of the matches were unsuccessful leg entanglements, guard work, and slow wrestling. the female matches were great tho.
- the new
wow 73% of cornell cs students don't believe GPT-5 has "college level language understanding"
what's the difference between teaching a foreign language at a college vs a high school?
is this ever going to be fixed https://t.co/nCfDq6GhSf
read the actual post for even more details! https://t.co/zTNDc7mN5r
7. A chunk of GPT-5 failures might be due to "spurious" bugs, which, when patched, may increase the time horizon even further.
6. 18/789 (2.3%) runs included reward hacks. It's not a huge amount, but if these data points had been used, the median would've been 3 hours (30% higher) compared to 2 hours and 17 mins.
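A quick back-of-the-envelope check of the figures in the tweet above (the run counts and medians come from the tweet itself, not from METR's data or analysis code):

```python
from datetime import timedelta

# Figures quoted in the tweet above (assumptions, not pulled from METR's dataset).
total_runs = 789
reward_hack_runs = 18

median_excluding_hacks = timedelta(hours=2, minutes=17)  # reward-hacked runs excluded
median_including_hacks = timedelta(hours=3)              # reward-hacked runs kept in

hack_rate = reward_hack_runs / total_runs
uplift = median_including_hacks / median_excluding_hacks - 1

print(f"reward-hack rate: {hack_rate:.1%}")             # ~2.3%
print(f"median uplift if hacks counted: {uplift:.0%}")  # ~31%, the "30% higher" above
```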
I would've thought an 8-hour time horizon (almost a whole workday!) might already be a 10x boost, but other factors, such as higher context and code quality standards, mean automating R&D won't be that easy.
5. A 10x uplift could require a week-long (40-hour) time horizon.
This may just be the tip of the iceberg. Maybe METR simply didn't have access to many wackier reasoning traces.
4. Reasoning traces were occasionally inscrutable, but it’s unlikely that they represent encoded reasoning. Below are examples where GPT-5 ‘gets stuck’ within its reasoning trace, and produces repeated dots before ‘snapping out of it’ and continuing on with the task.
3. GPT-5 randomly thinks @redwood_ai is running the evaluations. This is wild to me. Models are starting to show stronger signs of situational awareness during evaluations.
@nikolaj2030 speculated that it might mean "wait" or "continue". @HjalmarWijk suggested it might be RL with a reasoning-length penalty pushing models to even out probability mass across tokens to increase entropy, so words the model wouldn't often use start to expand in meaning.
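Not anyone's actual training code, just a toy sketch of the entropy intuition behind @HjalmarWijk's hypothesis: if optimization spreads probability mass more evenly across tokens, entropy rises and tokens the model rarely uses get more mass (all numbers below are made up).

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy next-token distributions over four hypothetical tokens.
peaked = softmax([5.0, 1.0, 0.5, 0.2])  # mass concentrated on one likely token
flat   = softmax([2.0, 1.8, 1.7, 1.6])  # mass evened out across tokens

print(f"peaked: entropy={entropy(peaked):.2f} nats, rarest token p={min(peaked):.3f}")
print(f"flat:   entropy={entropy(flat):.2f} nats, rarest token p={min(flat):.3f}")
# The flatter distribution has higher entropy, and previously-rare tokens
# (think "marinade") end up sampled far more often.
```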
2. GPT-5 invented the word "marinade" and used it many times during its reasoning. After inspecting transcripts, researchers don't have a clear sense of what it means.