Vincent Profile
Vincent

@vvvincent_c

Followers
488
Following
5K
Media
91
Statuses
1K

research @METR_Evals undergrad @Cornell | prev @veritasium @atlasfellow

Joined April 2019
@vvvincent_c
Vincent
2 days
it might be getting out of hand...
1
0
4
@vvvincent_c
Vincent
6 days
in cursor, when i right click + copy an image, it doesn't save to my clipboard. i have to spam it 2-3 times before it's actually saved. how...
0
0
5
@vvvincent_c
Vincent
15 days
it's that simple
3
0
22
@vvvincent_c
Vincent
18 days
(i do not condone reckless driving, this is very dangerous, and don't do this)
0
0
0
@vvvincent_c
Vincent
18 days
f1's version of building in public https://t.co/TREb309TEK
@k1ragoat
k’ ▪️
18 days
1
0
1
@vvvincent_c
Vincent
23 days
absolutely insane end to cji 2... some commentary: - day 1 was pretty disappointing. i don't want to be a hater, but submissions make the event. most of the matches were unsuccessful leg entanglements, guard work, and slow wrestling. the female matches were great tho. - the new
1
0
5
@vvvincent_c
Vincent
23 days
wow 73% of cornell cs students don't believe GPT-5 has "college level language understanding"
300
68
6K
@vvvincent_c
Vincent
24 days
what's the difference between teaching a foreign language at a college vs a high school?
3
0
6
@vvvincent_c
Vincent
1 month
is this ever going to be fixed https://t.co/nCfDq6GhSf
@vvvincent_c
Vincent
5 months
when i use X dms on my computer, my screen randomly glitches every ~30 seconds or so. anyone else?
1
0
1
@vvvincent_c
Vincent
1 month
read the actual post for even more details! https://t.co/zTNDc7mN5r
0
0
1
@vvvincent_c
Vincent
1 month
7. A chunk of GPT-5 failures might be due to "spurious" bugs, which, when patched, may increase the time horizon even further.
1
0
3
@vvvincent_c
Vincent
1 month
6. 18/789 (2.3%) runs included reward hacks. It's not a huge amount, but if these data points had been used, the median would've been 3 hours (30% higher) compared to 2 hours and 17 mins.
1
0
2
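A quick sanity check of the arithmetic in point 6 above, as a minimal sketch. It uses only the figures quoted in the post (18 of 789 runs, a 3-hour median versus 2 hours 17 minutes); the variable names are illustrative and do not come from METR's code or data.

```python
# Figures quoted in the post above; all variable names are illustrative.
reward_hack_runs = 18
total_runs = 789

median_with_hacks_min = 3 * 60          # 3 hours, in minutes
median_without_hacks_min = 2 * 60 + 17  # 2 hours 17 minutes, in minutes

# Share of runs that included a reward hack, as a percentage.
hack_rate_pct = 100 * reward_hack_runs / total_runs

# How much higher the median would have been with those runs included.
relative_increase_pct = 100 * (median_with_hacks_min / median_without_hacks_min - 1)

print(f"reward-hack rate: {hack_rate_pct:.1f}%")        # 2.3%
print(f"median increase: {relative_increase_pct:.0f}%")  # 31%
```

The ratio comes out to roughly 31%, which matches the post's rounded "30% higher".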
@vvvincent_c
Vincent
1 month
I would've thought an 8-hour time horizon (almost a whole day's worth of work!) might already be a 10x boost, but other factors, such as higher context and code quality standards, will make automated R&D harder to achieve.
1
0
1
@vvvincent_c
Vincent
1 month
5. A 10x uplift could require a week-long (40 hours) time horizon.
1
0
2
@vvvincent_c
Vincent
1 month
This may just be the tip of the iceberg. Maybe METR simply didn't have access to many wackier reasoning traces.
1
0
2
@vvvincent_c
Vincent
1 month
4. Reasoning traces were occasionally inscrutable, but it’s unlikely that they represent encoded reasoning. Below are examples where GPT-5 ‘gets stuck’ within its reasoning trace, and produces repeated dots before ‘snapping out of it’ and continuing on with the task.
1
0
2
@vvvincent_c
Vincent
1 month
3. GPT-5 randomly thinks @redwood_ai is running the evaluations. This is wild to me. Models are starting to show stronger signs of life in situational awareness in evaluations.
2
0
4
@vvvincent_c
Vincent
1 month
@nikolaj2030 speculates that it might mean "wait" or "continue". @HjalmarWijk suggested it might be RL with a reasoning-length penalty pushing the model to even out probability mass across tokens to increase entropy, so that words it wouldn't often use start to expand in meaning.
1
0
3
@vvvincent_c
Vincent
1 month
2. GPT-5 invented the word "marinade" and used it many times during its reasoning. After inspecting transcripts, researchers don't have a clear sense of what it means.
1
0
3