
Vincent
@vvvincent_c
Followers 488 · Following 5K · Media 91 · Statuses 1K
research @METR_Evals undergrad @Cornell | prev @veritasium @atlasfellow
Joined April 2019
in cursor, when i right click + copy an image, it doesn't save to my clipboard. i have to spam it 2-3 times before it's actually saved. how...
(i do not condone reckless driving, this is very dangerous, and don't do this)
f1's version of building in public https://t.co/TREb309TEK
absolutely insane end to cji 2... some commentary:
- day 1 was pretty disappointing. i don't want to be a hater, but submissions make the event. most of the matches were unsuccessful leg entanglements, guard work, and slow wrestling. the female matches were great tho.
- the new
wow 73% of cornell cs students don't believe GPT-5 has "college level language understanding"
what's the difference between teaching a foreign language at a college vs a high school?
is this ever going to be fixed https://t.co/nCfDq6GhSf
read the actual post for even more details! https://t.co/zTNDc7mN5r
7. A chunk of GPT-5 failures might be due to "spurious" bugs, which, when patched, may increase the time horizon even further.
6. 18/789 (2.3%) runs included reward hacks. It's not a huge amount, but if these data points had been used, the median would've been 3 hours (30% higher) compared to 2 hours and 17 mins.
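A quick back-of-the-envelope check of the figures in the tweet above (the run counts and medians come from the tweet itself, not from METR's data or analysis code):

```python
from datetime import timedelta

# Figures quoted in the tweet above (assumptions, not pulled from METR's dataset).
total_runs = 789
reward_hack_runs = 18

median_excluding_hacks = timedelta(hours=2, minutes=17)  # reward-hacked runs excluded
median_including_hacks = timedelta(hours=3)              # reward-hacked runs kept in

hack_rate = reward_hack_runs / total_runs
uplift = median_including_hacks / median_excluding_hacks - 1

print(f"reward-hack rate: {hack_rate:.1%}")             # ~2.3%
print(f"median uplift if hacks counted: {uplift:.0%}")  # ~31%, the "30% higher" above
```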
I would've thought an 8-hour time horizon (almost a whole workday!) might already be a 10x boost, but other factors, such as higher context and code quality standards, mean automating R&D won't be that easy.
5. A 10x uplift could require a week-long (40-hour) time horizon.
This may just be the tip of the iceberg. Maybe METR simply didn't have access to many wackier reasoning traces.
4. Reasoning traces were occasionally inscrutable, but it’s unlikely that they represent encoded reasoning. Below are examples where GPT-5 ‘gets stuck’ within its reasoning trace, and produces repeated dots before ‘snapping out of it’ and continuing on with the task.
3. GPT-5 randomly thinks @redwood_ai is running the evaluations. This is wild to me. Models are starting to show stronger signs of situational awareness during evaluations.
@nikolaj2030 speculated that it might mean "wait" or "continue". @HjalmarWijk suggested it might be RL with a reasoning-length penalty pushing models to even out probability mass across tokens to increase entropy, so words the model wouldn't often use start to expand in meaning.
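Not anyone's actual training code, just a toy sketch of the entropy intuition behind @HjalmarWijk's hypothesis: if optimization spreads probability mass more evenly across tokens, entropy rises and tokens the model rarely uses get more mass (all numbers below are made up).

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy next-token distributions over four hypothetical tokens.
peaked = softmax([5.0, 1.0, 0.5, 0.2])  # mass concentrated on one likely token
flat   = softmax([2.0, 1.8, 1.7, 1.6])  # mass evened out across tokens

print(f"peaked: entropy={entropy(peaked):.2f} nats, rarest token p={min(peaked):.3f}")
print(f"flat:   entropy={entropy(flat):.2f} nats, rarest token p={min(flat):.3f}")
# The flatter distribution has higher entropy, and previously-rare tokens
# (think "marinade") end up sampled far more often.
```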
2. GPT-5 invented the word "marinade" and used it many times during its reasoning. After inspecting transcripts, researchers don't have a clear sense of what it means.