Ilya Abyzov @IlyaAbyzov profile

Ilya Abyzov

@IlyaAbyzov

Followers

4K

Following

5K

Statuses

3K

Ex-ex-engineer building whatever seems funniest at the moment. Co-founder @goforward. Launched uberX & led Uber SF

San Francisco

Joined April 2010

Ilya Abyzov

@IlyaAbyzov

12 days

Inspired by @karpathy and the idea of using games to compare LLMs, I've built a version of the game Codenames where different models are paired in teams to play the game with each other. Fun to see o3-mini team with R1 against Grok and Gemini! Link and repo below.

Andrej Karpathy

@karpathy

12 days

I quite like the idea of using games to evaluate LLMs against each other, instead of fixed evals. Playing against another intelligent entity self-balances and adapts difficulty, so each eval (/environment) is leveraged a lot more. There are some early attempts around. Exciting area.

56

228

3K

Ilya Abyzov

@IlyaAbyzov

2 days

Elon Musk

@elonmusk

3 days

@provisionalidea This retard thinks the government uses SQL

2

0

5

Ilya Abyzov

@IlyaAbyzov

2 days

@tickerBITCOINbb @punk9059 Damn, hope they refunded your lift pass so you at least got to ski free

0

2

Ilya Abyzov

@IlyaAbyzov

5 days

Say what you will about the destruction of basic ethics and decency, but if we can cancel some SaaS seats and save 0.00000001% of the federal budget to buy three new rivets on a Boeing strategic bomber instead, it will have been well worth it.

Elon Musk

@elonmusk

5 days

There are tens of millions of media & software subscriptions paid by the federal government – your tax dollars – that show ZERO usage!!

1

0

14

Ilya Abyzov

@IlyaAbyzov

5 days

@gizakdag Love this one

0

3

Ilya Abyzov

@IlyaAbyzov

5 days

Get rekt

0

1

Ilya Abyzov

@IlyaAbyzov

6 days

Aside from being a funny Hail Mary, it shows a trade-off with thinking models: you can’t run this much inference on a massive # of params, so you ablate some useful world knowledge (like that this isn’t how Codenames works) in exchange for clever depth of thought.

0

Ilya Abyzov

@IlyaAbyzov

7 days

@polynoamial I think I'm two-thirds of the way there with something like this, which I can extend to more types of games. How would you suggest making it a more proper eval? Persistent leaderboards a la Chatbot Arena? Use Elo or something else?
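
For reference, a minimal sketch of how an Elo-style leaderboard could work for team games like this one. It uses the standard Elo expected-score formula; everything else is an assumption for illustration and not taken from the actual repo: the starting rating of 1500, the K-factor of 32, treating a team's strength as the average of its members' ratings, and the helper names `expected_score`, `update_elo`, and `update_team_match`.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return new (A, B) ratings. score_a is 1.0 for a win, 0.5 draw, 0.0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def update_team_match(
    ratings: Dict[str, float],
    team_a: List[str],
    team_b: List[str],
    score_a: float,
    k: float = 32.0,
) -> None:
    """Update ratings in place after one team game.

    Assumption: team strength is the mean of member ratings, and every
    member of a team receives the same rating change.
    """
    avg_a = sum(ratings[m] for m in team_a) / len(team_a)
    avg_b = sum(ratings[m] for m in team_b) / len(team_b)
    new_avg_a, _ = update_elo(avg_a, avg_b, score_a, k)
    delta = new_avg_a - avg_a
    for member in team_a:
        ratings[member] += delta
    for member in team_b:
        ratings[member] -= delta


# Hypothetical leaderboard: every model starts at 1500 (an assumed default).
leaderboard = {"o3-mini": 1500.0, "R1": 1500.0, "Grok": 1500.0, "Gemini": 1500.0}
update_team_match(leaderboard, ["o3-mini", "R1"], ["Grok", "Gemini"], score_a=1.0)
```

Persisting `leaderboard` across many games, with rotating pairings, would give a running ranking in the spirit of Chatbot Arena. Splitting credit between the clue-giver and the guesser is a separate design question that this sketch does not address.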

0

2

Ilya Abyzov

@IlyaAbyzov

7 days

OpenAI pulled a trick that 100% of people fell for: Deep Research’s score on the test uses live retrieval from the web, which obviously means it is not an apples-to-apples comparison with any static model. They asterisked this in their table and everyone ignored the asterisk.

Tomas Pueyo

@tomaspueyo

8 days

It's coming

2

0

3

Ilya Abyzov

@IlyaAbyzov

7 days

@MarshallOsborne Sadly, it really was named UBERx at first:

0

Ilya Abyzov

@IlyaAbyzov

7 days

@MarshallOsborne o3-mini-high-BLACKx

1

0

Ilya Abyzov

@IlyaAbyzov

9 days

@kadikraman That would be great. Was confused about best practices on how to mix tab vs modal routing in my first Expo project. When to present things from left vs bottom, how to use drawers well, how to provide expected back button behavior on screens reachable from different places, etc

0

3

Ilya Abyzov

@IlyaAbyzov

9 days

@iamjakestream Pretty sure Travis once said: “It’s like there’s a train coming by with bags of money on it and it’s irresponsible not to take the bags off the train since you don’t know if it’ll come around again.” Definitely works unless it doesn’t.

1

0

3

Ilya Abyzov

@IlyaAbyzov

9 days

@iamjakestream Learned that one the hard way already!

1

0

2

Ilya Abyzov

@IlyaAbyzov

9 days

@iamjakestream Would have taken you up on it, but I’m an AI board game entrepreneur now. $0 in topline but a cool -$80 in EBITDA once I count the API costs. Preseed oversubscribed

1

0

1

Ilya Abyzov

@IlyaAbyzov

9 days

@rauchg They really are! Should I be using Vercel instead of CF? I really like CF workers and Vite but keeping an open mind.

0

1

6

Ilya Abyzov

@IlyaAbyzov

9 days

@ambelamps @devahaz What about special forces guy forced by circumstances to coach his kid's little league team?

1

0

3

Ilya Abyzov

@IlyaAbyzov

10 days

@oscarle_x @PawsMetax Yep, exactly.

0

2