Ilya Abyzov
@IlyaAbyzov
Followers: 4K
Following: 5K
Statuses: 3K
Ex-ex-engineer building whatever seems funniest at the moment. Co-founder @goforward. Launched uberX & led Uber SF
San Francisco
Joined April 2010
Inspired by @karpathy and the idea of using games to compare LLMs, I've built a version of the game Codenames where different models are paired in teams to play the game with each other. Fun to see o3-mini team with R1 against Grok and Gemini! Link and repo below.
I quite like the idea of using games to evaluate LLMs against each other, instead of fixed evals. Playing against another intelligent entity self-balances and adapts difficulty, so each eval (/environment) is leveraged a lot more. There are some early attempts around. Exciting area.
56
228
3K
@tickerBITCOINbb @punk9059 Damn, hope they refunded your lift pass so you at least got to ski free
0
0
2
Say what you will about the destruction of basic ethics and decency, but if we can cancel some SaaS seats and save 0.00000001% of the federal budget to buy three new rivets on a Boeing strategic bomber instead, it will have been well worth it.
There are tens of millions of media & software subscriptions paid by the federal government – your tax dollars – that show ZERO usage!!
1
0
14
@polynoamial I think I'm 2/3rds of the way there with something like this, which I can extend to more types of games. How would you suggest making it a more proper eval? Persistent leaderboards a la Chatbot Arena? Use Elo or something else?
0
0
2
@kadikraman That would be great. Was confused about best practices on how to mix tab vs modal routing in my first Expo project. When to present things from left vs bottom, how to use drawers well, how to provide expected back button behavior on screens reachable from different places, etc.
0
0
3
@iamjakestream Pretty sure Travis once said: “It’s like there’s a train coming by with bags of money on it and it’s irresponsible not to take the bags off the train since you don’t know if it’ll come around again” Definitely works unless it doesn’t.
1
0
3
@iamjakestream Would have taken you up on it, but I’m an AI board game entrepreneur now. $0 in topline but a cool -$80 in EBITDA once I count the API costs. Preseed oversubscribed
1
0
1
@rauchg They really are! Should I be using Vercel instead of CF? I really like CF workers and Vite but keeping an open mind.
0
1
6
@ambelamps @devahaz What about special forces guy forced by circumstances to coach his kid's little league team?
1
0
3