![Blaze (Balázs Galambosi) Profile](https://pbs.twimg.com/profile_images/862592191969079298/8SK2YdOn_x96.jpg)
Blaze (Balázs Galambosi)
@gblazex
Followers
1K
Following
4K
Statuses
4K
A Smooth Guy; Developer of SmoothScroll for macOS, Windows & Google Chrome.
Joined April 2010
Looking further into LLM benchmark x-correlations:
- Top row: how each benchmark relates to human judgement (Arena Elo)
- Other rows: any benchmark pair & their relationship
- On the right: samples = # of models tested for each benchmark
thx: @chipro @maximelabonne @ldjconfirmed
@AlphaSignalAI @ClementDelangue I pretty much only trust two LLM evals right now: Chatbot Arena and r/LocalLlama comments section
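The cross-correlation chart described above (benchmark vs. Arena Elo, benchmark vs. benchmark, plus a sample count per benchmark) could be computed roughly as follows. This is only a sketch under stated assumptions: the post does not say which correlation measure was used, so Pearson is assumed here, and the benchmark names, model names, and scores are hypothetical placeholders, not real results.

```python
from math import sqrt

# Hypothetical toy scores: benchmark -> {model -> score}.
# All names and numbers are illustrative placeholders, not real results.
scores = {
    "arena_elo": {"model_a": 1250.0, "model_b": 1180.0, "model_c": 1100.0, "model_d": 1050.0},
    "bench_x": {"model_a": 0.86, "model_b": 0.79, "model_c": 0.70},
    "bench_y": {"model_b": 0.40, "model_c": 0.55, "model_d": 0.35},
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def pairwise_correlation(table, a, b, min_models=3):
    """Correlate benchmarks a and b over the models scored on both.

    Returns (correlation or None, number of shared models).
    """
    shared = sorted(set(table[a]) & set(table[b]))
    if len(shared) < min_models:
        return None, len(shared)  # too few shared models to say anything
    xs = [table[a][m] for m in shared]
    ys = [table[b][m] for m in shared]
    return pearson(xs, ys), len(shared)


def samples_per_benchmark(table):
    """Number of models tested on each benchmark (the 'samples' column)."""
    return {name: len(models) for name, models in table.items()}


def correlation_table(table):
    """Correlation and overlap size for every unordered benchmark pair."""
    names = list(table)
    return {
        (a, b): pairwise_correlation(table, a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    for pair, (r, n) in correlation_table(scores).items():
        print(pair, r, n)
    print(samples_per_benchmark(scores))
```

Restricting each pair to the models scored on both benchmarks matters because coverage differs per benchmark, which is presumably why the chart reports sample counts alongside the correlations.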
New Arena for audio models
People like to talk as it's easy and natural. Now that there are Large *Audio* Models 🔊, which model do users like the most? Introducing Talk Arena🎤: an open platform where users speak to LAMs and receive text responses. Through open interaction, we focus on rankings based on user preferences rather than static benchmarks. Jump on Talk Arena to compare 🗳️ speech AI🧵 (1/5)
RT @agromanou: 🚀 Introducing INCLUDE 🌍: A multilingual LLM evaluation benchmark spanning 44 languages! Contains *newly-collected* data, pri…
Why is Google trends graph so different than this?
it cannot be underestimated the extent to which Claude 3.5 Sonnet alone is solely responsible for this 2xing of @AnthropicAI market share absolute home run model
@karpathy Ask 10 experts is a huge win. Even better is ask 10 experts to rate 10 alternatives and pick the best one based on key dimensions. Path not traveled in your search for close to optimal solution. It does have all human knowledge but in a shallow sense. Go not farther but faster.
RT @ilanbigio: turns out you can use <xml/> tags with the realtime api to control tone with _super_ high granularity 🤷🏽‍♂️ @openai devday…
Hard real-life coding evals!
Who's the best AI software engineer? Introducing RepoChat Arena: the live AI software engineering battle!🔥🤖
1. Input any public Github link (repo/issue/PR).
2. Ask the models to fix issues, add features or chat with a repo.
3. Vote for the better one and shape the leaderboard!
Watch AI solve your real-world coding tasks live at RepoChat! Exciting use cases in the thread below🧵