![kessler Profile](https://pbs.twimg.com/profile_images/1826368902181302272/xhDPZcAF_x96.jpg)
kessler (@k_ssl_r)
Followers: 187 · Following: 171 · Statuses: 71
part-time fortune teller, est 2008
usa
Joined August 2023
@doomslide @7oponaut near certainly ARC; scaffolded ARC with program synthesis was 52%? don't think anything, scaffolded or not, got over like 2% on FrontierMath pre-o3
1
0
2
@aidan_mclau @teortaxesTex feeling a Mistral-esque structure where at some point something like Deepseek-R2-236B-reasoner-gpt-o1-whatever-the-hell-2504 goes open-weight noncommercial like Mistral Large
0
0
1
@Grad62304977 @aidan_mclau I like 30B; Cohere Command R / Qwen2.5 32B can do a lot imo. <10B feels too small and >=70B feels too big. Goldilocks optimal. idk re scaling down though
0
0
3
@iamwaynechi @YouJiacheng @teortaxesTex is it possible for you to share numbers for the mean? had the same experience w/ 100+ concurrent calls but never actually ran the numbers and just a bit curious
1
0
2
@YouJiacheng @teortaxesTex with high concurrency the DeepSeek API would literally have a 1 tps return rate for me. diff from first-token latency ofc but still seems like it'd get in the way for cc
0
0
5
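"Running the numbers" in the two tweets above amounts to firing N concurrent requests, recording per-request wall time and completion tokens, and averaging. A minimal sketch, assuming an OpenAI-style chat completions endpoint; the URL, model name, and API key below are placeholders, not DeepSeek's actual values.

```python
import asyncio
import time

import httpx

# Placeholders: not a real endpoint, model, or key.
API_URL = "https://api.example.com/v1/chat/completions"
MODEL = "placeholder-model"
API_KEY = "YOUR_KEY"
N_CONCURRENT = 100

async def timed_call(client: httpx.AsyncClient) -> tuple[float, int]:
    """One chat completion; returns (wall seconds, completion tokens)."""
    start = time.perf_counter()
    resp = await client.post(
        API_URL,
        json={"model": MODEL,
              "messages": [{"role": "user", "content": "Say hello."}]},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=120.0,
    )
    elapsed = time.perf_counter() - start
    return elapsed, resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(timed_call(client) for _ in range(N_CONCURRENT))
        )
    mean_latency = sum(lat for lat, _ in results) / len(results)
    # Per-request "return rate": completion tokens / wall time, then averaged.
    mean_tps = sum(tok / lat for lat, tok in results) / len(results)
    print(f"mean latency: {mean_latency:.2f}s  "
          f"mean return rate: {mean_tps:.1f} tok/s")

asyncio.run(main())
```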
@xprunie @khushkhushkhush @aaruHQ @seekingtau @virtualned our election model wasn’t up a few months ago, not true lol
0
0
1
@teortaxesTex Mean Response Length / Arena Hard Score
```
Llama 3.1 70B          -> 31.0
Llama 3.1 Nemotron 70B -> 25.8
Llama 3.1 405B         -> 24.0
GPT-4o                 -> 22.0
Claude 3.5 Sonnet      -> 20.4
```
0
0
0
RT @teortaxesTex: > llama and Gpt4o have identical mean response length > MRL of llama-nemotron increases by 27% > LLM as a judge evals Th…
0
1
0
skeptical of the benchmarks being length biased. check the mean response length. obviously not a perfect heuristic, but if you divide the Mean Response Length by the Arena Hard Score (to balance for length bias), you get the following (where a lower score is better):
```
Llama 3.1 70B          -> 31.0
Llama 3.1 Nemotron 70B -> 25.8
Llama 3.1 405B         -> 24.0
GPT-4o                 -> 22.0
Claude 3.5 Sonnet      -> 20.4
```
which makes more sense to me
0
0
0
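The heuristic above is just a ratio: Mean Response Length divided by Arena Hard Score, lower being better because a verbose model must earn proportionally more score to rank well. A minimal sketch of the computation; the (MRL, Arena Hard) inputs below are hypothetical, since the tweet only reports the resulting ratios, not the raw numbers behind them.

```python
# Hypothetical inputs chosen purely to illustrate the ratio; the tweet
# does not give the underlying MRL or Arena Hard values.
def length_adjusted(mean_response_length: float, arena_hard_score: float) -> float:
    """Length-bias-adjusted score: MRL / Arena Hard, lower is better."""
    return mean_response_length / arena_hard_score

hypothetical = {
    "model-a": (620.0, 20.0),  # verbose: ratio 31.0
    "model-b": (510.0, 25.0),  # terser:  ratio 20.4
}
for name, (mrl, score) in sorted(hypothetical.items(),
                                 key=lambda kv: length_adjusted(*kv[1])):
    print(f"{name:8s} -> {length_adjusted(mrl, score):.1f}")
```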