![kessler Profile](https://pbs.twimg.com/profile_images/1826368902181302272/xhDPZcAF_x96.jpg)
kessler (@k_ssl_r)
Followers: 187 · Following: 171 · Statuses: 71
part-time fortune teller, est 2008
usa
Joined August 2023
@doomslide @7oponaut near certainly ARC; scaffolded ARC with program synthesis was 52%? don't think anything, scaffolded or not, got over like 2% on FrontierMath pre-o3
1
0
2
@aidan_mclau @teortaxesTex feeling a Mistral-esque structure where at some point something like Deepseek-R2-236B-reasoner-gpt-o1-whatever-the-hell-2504 goes open-weight noncommercial like Mistral Large
0
0
1
@Grad62304977 @aidan_mclau I like 30B; Cohere Command R / Qwen2.5 32B can do a lot imo. <10B feels too small and >=70B feels too big. Goldilocks optimal. idk re scaling down though
0
0
3
@iamwaynechi @YouJiacheng @teortaxesTex is it possible for you to share numbers for the mean? had the same experience w/ 100+ concurrent calls but never actually ran the numbers and just a bit curious
1
0
2
@YouJiacheng @teortaxesTex with high concurrency the DeepSeek API would literally have a 1 tps return rate for me. diff from first-token latency ofc but still seems like it'd get in the way for cc
0
0
5
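"Running the numbers" in the two tweets above amounts to firing N concurrent requests, recording per-request wall time and completion tokens, and averaging. A minimal sketch, assuming an OpenAI-style chat completions endpoint; the URL, model name, and API key below are placeholders, not DeepSeek's actual values.

```python
import asyncio
import time

import httpx

# Placeholders: not a real endpoint, model, or key.
API_URL = "https://api.example.com/v1/chat/completions"
MODEL = "placeholder-model"
API_KEY = "YOUR_KEY"
N_CONCURRENT = 100

async def timed_call(client: httpx.AsyncClient) -> tuple[float, int]:
    """One chat completion; returns (wall seconds, completion tokens)."""
    start = time.perf_counter()
    resp = await client.post(
        API_URL,
        json={"model": MODEL,
              "messages": [{"role": "user", "content": "Say hello."}]},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=120.0,
    )
    elapsed = time.perf_counter() - start
    return elapsed, resp.json()["usage"]["completion_tokens"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(timed_call(client) for _ in range(N_CONCURRENT))
        )
    mean_latency = sum(lat for lat, _ in results) / len(results)
    # Per-request "return rate": completion tokens / wall time, then averaged.
    mean_tps = sum(tok / lat for lat, tok in results) / len(results)
    print(f"mean latency: {mean_latency:.2f}s  "
          f"mean return rate: {mean_tps:.1f} tok/s")

asyncio.run(main())
```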
@xprunie @khushkhushkhush @aaruHQ @seekingtau @virtualned our election model wasn’t up a few months ago, not true lol
0
0
1
@teortaxesTex Mean Response Length / Arena Hard Score
```
Llama 3.1 70B          -> 31.0
Llama 3.1 Nemotron 70B -> 25.8
Llama 3.1 405B         -> 24.0
GPT-4o                 -> 22.0
Claude 3.5 Sonnet      -> 20.4
```
0
0
0
RT @teortaxesTex: > llama and Gpt4o have identical mean response length > MRL of llama-nemotron increases by 27% > LLM as a judge evals Th…
0
1
0
skeptical of the benchmarks being length biased. check the mean response length. obviously not a perfect heuristic, but if you divide the Mean Response Length by the Arena Hard Score (to balance for length bias), you get the following (where a lower score is better):
```
Llama 3.1 70B          -> 31.0
Llama 3.1 Nemotron 70B -> 25.8
Llama 3.1 405B         -> 24.0
GPT-4o                 -> 22.0
Claude 3.5 Sonnet      -> 20.4
```
which makes more sense to me
0
0
0
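The heuristic above is just a ratio: Mean Response Length divided by Arena Hard Score, lower being better because a verbose model must earn proportionally more score to rank well. A minimal sketch of the computation; the (MRL, Arena Hard) inputs below are hypothetical, since the tweet only reports the resulting ratios, not the raw numbers behind them.

```python
# Hypothetical inputs chosen purely to illustrate the ratio; the tweet
# does not give the underlying MRL or Arena Hard values.
def length_adjusted(mean_response_length: float, arena_hard_score: float) -> float:
    """Length-bias-adjusted score: MRL / Arena Hard, lower is better."""
    return mean_response_length / arena_hard_score

hypothetical = {
    "model-a": (620.0, 20.0),  # verbose: ratio 31.0
    "model-b": (510.0, 25.0),  # terser:  ratio 20.4
}
for name, (mrl, score) in sorted(hypothetical.items(),
                                 key=lambda kv: length_adjusted(*kv[1])):
    print(f"{name:8s} -> {length_adjusted(mrl, score):.1f}")
```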