![Maxime Rivest Profile](https://pbs.twimg.com/profile_images/1834591719356030976/m0v8shuf_x96.jpg)
Maxime Rivest
@MaximeRivest
Followers
360
Following
513
Statuses
1K
Distributor of Open WebUI at https://t.co/eOfBo6QJB7. | I build systems to run 1M prompts/day | 1M+ views on YT | AI understanding, experience, and inspiration for all
Joined January 2018
Finetunes are everywhere.
Google Chief Scientist Jeff Dean: "AI now generates 25% of Google's integrated code." Google has already trained a Gemini model on its internal codebase to help developers. This doesn't cover everything, but it improves AI-assisted coding by integrating the codebase into the model's parameters.
0
0
0
I am thinking of applications where one-off questions (prompts) are applied to databases of 1,000,000+ rows, interpolating many columns into the context of the prompt. So almost nothing seems overkill for my task.
My dream final system should take a user prompt and: 1) find good example rows for the user to manually classify 10-100 of them; 2) optimize the prompt for a large model; 3) apply it to ~10,000 rows to create a synthetic dataset; 4) fine-tune a small model; 5) apply the fine-tune to the whole dataframe.
1
0
0
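A runnable toy sketch of the five-step pipeline described in the tweet above, assuming a pandas dataframe; every helper here (pick_examples, optimize_prompt, label_with_large_model, finetune_small_model) is a hypothetical stand-in, not an existing library:

```python
"""Rough sketch of the 5-step flow above; every helper is a hypothetical stand-in."""
import pandas as pd

def pick_examples(df, n=50):
    # 1) pick candidate rows for manual labeling (here: a plain random sample)
    return df.sample(n=min(n, len(df)), random_state=0)

def optimize_prompt(user_prompt, labeled_examples):
    # 2) stand-in for a prompt optimizer (e.g. a DSPy-style compile step)
    shots = "\n".join(f"{r.text} -> {r.label}" for r in labeled_examples.itertuples())
    return f"{user_prompt}\n\nExamples:\n{shots}\n\nText: {{text}}\nLabel:"

def label_with_large_model(prompt, text):
    # 3) stand-in for a call to a large hosted model
    return "placeholder_label"

def finetune_small_model(synthetic_df):
    # 4) stand-in for a fine-tuning run; returns a callable "model"
    most_common = synthetic_df["label"].mode()[0]
    return lambda text: most_common

if __name__ == "__main__":
    df = pd.DataFrame({"text": ["row one", "row two", "row three"] * 4})
    labeled = pick_examples(df, n=5).assign(label="manual_label")        # 1) human labels
    prompt = optimize_prompt("Classify each row.", labeled)              # 2)
    synth = df.sample(n=6, random_state=1).copy()
    synth["label"] = [label_with_large_model(prompt, t) for t in synth.text]  # 3)
    small = finetune_small_model(synth)                                  # 4)
    df["label"] = [small(t) for t in df.text]                            # 5) whole dataframe
    print(df.head())
```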
Hmmm, I see what you mean. vLLM announced some optimizations for DeepSeek (and other MoEs) last week. Could it be that we are still really under-optimizing it? What I want to see (or try) is a benchmark between SGLang and vLLM for DeepSeek V2.5, Llama 405B, and DeepSeek V3/R1. DeepSeek should strongly outperform Llama because it can process batches faster, given it is an MoE, right? Also, I would love for the MoE to be very modular between domains; it could mean we could load fewer experts into VRAM. I have not seen any study of the expert distribution for each token. Do you know if the 'load balancing' of the MoE, done at training, kind of guarantees that it's all over the place?
0
0
1
The era of fine-tuned small models is now!!!
We've post-trained some really good models on top of Llama 3.3 that far surpass 4o-mini and 3.5-Haiku and match 4o and 3.5-Sonnet for answer quality, despite being way cheaper and blazing fast! Users are loving it already based on retention metrics. This is thanks to our work on building our custom inference stack on top of NVIDIA GPUs as well as exploring new hardware like Cerebras!
0
0
2
Yes!! I love big inference. I am just starting to marry it to dspy. From what I have seen in the docs, dspy is designed more 1-to-1. That makes sense for the model designing the prompts, but less for the model doing the task, especially when the task is one of those that you mention. Do you know if there is an easy 'big inference' mode?
1
0
1
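One way to approximate a do-it-yourself 'big inference' mode is to fan a single DSPy predictor out over rows with a thread pool; a minimal sketch, where the model id, signature, and rows are placeholders and an API key is assumed to be configured:

```python
# Sketch: fan a DSPy predictor out over many rows with a thread pool.
# Model id, signature, and rows are placeholders, not a built-in "big inference" mode.
from concurrent.futures import ThreadPoolExecutor
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # any LiteLLM-style model id
classify = dspy.Predict("text -> label")            # 1-to-1 module, reused per row

rows = ["review one", "review two", "review three"]  # imagine 1M+ of these

def run_one(text: str) -> str:
    return classify(text=text).label

with ThreadPoolExecutor(max_workers=16) as pool:
    labels = list(pool.map(run_one, rows))

print(list(zip(rows, labels)))
```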
Fine tune, fine tune, fine tune. It may be the year of Agents and Voice, but it is also the year of fine-tunes.
Apple MLX fine-tuning of Qwen/Qwen2.5-7B-Instruct on Italian wine classification. Base model: 62% accuracy. Fine-tuned model: 82% accuracy. Fine-tuning works! I still remember 1 year ago when people were telling me prompting + few-shot examples were enough.
1
1
5
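A minimal sketch of that base-vs-fine-tuned comparison with mlx_lm, assuming a LoRA adapter (e.g. trained with `mlx_lm.lora`) saved in ./adapters; the test pairs and prompt wording are placeholders, not the actual wine dataset:

```python
# Sketch of the base-vs-finetuned accuracy comparison with mlx_lm.
# Assumes a LoRA adapter saved in ./adapters; test pairs are placeholders.
from mlx_lm import load, generate

test_set = [("Nebbiolo, tar and roses, from Piedmont", "Barolo")]  # hypothetical pairs

def accuracy(model, tokenizer):
    hits = 0
    for text, gold in test_set:
        prompt = f"Classify this Italian wine description.\n{text}\nLabel:"
        out = generate(model, tokenizer, prompt=prompt, max_tokens=8)
        hits += gold.lower() in out.lower()
    return hits / len(test_set)

base = load("Qwen/Qwen2.5-7B-Instruct")
tuned = load("Qwen/Qwen2.5-7B-Instruct", adapter_path="adapters")
print("base:", accuracy(*base), "fine-tuned:", accuracy(*tuned))
```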
@TrelisResearch I was taking the number from this other post they made. Am I misinterpreting median throughput?
1
0
1
Yet another example that the future is one of small fine-tuned models reached through a dispatcher.
I tested fine-tuning of Qwen/Qwen2.5-3B-Instruct with my wine_classification tests and it rocks! From 54% to 70% after 0.5 epochs! I added a new wine_mlx_server_unstructured.py provider without structured output.
0
0
1
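The wine_mlx_server_unstructured.py provider itself isn't shown, but the general idea of "no structured output" can be sketched as a plain chat completion against a local OpenAI-compatible endpoint (for example one started with `mlx_lm.server`), with the label pulled out of free text; the endpoint, port, and label list below are assumptions:

```python
# Sketch of an "unstructured" provider: plain chat completion, label parsed from free text.
# Endpoint/port and the label list are assumptions; this is not the actual
# wine_mlx_server_unstructured.py from the tweet.
import requests

LABELS = ["Barolo", "Chianti", "Amarone"]                 # placeholder label set
URL = "http://localhost:8080/v1/chat/completions"         # e.g. an mlx_lm.server instance

def classify_unstructured(text: str) -> str:
    resp = requests.post(URL, json={
        "messages": [{"role": "user",
                      "content": f"Which wine is this? Answer with one name only.\n{text}"}],
        "max_tokens": 16,
    })
    answer = resp.json()["choices"][0]["message"]["content"]
    # No JSON schema / structured output: just look for a known label in the free text.
    return next((l for l in LABELS if l.lower() in answer.lower()), answer.strip())

print(classify_unstructured("Nebbiolo, tar and roses, from Piedmont"))
```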
Yes, my understanding is that it still comes out at about 50 tok/sec (no?) for a given sequence, based on my (imperfect) understanding of datacrunch's other tables. But my theory is not strong enough to predict this very precisely. That's why I'll run it soon (if I don't find someone doing it first).
I focus more on sheer token output, as this is more important to my application, and for things like 'deep research' latency does not matter anymore.
1
0
1
Everything points to small specialized models.
What if a [MASK] was all you needed? ModernBERT is great, but we couldn't stop wondering if it could be greater than previous encoders in different ways. Maybe we don't need task-specific heads? Maybe it can do all sorts of tasks with only its generative head? Spoiler: yes.
0
0
0
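The "[MASK] is all you need" idea can be tried directly with the standard fill-mask pipeline: phrase a task as a cloze and let the MLM head pick the word. A minimal sketch, where the prompt wording and label words are assumptions rather than the paper's recipe:

```python
# Sketch: classification through ModernBERT's MLM head alone, phrased as a [MASK] cloze.
# The prompt template and label-word set are assumptions.
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

review = "This movie was a complete waste of time."
preds = fill(f"Overall, this review is [MASK]: {review}", top_k=10)

label_words = {"positive", "negative", "good", "bad"}
for p in preds:
    if p["token_str"].strip().lower() in label_words:
        print(p["token_str"], p["score"])
        break
```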
RT @ivanfioravanti: My family AI server now has access to o3-mini, web search through searxng, and Flux image generation through Replicate…
0
2
0
@rasbt Should go for more complex sciences. AI is fuzzy, and I train on numerical ecology. It does not get more complex than life at that scale.
0
0
0
@DaveShapi But also, for inference, data centers are unnecessary. Each village, town, and city can host their own inference server, heating up the city hall at the same time :D
A remote village of 1,000 families could provide A(G?)I (for free) for all its citizens for 5-10 years for the same price as building a new 1/3 mile of road... the implications of every city doing that would be crazy!!! Here is the proof 🧵
0
0
0