Excited to share I've joined
@huggingface
🤗 as an ML Research Engineer on the multimodal team! Excited to see what amazing new models and datasets we can create for the community!
🚀 Fine-tune Florence-2 on any task!
We are releasing fine-tuning scripts for Microsoft's Florence-2, along with a walkthrough blog post, a Space demo, and a Colab notebook.
@mervenoyann
@skalskip92
🧵
We added idefics2 and idefics2-chatty to the Unsolvable Problem Detection Leaderboard. 🚀 This benchmark was developed to measure the robustness of VLMs by asking them unanswerable questions about images.
#MachineLearning
#AI
#NLP
🧵
🚀
@argilla_io
is joining
@huggingface
🤗!
Argilla is the leading company in dataset creation, with a ton of open-source contributions! Much of their work goes toward enabling multilingual LLMs all over the world!
Oh, and they're also co-authors of Zephyr ORPO!
I've been working on a synthetic dataset and every day I optimize the script. So far I got:
Monday -> ETA 12 days
Tuesday -> ETA 6 days
Wednesday -> ETA 2 days
Today -> ETA 4 hours
My goal is to make it so efficient that I can 10x the original intended size 🚀🤗
@Ahmed_Masry97
@mervenoyann
@skalskip92
That's awesome! Let me know how it goes! You can start here:
You only need to do:
python distributed_train.py --dataset cauldron --epochs 5 --eval_steps 20000
Maybe playing a bit with the learning rate would also help :)
@Ahmed_Masry97
@mervenoyann
@skalskip92
You need to consider that we only fine-tuned on DocVQA, a pretty small dataset. The code to fine-tune on The Cauldron is up there and would probably achieve better performance, but I'm GPU poor :(
Florence-2 fine-tuned on DocVQA excels at retrieving information from images! Just a few final tweaks, and I'll release the code for anyone to fine-tune for their own tasks. Stay tuned! 😊📚
Apple Intelligence is a huge tailwind for multimodal models. To do 'on-screen awareness', they need to understand images, text, audio, and video. I'm super excited to work on creating these models for the open-source community!
@huggingface
@Laz4rz
@pixqc
@wateriscoding
The use cases are particular. Like, you don't want a proper database, but you need SQL speed on complex queries. I almost dropped it completely, but just a few days ago I thought about using it for the dataset I'm generating. In the end, I got more speed-ups out of other things.
@Laz4rz
@pixqc
@wateriscoding
It allows for SQL queries (pandas as well) and it's pretty fast at them. I don't think it loads everything into memory, but I didn't profile it.
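For the curious, here's a minimal sketch of the SQL-over-DataFrames pattern. The tweet doesn't name the tool, so I'm assuming it's Polars (the library named in the parallel thread below) and its SQLContext; the table and columns are made up:

import polars as pl

# Hypothetical table; the real dataset isn't shown in the thread.
df = pl.DataFrame({
    "doc_id": [1, 1, 2, 3],
    "tokens": [512, 128, 2048, 64],
})

# Register the frame under a table name, then query it with plain SQL.
ctx = pl.SQLContext(samples=df)
totals = ctx.execute(
    "SELECT doc_id, SUM(tokens) AS total_tokens FROM samples GROUP BY doc_id"
).collect()
print(totals)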
Overall, these results are fantastic for the open-source community, demonstrating that the gap between closed and open-source models is continuously shrinking! 🌐👏
#OpenSource
#AI
#Innovation
For the Absent Answer Detection task, idefics2-chatty is the best model in its class! It even beats GPT-4o and Gemini Pro in the hardest setting! 🏆
#MachineLearning
#AI
#NLP
I'm creating question/answer pairs from documents using an LLM. It mostly works well, but the number of questions doesn't scale linearly when I feed the model several pages. I tend to split the pages, but then the context becomes less rich. Any ideas?
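One way to keep splits from losing context (a hypothetical sketch, not a tested pipeline): prompt over overlapping page windows, so each call still sees a neighboring page. `generate_qa_pairs` stands in for the actual LLM call:

def windows(pages: list[str], size: int = 2, overlap: int = 1):
    # Yield overlapping slices of pages, e.g. [p0,p1], [p1,p2], [p2,p3], ...
    step = size - overlap
    for start in range(0, max(len(pages) - overlap, 1), step):
        yield pages[start:start + size]

def build_dataset(pages: list[str], generate_qa_pairs):
    qa_pairs = []
    for window in windows(pages):
        context = "\n\n".join(window)        # each chunk keeps a neighboring page
        qa_pairs.extend(generate_qa_pairs(context))
    return qa_pairs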
We fine-tuned Florence-2 with a small learning rate and froze the vision encoder. Our experiments ranged from a single A100 GPU to a powerful 8×H100 cluster, showing the potential for small setups 🚀💻
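The recipe above in miniature; a sketch only: the `vision_tower` attribute name and the 1e-6 learning rate are assumptions, not our exact training code.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)

# Freeze the image encoder; attribute name assumed for the Florence-2 port.
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Small learning rate over the remaining (language-side) parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-6
)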
@PriNova75
@huggingface
Right? It's crazy! I'm happy that at least we track it and show it on all the jobs we run, to create some awareness among engineers. Also, there's some incredible work by
@SashaMTL
that keeps us in check. But yes, the carbon emissions of ML training are nuts.
For the Incompatible Answer Set Detection task, idefics2-chatty performs exceptionally well, outperforming Gemini Pro and LLaVA-1.6-13B on base-type questions, though it falls behind on more complex queries. 📊
#AIResearch
#Tech
@xhinker
Hi Andrew! There's a bug in the original implementation that doesn't allow the models to be fine-tuned. We opened PRs on all the models to fix them, but you need to point the hub to our PRs or use the model I uploaded. Did you change the model you were fine-tuning?
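Pointing the hub at a PR looks roughly like this; hub PRs are exposed as git refs, and the PR number below is a placeholder (check the model's discussions tab for the real one):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large",
    revision="refs/pr/6",     # hypothetical PR number; hub PRs are git refs
    trust_remote_code=True,
)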
Excited to share my journey in Machine Learning and Sound Engineering! 🌟 With 10 years of experience and a PhD, from Argentina to Switzerland, speaking four languages and always learning. Let's innovate together! 🚀🎧
#MachineLearning
#SoundEngineering
#TechForGood
@SanhEstPasMoi
@Grammarly
I've been using it for two years now, and it has improved my messages and taught me to write better. It's super helpful when it says, 'Remove this for confidence.'
My only two gripes: 1) it constantly wants to delete my emojis :'( 2) only English :(
@bigemptyboulder
@huggingface
I had six rounds of interviews, but they were all pretty fair, on point, and not very time-consuming. My tip for a successful application: build cool stuff you can showcase so your profile stands out 🤗🚀
We fine-tuned the model on DocVQA, a hard benchmark, and it achieves 57% ANLS - outperforming Idefics2 (without fine-tuning) and DeepSeek-VL (with fine-tuning) 🤯
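For anyone unfamiliar, ANLS is Average Normalized Levenshtein Similarity, the standard DocVQA metric. A minimal implementation, using the benchmark's usual 0.5 threshold, might look like this:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    total = 0.0
    for pred, refs in zip(predictions, references):
        scores = []
        for ref in refs:
            # Normalized edit distance; scores below the tau threshold count as 0.
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            scores.append(1 - nl if nl < tau else 0.0)
        total += max(scores)      # best match across the reference answers
    return total / len(predictions)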
@mervenoyann
@pebkac_roll
And we're open about putting them to good use! If you have an impactful open-source project and need some compute power, definitely jump into my DMs 🚀
For the Incompatible Visual Question Detection task, idefics2-chatty matches the best model so far in the <10B class, Qwen-VL-Chat, but lags behind larger, closed-source models. Still, an impressive feat for open source! 🎨
#ComputerVision
#ML
@Laz4rz
@pixqc
@wateriscoding
I tried to use Polars in production and it would be a bit buggy in unexpected ways. To me, Polars shines when you need to process millions of rows, especially recurrently. It's a use case that comes up in many ML pipelines.
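That sweet spot in one sketch: Polars' lazy API builds a query plan and streams it, so millions of rows never have to sit in memory at once. The file name and columns here are invented:

import polars as pl

result = (
    pl.scan_csv("samples.csv")             # lazy: builds a plan, reads nothing yet
    .filter(pl.col("score") > 0.8)
    .group_by("label")
    .agg(pl.len().alias("n"))
    .collect()                             # executes scan + filter + group-by
)
print(result)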
@effemanfredi
Yes, the model has some custom code from Microsoft, and at some point it sets the model_type to an empty string. It's just a bug, but hard to squash 😅
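A hypothetical workaround, untested and just to show the shape of it: load the config first, restore model_type, then pass it in explicitly.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)
config.model_type = "florence2"   # assumed value; the bug leaves it as ""
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", config=config, trust_remote_code=True
)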
I added LoRA training to the Florence-2 fine-tuning repo. I expected that with LoRA I'd be able to fit a way larger batch size, but after reducing to 1% trainable params I can only fit 25% more samples per batch. Is that normal?
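A modest gain is actually expected: LoRA shrinks gradient and optimizer-state memory, but per-sample activation memory is unchanged, and activations usually dominate at batch time. For context, the setup in PEFT looks roughly like this; the target module names are assumptions about Florence-2's layer names:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-base-ft", trust_remote_code=True
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # shows the ~1% trainable fraction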
🚀 Exciting news! The MM-UPD benchmark is now live on
@huggingface
Hub as a leaderboard 🏆. It evaluates how vision language models handle unsolvable problems 🤓. Currently, top VLMs like GPT-4V and LLaVA-Next-34B struggle with it.
🔗 Link in the next tweet!
@LukaszBorchmann
@mervenoyann
@skalskip92
Hi 👋 I don't think a text-only model would achieve anything in this dataset. Where did you see a BERT base performing so well? I'm interested!
We didn't target SOTA performance here; performance would improve by training on a larger dataset like The Cauldron.
EPFL and Apple just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
Let's unpack 🧶
@mervenoyann
@huggingface
Awesome leaderboard! Interesting to see that there is only one model <10B
Seems like there's still some room for improvement in the field :)
For the tutorial on fine-tuning Florence-2: would you be more interested in seeing how to do it in a Colab notebook, or with a multi-GPU setup to get better performance?
Given the rise of big LLMs (100B+ params), it would be exciting to see more open-source experiments in distillation🥁
I'm happy to announce a 1000 A100 hours grant for open access work in the LLM distillation space 🤗 (a minimal loss sketch follows the examples)
Examples:
- Distilling an 8B model into a <1B model
- Writing
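A minimal sketch of one classic distillation objective, the softened-logit KL from Hinton et al., just to illustrate the kind of work the grant targets:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    # Soften both distributions with a temperature, then match them with KL.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2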
Here are a number of comparisons. For its size, it's great at captioning, but the larger models perform better.
It does best in visual question answering. Larger models sometimes perform better, but not always.
It's SOTA on Referring Expression Comprehension.
@Laz4rz
@Nigh8w0lf
@mervenoyann
@skalskip92
Yeah, they're super different annotations and results. But if you need to fine-tune for captioning, I would still start from one of those. Try to get a feel for which one is closest to what you want.
Have you tried to answer MMMU questions? They are frigging hard!
At least the paper mentions that college experts achieve 76-82% accuracy, so it's not just me that's dumber than an LLM xD
As a software engineer with 10 yoe, I use LLMs to code complex systems all the time. It’s like I have a small army of juniors that code whatever I tell them to and then I only need to review the code and put the pieces together.
@Laz4rz
Whenever you're limited by I/O. Here, it takes some time to load the images and submit them to the GPUs, so instead of waiting for the images to load, I load them in parallel while I wait for the GPUs to finish their tasks.
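A rough sketch of that overlap with a plain thread pool; `run_on_gpu` is a placeholder for the actual GPU step:

from concurrent.futures import ThreadPoolExecutor
from PIL import Image

def load(path: str) -> Image.Image:
    return Image.open(path).convert("RGB")

def run(paths: list[str], batch_size: int, run_on_gpu):
    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        next_batch = pool.map(load, batches[0])
        for i in range(len(batches)):
            images = list(next_batch)        # wait for the prefetched batch
            if i + 1 < len(batches):
                next_batch = pool.map(load, batches[i + 1])  # prefetch the next one
            run_on_gpu(images)               # threads keep loading while the GPU works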
@txhno
So far it's been on par with pip. Like pip-compile, uv generates a platform-specific requirements.txt file (unlike, e.g., poetry and pdm, which generate platform-agnostic poetry.lock and pdm.lock files).
@realmrfakename
@huggingface
For now, multimodal as in images+text, but I believe the field will naturally evolve to integrate audio and video into the mix 🌈
Tomorrow I'm getting certified as a climbing guide with the Swiss alpine club! I'm going on a multi-pitch climb with a group and made this nice poster as marketing :)
@Laz4rz
Yes, if you use torch, this is already optimized in the DataLoader and Trainer loops. I'm using HF's llm-swarm to spawn "servers" with LLMs where I submit tasks. Creating the tasks takes time, and I do it while I wait for the servers to respond.
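A generic asyncio sketch of that overlap (not llm-swarm's actual API): while one batch of requests is in flight, the CPU builds and submits the next.

import asyncio

async def query_server(prompt: str) -> str:
    await asyncio.sleep(0.1)               # placeholder for the real HTTP call
    return f"answer to: {prompt}"

def build_prompt(doc: str) -> str:         # placeholder; assume this is CPU-heavy
    return f"Generate questions about:\n{doc}"

async def main(docs: list[str], batch_size: int = 32) -> list[str]:
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    results, in_flight = [], []
    for batch in batches:
        tasks = [asyncio.create_task(query_server(build_prompt(d))) for d in batch]
        if in_flight:
            # Awaiting the previous batch also lets the new requests start.
            results.extend(await asyncio.gather(*in_flight))
        in_flight = tasks
    results.extend(await asyncio.gather(*in_flight))
    return results

# asyncio.run(main(["doc one", "doc two"]))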
@m_chirculescu
PaliGemma is a larger model, and it's already pre-trained for VQA, so it does better out of the box. However, you need more compute to fine-tune PaliGemma on your particular dataset, so I would choose Florence-2 if you need to fine-tune.