tomaarsen

@tomaarsen

2,349 Followers · 203 Following · 112 Media · 505 Statuses

Sentence Transformers, SetFit & NLTK maintainer · Machine Learning Engineer at 🤗 Hugging Face

Netherlands
Joined December 2023
Pinned Tweet
@tomaarsen
tomaarsen
16 days
📣 Sentence Transformers v3.2.0 is out, marking the biggest release for inference in 2 years! 2 new backends for embedding models: ONNX (+ optimization & quantization) and OpenVINO, allowing for speedups up to 2x-3x + Static Embeddings for 500x speedups at 10-20% accuracy cost 🧵
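A minimal sketch of what the new backends look like in practice (model name illustrative; the `backend` argument is the one introduced in v3.2):

```python
from sentence_transformers import SentenceTransformer

# Load with the ONNX backend; an existing ONNX export is used if available
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="onnx")

# The OpenVINO backend works the same way:
# model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend="openvino")

embeddings = model.encode(["The weather is lovely today.", "It's so sunny outside!"])
print(embeddings.shape)
```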
@tomaarsen
tomaarsen
5 months
‼️Sentence Transformers v3.0 is out! You can now train embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also released 50+ datasets to train on. Learn how to use the new Trainer here: Details in 🧵
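For reference, a minimal training sketch with the v3 Trainer (dataset and base model illustrative):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("microsoft/mpnet-base")
# Any (anchor, positive) pair dataset works with this loss
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train")
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```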
@tomaarsen
tomaarsen
7 months
Embedding Quantization is here! 25x speedup in retrieval; 32x reduction in memory usage; 4x reduction in disk space; 99.3% preservation of performance🤯 The sky is the limit. Read about it here: More info in 🧵
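A short sketch of the post-processing step, assuming the `quantize_embeddings` helper from `sentence_transformers.quantization`:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode([
    "How do I quantize embeddings?",
    "Binary and int8 embeddings are far cheaper to store and search.",
])

# Quantize float32 embeddings to binary or int8 for cheaper storage & faster retrieval
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
int8_embeddings = quantize_embeddings(embeddings, precision="int8")
```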
@tomaarsen
tomaarsen
8 months
🔥Sentence Transformers v2.4.0 is released! It introduces Matryoshka Embedding models (training & inference), 2 new state-of-the-art loss functions, prompt templates, instructor model support & more. See the🧵
@tomaarsen
tomaarsen
12 days
Model2Vec distills a fast model from a Sentence Transformer by passing its vocabulary through the model, reducing embedding dims via PCA and applying Zipf weighting. Inference with the resulting static embeddings is lightning-fast, e.g. 10k texts/sec: 🧵
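A sketch of that distillation flow with the `model2vec` package (base model and PCA dimensionality illustrative):

```python
# pip install model2vec
from model2vec.distill import distill

# Pass the vocabulary through the Sentence Transformer once, reduce dims via PCA,
# and apply Zipf weighting; the result is a fully static embedding model
static_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

embeddings = static_model.encode(["Static embeddings are lightning-fast."])
static_model.save_pretrained("bge-base-en-v1.5-m2v")
```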
@tomaarsen
tomaarsen
9 days
Label a few examples; train a small classifier; outperform LLMs 500x larger & slower for your classification task. Sounds difficult? It's not. Check out this blogpost using Argilla, distilabel, and SetFit to try it yourself:
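A rough sketch of the SetFit side of that workflow (texts, labels, and base model illustrative; the Argilla/distilabel labeling steps are omitted):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments

# A handful of labeled examples per class is often enough
train_dataset = Dataset.from_dict({
    "text": ["I loved it!", "Terrible service.", "Great value.", "Never again."],
    "label": [1, 0, 1, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
trainer = Trainer(model=model, args=TrainingArguments(num_epochs=1), train_dataset=train_dataset)
trainer.train()

predictions = model.predict(["Would recommend to a friend."])
```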
@tomaarsen
tomaarsen
11 days
🏎️ I've just finished uploading ~1000 ONNX models & ~100 OpenVINO models for a ton of Sentence Transformers models. Use them with: `SentenceTransformer("all-MiniLM-L6-v2", backend="onnx", model_kwargs={"file_name": "model_O4.onnx"})` for 1.3x to 2.5x speedups! 🧵
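If your model doesn't have an optimized ONNX file yet, v3.2 also ships an export helper; a sketch (output path illustrative, exact arguments may differ):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.backend import export_optimized_onnx_model

# Load with the ONNX backend, then export an O4-optimized ONNX file
model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
export_optimized_onnx_model(model, "O4", "all-MiniLM-L6-v2-local")

# Afterwards, load the optimized file directly, as in the tweet above
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    backend="onnx",
    model_kwargs={"file_name": "model_O4.onnx"},
)
```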
@tomaarsen
tomaarsen
2 months
Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Release notes: 🧵
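A sketch of the mining utility (dataset and thresholds illustrative; `mine_hard_negatives` lives in `sentence_transformers.util` as of v3.1):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("all-MiniLM-L6-v2")
# A (query, positive) pair dataset
dataset = load_dataset("sentence-transformers/natural-questions", split="train")

# Attach negatives that are similar to the query, but not as similar as the positive
dataset = mine_hard_negatives(dataset, model, num_negatives=5, margin=0.1)
```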
@tomaarsen
tomaarsen
1 month
🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Full release notes: More details in 🧵
@tomaarsen
tomaarsen
9 days
ONNX, Quantization, OpenVINO, Optimization, Lower precision, etc. are all rather confusing. Luckily, for Sentence Transformers you can just follow this flowchart based on my personal benchmarks. Docs:
@tomaarsen
tomaarsen
5 months
78.17% -> 93.40% accuracy by finetuning an embedding model on a small synthetic dataset with Sentence Transformers v3! Great work 👏
@vanstriendaniel
Daniel van Strien
5 months
Thanks to @tomaarsen's work in the latest Sentence Transformers release, training custom models is easier than ever. With improved training support and synthetic data for fine-tuning, you can build a model in less than a day. Example here👇:
@tomaarsen
tomaarsen
8 months
Biggest release of the week: 'mxbai-embed-large-v1' has just been released, a top embedding model outperforming all equivalently sized models such as 'bge-large-en-v1.5'. It's Apache-2.0 licensed, i.e. commercially viable. Model link: 🧵1/5
@tomaarsen
tomaarsen
17 days
Jack's excellent, state-of-the-art CDE-small-v1 is now compatible with Sentence Transformers with a really simple interface, check it out:
@jxmnop
jack morris
17 days
man, the huggingface team is *cracked* two days after i released my contextual embedding model, which has a pretty different API, @tomaarsen implemented CDE in sentence transformers you can already use it implementation was not at all trivial; those people just work fast
@tomaarsen
tomaarsen
8 days
As far as attention-free, static embedding solutions go, Model2Vec is the new absolute king, beating out BPEmb and GloVe on various domains. Learn how it works here: P.s.: Integrated in Sentence Transformers, LangChain, LlamaIndex, Haystack, etc.
@tomaarsen
tomaarsen
11 months
Thrilled to share that I've joined @huggingface 🤗 as a Machine Learning Engineer tasked with maintaining the awesome Sentence Transformers project! It's high time to bring modern training functionality to finetuning embedding models!⚡️
@tomaarsen
tomaarsen
7 months
GLiNER has new Apache 2.0 models for efficient, cheap and high quality information extraction.
> 2 English models: small (152M) and medium (195M)
> 1 Multilingual model (288M)
> 25x-50x smaller and faster than any 7b LLM
> 3 lines of code
Demo on CPU:
@tomaarsen
tomaarsen
9 months
The long-awaited Sentence Transformers v2.3.0 is now released! It contains a ton of bug fixes, performance improvements, loading custom models, more efficient loading, a new strong loss function & more! Check out the release notes: Or this 🧵below:
@tomaarsen
tomaarsen
7 months
Alongside our blogpost on Embedding Quantization, we released a useful demo showcasing that it allows for <0.1s retrieval across all of Wikipedia (41 million texts) while using 32x less memory than normal retrieval. E.g. <10GB memory rather than 160GB 💸
@tomaarsen
tomaarsen
7 months
Big update for the Massive Text Embedding Benchmark (MTEB) intended to simplify finding a good embedding model! Model filtering, search, memory usage, model size in parameters. The updated leaderboard: Details in 🧵:
@tomaarsen
tomaarsen
5 months
@1littlecoder My take on this
@tomaarsen
tomaarsen
5 months
Synthetic data used to improve an embedding model from 78.1% accuracy -> 93.4% accuracy with Sentence Transformers finetuning! Learn how to do this yourself ⏬ Great work, Daniel!
@vanstriendaniel
Daniel van Strien
5 months
Do you need a dataset to train a custom sentence transformer model? I've created a pipeline for using an LLM to create a synthetic dataset you can directly use for fine-tuning/training a Sentence Transformers model. *Link in next tweet
@tomaarsen
tomaarsen
1 month
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data. Release notes: More info in 🧵
@tomaarsen
tomaarsen
17 days
Sentence Transformers just crossed ⭐15k stars on GitHub! Great timing too, we're about to ship a huge update: 2 new backends for computing embeddings via ONNX and OpenVINO, natively in Sentence Transformers. Expect faster inference very soon 🏎️!
@tomaarsen
tomaarsen
8 months
3 state-of-the-art open text reranker models were just released: fully Apache 2.0 & outperforming current top models such as bge-reranker-large and cohere-embed-v3 on BEIR datasets. All models are ready to use on the @huggingface Hub! More details in 🧵
@tomaarsen
tomaarsen
1 month
Jina AI has just released jina-embeddings-v3, a multilingual text embedding model optimized for a wide range of NLP tasks! It ranks very strongly on MTEB ( #2 across all tasks for <1B param models). Model link🤗: Details in 🧵
@tomaarsen
tomaarsen
6 months
2 new tiny Apache 2.0 reranker models just got released by @JinaAI_. Despite their small size/latency, they perform competitively on benchmarks, reportedly outperforming bge-reranker-base and mxbai-rerank-base on MTEB Retrieval. Models: Details in 🧵
@tomaarsen
tomaarsen
6 months
Snowflake has just shaken up the MTEB Retrieval leaderboard with 5 new model releases:
> 23, 33, 110, 137 & 335M parameters
> Apache 2.0 license
> SOTA performance for their sizes/speeds
> 512 & 8192 seq length
Technical report coming soon. Model link:
@tomaarsen
tomaarsen
26 days
We just crossed 9000 public Sentence Transformer models on the @huggingface Hub! Big ups to everyone finetuning their own models 🤗
@tomaarsen
tomaarsen
2 months
There are new embedding models for identifying instances where LLM instructions and responses don't align, by @mrm8488. They're useful for quality assurance, LLM training dataset filtering, retrieval, reward modeling in RLHF, etc. @huggingface: Details in 🧵
@tomaarsen
tomaarsen
5 months
Sentence Transformers just reached 14k stars! Just in time for the upcoming v3.0 update 👀 It'll be the biggest update since the inception of the project. More details coming soon!
@tomaarsen
tomaarsen
6 months
Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Release notes: Or read the details in 🧵
@tomaarsen
tomaarsen
5 months
Just published Sentence Transformers v3.0.1: the first patch release since v3 from last week. It introduces gradient checkpointing, pushing model checkpoints to Hugging Face while training, model card improvements and fixes. Release notes: Details in 🧵
@tomaarsen
tomaarsen
11 months
📈 There are almost 4,000 open source Sentence Transformers models on the Hugging Face Hub right now! Open source for the win ❤️
@tomaarsen
tomaarsen
3 months
Recently, @mixedbreadai and @deepset_ai collaborated on a SOTA German text embedding model, outperforming multilingual-e5-large and jina-embeddings-v2-base-de. Link: Details:
- 478M parameters: small enough to run on CPU and GPU 🧵
@tomaarsen
tomaarsen
7 months
The last Sentence Transformers release introduced GISTEmbedLoss by @avsolatorio, which allows for training models that outperform those trained by the wildly popular in-batch negatives loss (MultipleNegativesRankingLoss). Learn about it in this 🧵 (links at the bottom):
@tomaarsen
tomaarsen
4 months
Absolutely loving the flexibility of Sentence Transformers v3 for training embedding models - allows for much easier paper reproductions.
@tomaarsen
tomaarsen
17 days
@jxmnop I'm glad I could showcase the flexibility of the Custom Modules from the latest Sentence Transformers v3.1 update here! The final API is really simple; I'm quite happy with it.
@tomaarsen
tomaarsen
11 months
🤗 The long-awaited full release of SetFit is finally out! SetFit v1.0.0 brings an all-new Trainer, TrainingArguments, logging, evaluation, integrations, callbacks, model cards, docs & more! 1/6
@tomaarsen
tomaarsen
1 month
Nice! You can now link from older @huggingface models to their newer counterparts with a neat widget. As simple as adding the "new_version" metadata in the README.
@tomaarsen
tomaarsen
2 months
Carbon emissions are automatically included with Sentence Transformer training if you install codecarbon: `pip install codecarbon`. E.g. see
@AymericRoucher
Aymeric
2 months
New feature on the Hub! ☁️ Carbon emissions emitted during training now show up on the model card! (requires model authors to fill that info first) Hope it will prompt more people to show the carbon emissions of their model training! 🌍 Thanks a lot to the team who pushed
@tomaarsen
tomaarsen
8 months
Matryoshka embedding models can produce useful embeddings of various dimensions, which can heavily speed up downstream tasks like retrieval (e.g. for RAG). Check out our blogpost with all of the details:
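In practice, you truncate the embeddings to a smaller dimensionality; a sketch using the `truncate_dim` argument (added in a later Sentence Transformers release; model name illustrative):

```python
from sentence_transformers import SentenceTransformer

# Matryoshka models retain most of their quality at reduced dimensionality
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=256)
embeddings = model.encode(["Smaller embeddings, faster downstream retrieval."])
print(embeddings.shape)  # (1, 256) instead of the model's full dimensionality
```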
@tomaarsen
tomaarsen
5 months
You can now finetune embeddings models with Sentence Transformers & AutoTrain without writing any code! Works locally, in Google Colab, on any cloud or via Hugging Face Spaces. Check it out!
@abhi1thakur
abhishek
5 months
🚨 NEW TASK ALERT 🚨 AutoTrain now supports fine-tuning of sentence transformer models 💥 Now, you can improve and customize your RAG or retrieval models without writing a single line of code 🤗 ✅ Supports multiple types of sentence transformers training and finetuning ✅ CSV /
@tomaarsen
tomaarsen
4 months
I'm absolutely loving these new dataset size markers on @huggingface datasets 👏
@tomaarsen
tomaarsen
1 month
The 2000th public SetFit model was uploaded this morning! Awesome to see so many people enjoy SetFit for training efficient classifiers with very little training data 🤗
@tomaarsen
tomaarsen
6 months
@bo_wangbo @JinaAI_ Also, I would like to implement ColBERT training into Sentence Transformers (based on the HF Trainer with MultiGPU, bf16, callbacks + integrations, useful model cards, etc.), so I'm looking forward to your findings & promising loss functions there.
@tomaarsen
tomaarsen
16 days
I'm currently in the process of uploading over 1000 ONNX (normal, optimized, quantized) and OpenVINO models, across all models under 🧵
@tomaarsen
tomaarsen
15 days
Just crossed 2000 followers on here - many thanks! I'll be sure to keep you all up to date on the latest in the world of Sentence Transformers and SetFit, as well as on embedding models and information retrieval/search systems in general! Looking forward to 3k 🤗
@tomaarsen
tomaarsen
8 months
We also reduced the dependencies & made many more small changes. See the release notes for all of this information in more detail: I'm looking forward to seeing your models pop up on the Hub!🤗See you in the next release!
@tomaarsen
tomaarsen
11 months
SetFit is extremely well suited for zero-shot text classification, and often outperforms much larger (and slower) zero-shot models on the Hugging Face Hub! Check out our new how-to guide for zero-shot text classification here:
@tomaarsen
tomaarsen
16 days
I really think that finetuned Static Embeddings + a cross-encoder reranker can be a very solid solution for efficient search. More info on that soon! See a finetuned Static Embedding model here as an example: 🧵
@tomaarsen
tomaarsen
8 months
CoSENTLoss is a new drop-in replacement for the popular CosineSimilarityLoss that produces a stronger training signal. AnglELoss is a variant that uses a different similarity function to avoid vanishing gradients. See the loss docs for more info:
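Because it's a drop-in replacement, switching is a one-line change; a sketch (base model illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, AnglELoss

model = SentenceTransformer("microsoft/mpnet-base")

# Expects (sentence1, sentence2, similarity score) data, like CosineSimilarityLoss
loss = CoSENTLoss(model)
# AnglELoss: same idea, different similarity function to avoid vanishing gradients
# loss = AnglELoss(model)
```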
@tomaarsen
tomaarsen
11 days
Read more about how to speed up Sentence Transformer models with lower precision, ONNX, OpenVINO, etc. in the "Speeding up Inference" documentation:
@tomaarsen
tomaarsen
8 months
@mixedbreadai released a 🪆2D Matryoshka text embedding model. Such models have two notable properties: adaptable embedding size & adaptable layer count. This allows you to speed up both inference & all post-processing (e.g. retrieval). Model link: See 🧵
@tomaarsen
tomaarsen
5 months
An excellent training script for embedding models via the new Sentence Transformers v3 by @mrm8488. Try it out yourself!
@mrm8488
Manu Romero
5 months
🚀 Just out: Sentence-Transformers 3 is transforming the game! Kudos to @tomaarsen for the stellar update. 🌟 🔥 NEW FEATURE: Train your own Matryoshka embedding models! Want to dive in? I've set up a Colab notebook to get you started right away. Check it out and start creating
@tomaarsen
tomaarsen
16 days
1️⃣ ONNX Backend: This backend uses the ONNX Runtime to accelerate model inference on both CPU and GPU, reaching up to 1.4x-3x speedup depending on the precision. We also introduce 2 helper methods for optimizing and quantizing models for (much) faster inference.
@tomaarsen
tomaarsen
9 days
Read the blogpost below for a wonderful beginner-friendly introduction to Hugging Face Spaces to make your own AI/ML Web Apps!
@elsleightholm
Ellie Sleightholm
9 days
getting to deploy open source, machine learning demos on @huggingface spaces really is such a perk of my job if you want to learn how you can do this, i created a blog post for beginners :) ps it uses @Gradio
@tomaarsen
tomaarsen
12 days
I integrated it in Sentence Transformers, so you can save and load all Model2Vec models with Sentence Transformers directly, also via LangChain, LlamaIndex, Haystack, etc. ST Docs: ST Release notes: 🧵
@tomaarsen
tomaarsen
7 months
@huggingface @mixedbreadai The future of search is int8 & binary.
@tomaarsen
tomaarsen
7 months
The Massive Text Embedding Benchmark (MTEB) is being extended to become massively multilingual. Everyone is invited to contribute & co-author an upcoming publication. 📜 Details:
@KCEnevoldsen
Kenneth Enevoldsen
7 months
🚀 Exciting News! We're launching MMTEB, the Multilingual Massive Text Embedding Benchmark. A community initiative to make text embeddings more inclusive & diverse. Join us in expanding the coverage of NLP to a wide range of languages! 🌍 #MMTEB #NLP
@tomaarsen
tomaarsen
5 months
Since yesterday's Sentence Transformers v3.0 update, distributed training of embedding models (for RAG, retrieval, semantic similarity, etc.) is now a breeze. You can expect some serious speedups when scaling the number of GPUs. Usage in 🧵
@tomaarsen
tomaarsen
7 months
. @huggingface and @mixedbreadai announce embedding quantization: a post-processing technique for embeddings that results in massive cuts in costs for retrieval. E.g., rather than needing 200GB, we can search Wikipedia in this demo with just 5GB of RAM: 🧵
@tomaarsen
tomaarsen
11 months
The new SetFit v1.0.0 release also brings SetFitABSA: Few-Shot Aspect Based Sentiment Analysis! ABSA is like Sentiment Analysis, except it tells you which parts people were happy/unhappy about; it's extremely useful! Check out the blogpost: 1/3
@tomaarsen
tomaarsen
5 months
AutoTrain now supports finetuning embedding models using Sentence Transformers! In other words: embedding models for your data without having to write any code. Details: More in 🧵
@tomaarsen
tomaarsen
8 months
Recently, consensus has developed that larger sequence lengths result in notably worse embeddings, so this model uses a more reasonable 512. 🧵3/5
@tomaarsen
tomaarsen
8 months
@1littlecoder Embedding models are explicitly trained such that cosine similarity becomes a strong measure of semantic similarity. For all real-world embedding models, the findings of this paper do not apply at all. You can keep safely using cosine similarity.
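A tiny illustration of that in Sentence Transformers (model name illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["That movie was great!", "I really enjoyed that film."])

# For trained embedding models, cosine similarity tracks semantic similarity well
print(util.cos_sim(embeddings[0], embeddings[1]))
```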
@tomaarsen
tomaarsen
5 months
@n0riskn0r3ward @nvidia To not get you too excited: there are a few concerns with this model at this point. The 1st place is mostly due to its high score on classification (87.35 vs 81.49 for #2), which is because it scores unexpectedly high on a few of the datasets, notably EmotionClassification.
@tomaarsen
tomaarsen
5 months
@eugeneyan It seems rather challenging to access the underlying ClueWeb22 dataset (), but I would love love love to get this dataset on the Hub and in here:
@tomaarsen
tomaarsen
9 months
🔥By applying optimum-intel we can get a 3.5x increase in throughput for SetFit text classification models on CPUs. It applies quantization using the Intel Neural Compressor (INC), resulting in higher throughput on CPUs than with torch on GPUs. Notebook:
@tomaarsen
tomaarsen
16 days
Usage is as simple as `SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")`. Does your model not have an ONNX or OpenVINO file yet? It'll be autoexported for you. Thank me later 😉 And if you don't want to export it every time, just use "push_to_hub" or "save_pretrained". 🧵
@tomaarsen
tomaarsen
16 days
🔒 Another major new feature is Static Embeddings: think word embeddings like GloVe and word2vec, but modernized. Static Embeddings are bags of embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings without any neural networks. 🧵
@tomaarsen
tomaarsen
5 months
This v3.0 release has been the biggest in Sentence Transformers history (13k lines changed, 292 files updated), and I'm very excited to see it come to fruition. I'm very much looking forward to seeing your finetuned models on @huggingface 🧵
@tomaarsen
tomaarsen
8 months
We also support Prompt Templates now! Useful for those models that always need prompts before the text (e.g. "query: ..." or "Represent this sentence for searching relevant passages: "). Learn more about it here:
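A sketch of the prompt template API (model and prompt strings illustrative):

```python
from sentence_transformers import SentenceTransformer

# Prompts can be stored in the model configuration or passed at load time
model = SentenceTransformer(
    "intfloat/e5-base-v2",
    prompts={"query": "query: ", "passage": "passage: "},
)

query_embedding = model.encode("How do prompt templates work?", prompt_name="query")
passage_embedding = model.encode("The prompt is prepended to the input text.", prompt_name="passage")
```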
@tomaarsen
tomaarsen
16 days
2️⃣ Random initialization. This requires finetuning, but finetuning is extremely quick (e.g. 3 million pairs in 7 minutes). My final model was 6.6% worse than bge-base-en-v1.5, but 500x faster on CPU (yes: 500x. I'm talking 14000 sentences per second compared to 24). 🤯 🧵
@tomaarsen
tomaarsen
5 months
@jobergum My favourite part: The all-new automatically generated model cards:
@tomaarsen
tomaarsen
16 days
1️⃣ via Model2Vec, a new technique for distilling a Sentence Transformer model into static embeddings. Either via a pre-distilled model with `from_model2vec` or with `from_distillation` where you do the distillation yourself. It'll only take 5 seconds on GPU & 2 minutes on CPU 🧵
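A sketch of both routes via the `StaticEmbedding` module (model names and PCA dims illustrative):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Option A: load a pre-distilled Model2Vec model
static = StaticEmbedding.from_model2vec("minishlab/M2V_base_output")

# Option B: distill one yourself (seconds on GPU, ~2 minutes on CPU)
# static = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", pca_dims=256)

model = SentenceTransformer(modules=[static])
embeddings = model.encode(["Lightning-fast static embeddings."])
```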
@tomaarsen
tomaarsen
6 months
@orionweller @srchvrs @n0riskn0r3ward @spacemanidol @memray0 @Quantum_Stat @bo_wangbo Sentence Transformers v3 will completely overhaul the training loop. I'm sure it'll include just about anything you'd need, from Multi-GPU w. DDP, GradCache losses, W&B/Tensorboard integration, extensive model card generation, FA2 on various models (more coming soon), etc.
@tomaarsen
tomaarsen
6 months
@bo_wangbo @JinaAI_ Looking forward to those papers! Especially your plans for v3 & collecting parallel data. On the topic of the latter, I just reformatted 10 Parallel Sentence datasets for easy use with Sentence Transformers v3:
@tomaarsen
tomaarsen
5 months
@_philschmid Link to docs 🤗:
@tomaarsen
tomaarsen
5 months
5️⃣ Dataset Release To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models. Check them out here: 🧵
@tomaarsen
tomaarsen
1 month
I love this; roast your (or someone else's... 👀) @huggingface profile with this Space by @enzostvs:
@tomaarsen
tomaarsen
5 months
...
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more! 🧵
@tomaarsen
tomaarsen
7 months
🔍 Time for some sneak previews! Soon, models trained/finetuned with Sentence Transformers will automatically include detailed model cards! In this 🧵 I'll show what's included:
- Model Details, e.g. base model, sequence length, output dimensionality, training datasets, language.
@tomaarsen
tomaarsen
5 months
@huggingface And stay tuned for future updates, I've got big plans and plenty of motivation to make 'em happen. Check out the repository to keep up to date or to submit issues/feature requests/pull requests:
@tomaarsen
tomaarsen
8 months
Additionally, we now support the popular INSTRUCTOR models, such as . Check out the documentation on how to use these models:
@tomaarsen
tomaarsen
2 months
@bclavie @antoinelouis_ Looking to integrate this into Sentence Transformers in the long run, but it likely has lower priority than improving CrossEncoder training & first-party ColBERT.
@tomaarsen
tomaarsen
10 months
SetFit v1.0.2 is out to fix some v1.0 release bugs: incorrect model cards when using custom metrics, multi-output mixed with predict_proba, the "unique" sampler, and predicting polarities of gold aspect spans in SetFit ABSA models. Check the repo here:
@tomaarsen
tomaarsen
6 months
@jobergum Including ONNX export 😉it's all on the todo-list
@tomaarsen
tomaarsen
11 months
Extremely excited to have contributed this implementation! I really think Attention Sinks will be huge to bring constant memory & constant fluency to LLMs.
@tomaarsen
tomaarsen
7 months
Would you look at that, in the meantime @urchadeDS uploaded a third English model: You know models are fresh when they're still being created during the announcements😄
@tomaarsen
tomaarsen
9 months
📉 A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows normal hardware to get performance that used to only be viable on multi-gpu clusters. 🐎 Community Detection is now much faster (7x speedup at 500k sentences 🤯)
@tomaarsen
tomaarsen
11 days
What should I prioritize next? Cross-Encoder (i.e. reranker) training? Cross-Encoder with ONNX/OpenVINO backends? LoRA training?
@tomaarsen
tomaarsen
16 days
2️⃣ OpenVINO Backend: This backend uses Intel's OpenVINO instead, outperforming ONNX in some situations on CPU. For CPU, the ONNX int8 quantized model can reach upwards of a 3x speedup compared to the default fp32! 🏎️ 🧵
@tomaarsen
tomaarsen
7 months
@iamrobotbear @ClementDelangue They do, the new models (v2.1) are all Apache 2.0, which allows for commercial use: - - - -