A mathematician dabbling in the world of data science. Researcher at the Tutte Institute for Mathematics and Computing. UMAP, HDBSCAN, PyNNDescent. He / Him.
Our paper on UMAP, a faster alternative to t-SNE, is now up on arXiv! The paper provides a more detailed account of the theoretical underpinnings of the algorithm, as well as performance benchmarks.
The first release candidate for UMAP 0.4 is out, providing lots of new features, including performance improvements, embedding to different manifolds, inverse transforms, and plotting tools.
The latest version of umap-learn is now out. Version 0.5 includes some major new features, including ParametricUMAP, DensMAP, AlignedUMAP, model composition, and model updating. Thank you to everyone who contributed! 1/14
Understanding UMAP - an interactive introduction to the algorithm and how to use (and mis-use) it from
@_coenen
and
@adamrpearce
. A must read for anyone interested in dimension reduction.
UMAP 0.4 is now out! It includes a host of new features, including plotting support, better sparse data support, inverse transforms, and embedding to non-Euclidean manifolds.
pip install umap-learn
See this thread for some of the new features:
An updated and significantly expanded version of our UMAP paper is now on arXiv:
More explanation, more detailed algorithm descriptions, and new experiments looking at stability and at working directly with high dimensional data -- as high as 1.8 million dimensions!
UMAP version 0.3 is now available. You can now add new data to an existing embedding, embed using labelled data, or use both features for metric learning. Documentation is on readthedocs: .
Ever needed a few more colours than the standard colour cycle for your plot? Ever wanted a categorical colour palette based around your own custom colours? With glasbey you can create and extend custom categorical colour palettes with ease.🧵
The new numba based version of UMAP is out. Now faster than ever, it takes only 2.5 minutes to embed the full 70000 points of the 784-dimensional "Fashion MNIST" dataset.
Pynndescent, an approximate nearest neighbor search library, got a major update recently. Index construction is now multicore by default. Querying is now much faster -- competitive with some of the fastest ANN libraries around.
(1/4)
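The core NN-descent idea behind PyNNDescent, refining a random k-NN graph by repeatedly checking reverse neighbours and neighbours-of-neighbours, can be sketched in plain Python. This is a toy illustration only; the actual library adds random projection tree initialization, candidate sampling, and numba compilation:

```python
import math
import random

def nn_descent(points, k, iters=15, seed=0):
    """Toy NN-descent: start from a random k-NN graph, then repeatedly
    add each point's reverse neighbours and neighbours-of-neighbours as
    candidates, keeping only the k closest."""
    rng = random.Random(seed)
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    # random initial neighbour lists, sorted by distance
    graph = {
        i: sorted(rng.sample([j for j in range(n) if j != i], k),
                  key=lambda j: dist(i, j))
        for i in range(n)
    }
    for _ in range(iters):
        updated = False
        for i in range(n):
            candidates = set(graph[i])
            for j in range(n):            # reverse neighbours (toy O(n^2) scan)
                if i in graph[j]:
                    candidates.add(j)
            for nb in list(candidates):   # neighbours of neighbours
                candidates.update(graph[nb])
            candidates.discard(i)
            best = sorted(candidates, key=lambda j: dist(i, j))[:k]
            if best != graph[i]:
                graph[i] = best
                updated = True
        if not updated:                   # converged: no list changed
            break
    return graph
```

Since the current neighbours are always among the candidates, each point's neighbour list only ever improves, which is why the refinement converges quickly in practice.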
A new round of Approximate Nearest Neighbour search benchmarking by is out, including lots of new libraries and algorithms.
It is good to see PyNNDescent still performing very well.
My talk at PyData NYC on dimension reduction is now available. Hopefully it provides a useful basic taxonomy to help people navigate the vast zoo of dimension reduction techniques.
A new release of DataMapPlot adds the ability to place labels on top of the map for a word-cloud style look. As usual, there remain lots of options to fine-tune and customize to your needs.
This is some amazing work from
@tim_sainburg
. Some major takeaways:
- lightning fast transform/inverse_transform operations (comparable to PCA if you have a GPU);
- semi-supervised classification: 97.8% accuracy on MNIST with only 4 labelled items per class!
New paper "Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning" with
@leland_mcinnes
and
@TqGentner
! 1/
Have you been frustrated that HDBSCAN doesn't use all your cores, or is too slow? Fast-hdbscan is a numba based version of HDBSCAN that can use all your cores and significantly outperform the hdbscan python package for low-d Euclidean data.
If you have GPU resources handy the new HDBSCAN implementation in
@RAPIDSai
cuML is amazingly fast. You can get to millions of points clustered in only a few minutes!
If you could get a clustering algorithm and library specifically designed for fast clustering of embedding vectors (CLIP, sentence-transformers, Cohere-embed, etc.), what features would you most want it to have?
Playing with some nlp related tools I've been working on, I ended up with some nice visualizations. This is Top2Vec style topic words on a UMAP layout of 20-newsgroups document vectors using masked word-clouds for each newsgroup.
In collaboration with Google, we're releasing Activation Atlases: a new technique for visualizing what interactions between neurons can represent.
💻Blog:
📝Paper:
🔤Code:
🗺️Demo:
A great example of what UMAP is for: look at your data and realise it wasn't what you thought -- and then use it to ask better questions about your data before proceeding with fancier ML tools.
It was only when we visualized the UMAP that we got suspicious: the representations of all IDRs split into two big blobs. That's when we decided to interpret the features, and then we realized: half the features had a big "M" capturing the start methionine.
My talk on topological data analysis at ML Prague is already online! It provides a brief whirlwind tour of why topological methods matter for unsupervised learning problems.
#mlprague
2D UMAP of a 3D woolly mammoth, to build intuitions about how features are preserved in dimensionality reduction. Wonderful 3D scan from the people at
@3D_Digi_Si
.
Hypergraphs and simplicial complexes are going to become ever more prevalent. Here's a great article on some of the reasons why they are so interesting.
I'm considering dropping python 2.7 support for hdbscan and umap-learn. Let me know if this would be extremely painful for you. Also let me know if this would make you happy.
I really want to emphasize how amazing
@numba_jit
is. Pynndescent is pure python code relying on numba for acceleration. It is performance competitive with *highly optimized* C++ code. I still can't actually believe how incredibly well numba works!
Suppose UMAP could represent data not as 2d points, but as 2d gaussians with a full covariance matrix. Would that be useful? What would be the best way to represent that visually?
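One natural visual answer is to draw each point's n-sigma covariance ellipse. A sketch of the standard 2x2 eigendecomposition math for that (a hypothetical helper for illustration, not part of UMAP):

```python
import math

def covariance_ellipse(sxx, sxy, syy, n_std=2.0):
    """Semi-axis lengths and rotation angle (radians) of the n-sigma
    ellipse for the covariance matrix [[sxx, sxy], [sxy, syy]]."""
    tr = sxx + syy
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc   # eigenvalues, lam1 >= lam2
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # major-axis orientation
    return n_std * math.sqrt(lam1), n_std * math.sqrt(max(lam2, 0.0)), angle
```

The returned axes and angle can be fed straight into any plotting library's ellipse primitive.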
I have been revisiting pynndescent recently, and with help from the
@numba_jit
team I managed to get some significant performance gains. Preliminary tests on
@fulhack
's ann-benchmarks is looking very promising. Hopefully I'll have a new 0.5 release with these changes out soon.
🚀 Cohere Embed V3 - int8 & binary Support 🚀
I'm excited to launch our native support for int8 & binary embeddings for Cohere Embed V3.
They slash your vector DB cost 4x - 32x while keeping 95% - 100% of the search quality.
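The arithmetic behind the savings: int8 stores one signed byte per dimension instead of a 4-byte float, hence 4x. A minimal scalar quantizer as a sketch of the idea (illustrative only, not Cohere's actual quantization scheme):

```python
def quantize_int8(vec):
    """Scalar-quantize a float vector to values in [-127, 127] with a
    per-vector scale factor."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # guard the zero vector
    return [round(x / scale) for x in vec], scale

def dot_int8(qa, sa, qb, sb):
    """Approximate the original float dot product from quantized vectors."""
    return sa * sb * sum(a * b for a, b in zip(qa, qb))
```

Binary quantization pushes the same idea further, keeping only one sign bit per dimension for a 32x reduction.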
Plots not meta enough? Here is a nice UMAP plot of different plots.
From "Viral Visualizations: How Coronavirus Skeptics Use Orthodox Data Practices to Promote Unorthodox Science Online"
Support for an "inverse transform" has been added to UMAP 0.4, providing the ability to generate a high dimensional representation of a point in the embedding space.
AlignedUMAP allows sequences of different UMAP embeddings to be aligned with each other according to relations among the datasets. This can be particularly useful for situations such as time-evolving data. 7/14
An upcoming feature currently in the 0.5dev branch of UMAP will make this much easier to do. e.g.
mapper1 = umap.UMAP(metric="euclidean").fit(continuous_data)
mapper2 = umap.UMAP(metric="dice").fit(discrete_data)
consensus_mapper = mapper1 * mapper2
A paper in
@JOSS_TheOJ
for the UMAP software implementation is now published: .
Thanks to the editors (
@arokem
) and reviewers (
@TerryTangYuan
) for providing such a smooth process for publication.
@DrPattiJones
PCA provides a global linear projection onto the hyperplane defined by the directions of global maximal variance in your data. UMAP attempts to stitch together many local views of the data, each accounting for local variance, into an intermediate structure, and then represent that structure in low dimensions.
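To make the PCA half of the contrast concrete, here is the closed-form first principal component for 2-D data, i.e. the direction of global maximal variance (a small illustrative helper, not library code):

```python
import math

def principal_direction(points):
    """First principal component of 2-D data: the unit eigenvector for the
    largest eigenvalue of the 2x2 covariance matrix, in closed form."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2))
    if sxy:
        v = (lam - syy, sxy)              # eigenvector for eigenvalue lam
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm)
```

Projecting onto this single global direction is exactly what UMAP does *not* do: it builds many such local views and glues them together instead.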
Inspired by the t-SNE animation from
@ChaseClarkatUIC
I decided to try something similar for UMAP. Here is an animation for varying values of the n_neighbors parameter. Increasing values give more weight to global structure over local structure.
UMAP now has 1,000 github stars! Thanks to all the users and contributors! There are more features coming in version 0.3 soon, and some exciting ones in very early development.
@ch402
@SuhnyllaKler
@AnthropicAI
An example of current work: is linear optimal transport applied to word vectors a decent sentence/document embedding model? It turns out yes, yes it is.
There's still a long way to go to scale and benchmark on larger datasets, but it's promising.
A new minor release of umap-learn adds some very useful features:
- Updating ParametricUMAP to Keras 3 (kindly contributed by
@fchollet
);
- Initial support for binary embedding vectors with metric="bit_hamming" and metric="bit_jaccard".
Out now, RAPIDS release 21.06! New
#cuML
and
#cuGraph
algorithms, new list functionality, a whole new way to measure
@RAPIDSai
progress with the change to CalVer, and much more!
@F_Vaggi
@leland_mcinnes
FIt-SNE uses an O(N) interpolation scheme to accelerate the computation of the gradient at each step. More details are available in the preprint () or some notes I wrote ()
I belatedly got to experimenting with FIt-SNE from
@GCLinderman
. It's very impressive and very fast -- definitely the implementation you should be using if you want to use t-SNE for visualization.
The ambient coordinates of your data (coming from features) need not be related to the intrinsic notion of distance internal to the data itself. An idea worth wrapping your head around.
Checkout Etienne Becht's bioRxiv preprint that compares UMAP with t-SNE for visualizing CyTOF and scRNAseq data. Many advantages of UMAP over t-SNE for high dimensional single-cell data!
@leland_mcinnes
Documentation for UMAP 0.4 now includes examples of UMAP usage for visualization, exploratory analysis, and scientific publications. If you have a compelling use case, we would love to include it as well.
This was a fantastic series of posts! If you want a well-written intro to some of the ideas in topological data analysis, this is a great place to start.
@asemic_horizon
@scikit_tda
@leland_mcinnes
I wrote a series of posts leading up to some TDA (see "Topology" section here: ) And then a few posts in the TDA family before I lost steam (see Computational Topology section of )
It is a huge testament to the power of
@numba_jit
that a pure python library like PyNNDescent can be performance competitive with C++ libraries from Google (ScaNN), Microsoft (DiskANN), and Facebook (FAISS) among others.
Many, many thanks to the whole
@numba_jit
team!
@EmilyTWinn13
@SC_Griffith
After the flood Noah is checking up on the animals. They're all breeding well, except for a pair of snakes. Noah gets a little worried and follows them. Eventually they find a fallen tree, and suddenly ... lots of baby snakes. It turns out that adders need logs to multiply.
@rctatman
Here's a plan we use: Take the term-frequency matrix, remove the "expected" frequency (by subtracting, or using the column marginal as a noise model), UMAP with hellinger distance, and HDBSCAN for clustering. Still fine tuning the process, but has been very powerful so far.
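For reference, the Hellinger distance used in that pipeline, applied to two normalized frequency rows, is just this (a minimal stdlib version):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, e.g. two
    normalized rows of a term-frequency matrix. Ranges from 0 to 1."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))
```

It behaves like a Euclidean distance on square-rooted frequencies, which is why it works well as a UMAP metric for count data.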
An amazing introduction to UMAP and its parameters. This is for UMAP what the Distill article was for t-SNE. Great work from
@_coenen
and
@adamrpearce
as always!
@michaelhoffman
Many of the t-SNE (and UMAP) plots I see suffer from potential over-plotting issues. This is particularly dangerous if you are trying to eyeball cluster purity. Using such plots as a starting point for further analysis rather than an endpoint is critical.
This is a fascinating paper -- using a contrastive approach on augmentations of images to learn a low dimensional representation they generate truly impressive results for image datasets!
Ever wondered what image datasets look like if they could be visualized? We have developed a new algorithm for visualization based on contrastive learning. Joint work with
@hippopedoid
and
@CellTypist
. The full details are available as a preprint 🧵/16
I've started telling people "Look at your data, because whatever you think you know about the data is almost certainly wrong". I'm not sure it works any better, but at least I warned them...
“Have you tried looking at the data?” is my most common question when talking to folks who are inexperienced with data. Over the last two years, about 90% of the time, the answer has been, “Why?” or “What good would that do?” 🙄
I'll be speaking at the Fields Institute today on using UMAP theory for general unsupervised learning. I'll be happy to chat more about these ideas afterwards as well.
I will be co-chairing the machine learning track at SciPy this year. Submissions are open, so if you have a machine learning project in python consider submitting. This is a great opportunity to share your work with a wide audience.
@SciPyConf
The core neighbor search in UMAP has been expanded upon in a separate library, PyNNDescent, which provides significantly improved performance. Combined with PyNNDescent, UMAP 0.4 now supports multi-core computation end-to-end (MNIST in ~45s on a laptop).