The more papers I read for a review article I'm writing about ML pitfalls in genomics, the more my faith is shaken in the results from papers that apply machine learning to methylation arrays. A salty thread. 1/
@TheBcellArtist
They probably go places that will pay them appropriately for their skills. Paying post-docs under $70k is common but obscene in most fields, given how critical they are.
Selecting features using all data before splitting into folds for training/testing is a big source of train-test leakage. To demonstrate, I generated random data and labels, selected down to 25 features, and trained a model. Performance was much better than random due to the leakage.
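A minimal sketch of that demonstration, assuming sklearn (the data sizes here are illustrative, not the exact ones from my experiment):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))  # purely random features
y = rng.integers(0, 2, size=100)    # purely random labels

# WRONG: select the 25 "best" features on ALL the data, then evaluate.
# The selector has already seen the test folds, so information leaks.
X_sel = SelectKBest(f_classif, k=25).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y, cv=5).mean()
print(f"accuracy with leakage: {leaky:.2f}")  # well above the 0.5 chance level
```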
The more I use classic bioinformatic tools, e.g. bwa and vcftools, the more I dislike current trends in bioinformatic tooling; pipelines are nice but if I want to test out your method the first step shouldn't be "set up a Terra/GCP/AWS account."
It's frustrating reading comp bio articles these days because many keep falling into the same pitfalls. Hard to know if the method actually works, or whether they messed up the evaluation. Here are some issues I've seen recently (w/o names):
Thrilled to announce that I'll be joining the incredible researchers at
@IMPvienna
for a year as a visiting scientist and then joining
@UMassChan
as an assistant professor in Genomics+CompBio in 2025!
At both places, I'll be continuing my work on deep learning + genomics.
Why are you confused? There's just genes. And alternate splicing. And regulatory elements. And regulatory elements in the alternate splicing. And regulatory elements are transcribed. And RNAs can do things. And proteins can fold differently in different cell types. And...
@CT_Bergstrom
This entire time I knew in the back of my mind that you were a person but, because I've only seen you on Twitter, I just assumed you were a benevolent bird sharing your vast knowledge of biology with us. Illusion shattered by the picture in this article. :(
Me, a former sklearn dev, hiding under the bed:
Armed robber: ...
Me: ...
Armed robber: ....
Me: ....
Armed robber: Logistic regression shouldn't have a default L2 regularization of 1
Me: *still hides*
@naomirwolf
@BillGates
As a researcher at U of Washington, I remember when
@BillGates
walked into my lab and said "Stop working on this, we must work on vaccine microchips!" and we dropped all our grant-funded work immediately. We would've gotten away with it too, if you didn't point it out on Twitter.
CS/ML people venturing into biology frequently assume that the data they're given is clean and that all the upstream processing steps have been figured out. This is absolutely not the case.
I would encourage CS/ML people to really look into the gritty details like this.
Sequence-based ML methods (Enformer, ChromBPNet...) are invaluable in genomics but the ecosystem for their *use* after training is less developed.
Introducing `tangermeme`: a PyTorch library for genomics discovery for everything-other-than-the-model. 1/
Finally out in
@NatureRevGenet
: Navigating the pitfalls of applying machine learning in genomics! w/
@seawhalen
et al.
Our key point: you MUST evaluate your models in the same setting you want them to be used or they might not actually work in practice.
PSA: There are no such things as "enhancers," "promoters," and "silencers." There are only TF binding sites and those TF's effects on the steps of transcription and degradation.
I regularly hear people in ML+genomics complain that they're running out of memory or disk space. Frequently, the culprit is inefficient handling of RNA/DNA sequence and you can make big gains in compression with a few tricks. 1/
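One such trick, as a hypothetical sketch (real code also needs to handle N's and other ambiguity codes separately): DNA has a four-letter alphabet, so each base needs only 2 bits, not the 8 bits of an ASCII character.

```python
import numpy as np

BASE_TO_CODE = {ord("A"): 0, ord("C"): 1, ord("G"): 2, ord("T"): 3}

def pack_2bit(seq: bytes) -> np.ndarray:
    """Pack four bases into each byte for a 4x reduction over ASCII."""
    codes = np.array([BASE_TO_CODE[b] for b in seq], dtype=np.uint8)
    codes = np.pad(codes, (0, (-len(codes)) % 4))  # pad to a multiple of 4
    codes = codes.reshape(-1, 4)
    return (codes[:, 0] << 6) | (codes[:, 1] << 4) | (codes[:, 2] << 2) | codes[:, 3]

packed = pack_2bit(b"ACGTACGA")
print(len(packed))  # 8 bases stored in 2 bytes
```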
A bit ago, I got a grant from
@NumFOCUS
to rewrite pomegranate from the ground up using a PyTorch backend. The goal was to increase speed, decrease code size, and decrease the barrier to writing custom components or integrating w PyTorch. The results have been incredible so far.
Found out last night that
@NumFOCUS
funded my proposal to rewrite
#pomegranate
from the ground up using
@PyTorch
as the backend! Need to train massive HMMs using multiple GPUs, or want a mixture of negative binomials as part of your neural network? Watch this space!
Here's another genomics ML pitfall: account for fragment length when modeling multiple genomics experiments! If you don't, your predictions will probably look a little bit... off... even though the model is correct! Why? A thread: 🧵 1/
pomegranate v1.0.0 has been released! This major release is a complete rewrite using
@PyTorch
to replace the Cython backend.
Same great probabilistic models, now WAY faster, GPU support, fewer installation issues, and easier to extend.
Check it out! 1/
It's been a while since my last pitfalls-in-genomics thread, but here's a new one: YOU MUST ACCOUNT FOR READ DEPTH in single-cell experiments.
Why?
Because read depth will likely be confounded by CELL IDENTITY in ways that can induce leakage in downstream ML methods.
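A toy illustration of the confound (hypothetical numbers; sklearn assumed): if one cell type happens to be sequenced more deeply, read depth alone classifies cell identity, and any depth-sensitive feature smuggles that label into your model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical: cell type 0 was sequenced much deeper than cell type 1
depth = np.concatenate([rng.poisson(5000, 200), rng.poisson(2000, 200)])
cell_type = np.repeat([0, 1], 200)

# Read depth by itself predicts cell identity almost perfectly
X = (depth / 1000.0).reshape(-1, 1)
acc = cross_val_score(LogisticRegression(), X, cell_type, cv=5).mean()
print(f"{acc:.2f}")
```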
This fiasco is exactly why I read ML papers in genomics with such a critical eye, and try to write about pitfalls as much as I can.
Genomics data is COMPLICATED and ML methods are eager to please. It's easy to mess up, and when you do, you'll appear to get good performance.
And then they asked, can we correctly classify these cancers based on zero raw data? And of course, the answer was yes - all the classification power is derived from the idiosyncratic zero-to-something normalization enacted by Voom-SNM, and none from the actual raw data. 24/
A flaw I'm seeing in a lot of papers is that they think that "cross-validation" gives you permission to perform architecture search on the test set. If the cross-validation involves the entire data set and you choose models based on best performance on it, you're making an error.
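The honest version keeps a loop the search never touches. A sketch with nested cross-validation (sklearn assumed; sizes illustrative): the inner loop picks the model, the outer loop scores it on folds the search never saw, so random data scores near chance, as it should.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))  # pure noise
y = rng.integers(0, 2, size=100)

# Inner CV tunes C; outer CV only ever evaluates the tuned model
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
outer = cross_val_score(inner, X, y, cv=5).mean()
print(f"nested-CV accuracy: {outer:.2f}")
```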
a convo I had before grad school
me: does a phd make you feel like an expert on a topic?
phd: the opposite
me: do you feel productive while doing research?
phd: the opposite
me: do you at least get paid well for all the stress?
phd: the opposite
me: sign me up
As a casual reminder to reviewers and authors: if you are working on a biology task and you use random cross-validation, you are making a mistake. It's truly disheartening to review a paper and see this because you have no idea just how distorted the results are.
Computational biology is becoming the same thing. So many papers and talks I see recently are benchmark-driven, not science-driven.
Uncovering something scientifically interesting is seen as an optional final step if you want to get into a top journal, not a key motivation.
Counterpoint: if you joined NLP recently, you might think that language understanding is about beating benchmarks, rather than converting syntactic strings to meanings (or vice versa)
In the short-term, you might think that’s good
But hallucinations may well get you in the end
At the beginning of 2018 at an
@ENCODE_NIH
meeting, the idea for the ENCODE Imputation Challenge was born: an open contest to predict genome-wide genomics experiments given fixed train/test sets and encourage development of large-scale imputation methods.
warning: drama 🧵
1/
Ready for a new "ML pitfalls in genomics w/ Jacob"? When evaluating ML models across cell types/individuals, you MUST baseline against the avg activity or risk being fooled by seemingly good performance. Thrilled to finally see this quick read out! 1/
Happy to share new work on a pitfall you can fall into if you train ML models to predict across cell types. TL;DR, always compare your predictions to the per-locus average activity, it's a hard baseline to beat!
@uwescience
@uwgenome
@uwcse
@EncodeDCC
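A toy version of why the baseline is so strong (entirely simulated numbers): when a shared per-locus component dominates cell-type-specific signal, the cross-cell-type average alone correlates highly with any held-out cell type.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated signal: loci x cell types, dominated by a shared component
shared = rng.normal(size=(1000, 1))
signal = shared + 0.3 * rng.normal(size=(1000, 20))

# "Predict" cell type 0 using the mean over the other 19 cell types
target = signal[:, 0]
baseline = signal[:, 1:].mean(axis=1)
r = np.corrcoef(target, baseline)[0, 1]
print(f"average-activity baseline correlation: {r:.2f}")
```

A model has to beat this number before its cross-cell-type predictions mean anything.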
After several months of work, I'm excited to announce the first release of torchegranate, my
@PyTorch
rewrite of pomegranate!
torchegranate is faster, more readable, better tested, and easy to extend.
Try it out with `pip install torchegranate`! 1/
The first fruit of my post-doc is finally dropping: Yuzu! Yuzu speeds up in-silico saturation mutagenesis by over an order of magnitude on many common architectures, for both protein and DNA inputs, using principles of compressed sensing. 1/
🌠paper🌠:
A canonical mistake you can make when performing machine learning involves performing data preprocessing outside of cross-validation. This involves applying transformations or feature selections before splitting into a train/test split. 2/
An unfortunate trend I'm seeing in comp genomics right now is submissions that treat simply adding complexity to a model as a meaningful contribution. To me, it doesn't matter how complex your model is; it matters how useful it is in practice or what you discover with it.
@michaelhoffman
No one will use your computational method outside your group, unless it's for basic data processing, so you better be prepared to do all the legwork of applying it all the way to scientific discovery because no one else will.
This "classic" editorial should be required reading for any new student trying to apply ML in genomics -- particularly, for those coming at it from a CS perspective. Be skeptical of your own performance measures!
Last week was my last at
@uwgenome
. Today, I start a post-doc with
@anshulkundaje
at
@Stanford
! When I took the position I imagined there would be more pomp and circumstance than logging out of one server and logging into another...
I've used this example in the past: consider ENTIRELY RANDOM data. What happens if you select the top features and then do cross-validation? You get better than random performance because the selected features coincidentally line up with the labels. 7/
Sometimes I feel like using
@numba_jit
is cheating. I was concerned that an analysis was taking too long, at ~40 minutes per file, so I just slightly rewrote and jitted the function and now it takes 7 seconds.
Seeing this mistake in scientific papers is bad enough but seeing it be subtly integrated into workflows means even more people will inadvertently make this mistake. If you are working with methylation arrays, please ensure you do probe selection only on the training set! 12/12
You've probably seen attribution tracks where the height of each letter is its "importance" to a predictive model and motifs pop out.
But the technical details behind how these are calculated can matter a lot -- and I'm worried many may be done incorrectly. 1/
Just released our preprint on apricot, a Python package implementing submodular selection for machine learning! It efficiently finds subsets of data that are representative of the whole space. Check it out!
@uwescience
@uwcse
Reading the Reddit thread about predictions for bioinformatics in 2040 () made me realize that I straight up ignore GO analyses in papers unless there's a very specific point being made (almost never). Do other people take them seriously?
I was always skeptical of single-cell data simulation methods because we still have lingering questions about what exactly the readout is (e.g., it's not a uniform sampling of active genes in a cell). Good to see work on it.
What's the point of comp bio models that can only make predictions for experiments that have already been performed (e.g. DeepSEA, Basset, Enformer, BPNet, etc)? In Rit's/my latest short review on ML in comp bio, we discuss! 1/8
Ledidi turns any predictive model (BPNet, DeepSEA, Enformer, AlphaFold...) into a biological sequence editor! After years, I released a new version with significant QoL improvements including.. being in PyTorch.
Try it out w/ `pip install ledidi`
In this episode of
@bioinfochat
, I interview
@lkpino
about the limits of mass spec measurements and how proteomic measurements can be integrated with genomic measurements. Every time I talk to her I always learn a ton!
What's wrong with this? Well, you're leaking information from your test set into your training set because you're selecting probes that, by construction, have large differences / perform well on both your training and test set. 6/
Time for another pitfall in genomics thread!
Normally, the output from a genomics experiment is a set of reads mapped to a reference genome. More reads = stronger signal.
But the total number of reads can confound machine learning analyses and statistical tests. 1/
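The simplest fix is to put samples on a common scale before comparing them. A sketch of counts-per-million normalization (one basic option; real analyses often need more, e.g. quantile normalization or treating depth as a covariate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical counts: 3 samples with very different total read depths
mean_depth = np.array([[50.0], [100.0], [400.0]])
counts = rng.poisson(mean_depth * np.ones((1, 1000)))

# Counts-per-million: divide by each sample's total, rescale to 1e6
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
print(cpm.sum(axis=1))  # every sample now sums to 1e6
```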
I think it says something about my experiences in academia (and I doubt I'm alone) that I'm shocked to get reviews back that, although they will require a lot of work to address, are generally supportive and provide constructive feedback.
@AcademicChatter
After 5 months of effort and giving up twice, I was finally able to reproduce TOMTOM. Lots of small details and a few bugs in the code...
On a large-scale task, TOMTOM is taking ~978s and my version with some basic speedups is taking ~1.2s.
Out soon!
It's always fun to fail to make basic connections about your data as a computational person.
me: so, this sample is labeled "healthy" but are we sure the person is healthy?
@anshulkundaje
: well, it's a heart sample, so they're dead
When doing grid search, why do you need to evaluate your final model on data other than the set you used to tune hyperparameters? Here's an example. Random data, labels, and predictions yield much better-than-random performance in a gridsearch-like evaluation.
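A sketch of that demonstration (sklearn assumed; sizes illustrative): on pure noise, the best score over a grid is a maximum over noisy estimates, so it drifts above the grid's average even though nothing is learnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))  # random data
y = rng.integers(0, 2, size=60)  # random labels

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": np.logspace(-3, 3, 25)}, cv=5)
grid.fit(X, y)

# best_score_ is a max over 25 noisy CV estimates: optimistically biased
avg = grid.cv_results_["mean_test_score"].mean()
print(f"grid average: {avg:.2f}, reported best: {grid.best_score_:.2f}")
```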
My
@uwcse
@uwescience
thesis is now online ()! Check it out if you want to learn about my work with Avocado, imputing >30k genomics experiments, and ordering future experiments. I also wrote a 2 page tl;dr overview:
In this week's "ML pitfalls w/ Jacob," we're going to talk about data set creation! Problem data sets occur in every field, but I frequently see them in genomics because people build their own data sets from new experimental data. 1/
When designing bioinformatics software with an eye toward the future, an important choice will be designing towards what hardware supports (GPUs with 192 GB of memory, for instance) vs. what most people using your software will have (laptop + depression).
Basically: production-intended pipelines should probably involve WDL/etc but be focused on internal use. For maximal external effect, your tool should take in standard file formats, run each step as a single command line w/ options, and output a standard format.
In our latest episode of
@bioinfochat
, we talk with
@Avsecz
about research in academia vs industry, Enformer, and deep learning libraries! Great to hear about the work directly from the source. Hope other people enjoy our conversation!
Added a super-fast one-hot encoding function to `tangermeme` last Friday, and I'm still surprised by how fast it is. Timings are encoding chr1.
for-loop: ~40s
numpy-vectorized: ~12s
new: ~1s
Thought I'd share some intuition for why it works so well.
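One way to get speedups of this kind (a sketch, not necessarily tangermeme's exact implementation): index a precomputed (256, 4) table by the raw byte values of the sequence, so the whole encoding is a single fancy-indexing gather with no per-base Python loop and no per-nucleotide equality broadcasts.

```python
import numpy as np

# Precompute a (256, 4) table indexed by raw byte value
table = np.zeros((256, 4), dtype=np.int8)
for i, base in enumerate(b"ACGT"):
    table[base, i] = 1

def one_hot(seq: bytes) -> np.ndarray:
    idx = np.frombuffer(seq, dtype=np.uint8)  # zero-copy view of the bytes
    return table[idx]                          # one gather: (len(seq), 4)

print(one_hot(b"ACGTN"))  # N maps to an all-zero row
```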
me: I trained a GAN using Avocado to generate fake imputations
advisor: okay, what questions can it help us answer?
me: ...
advisor: ...
me: ...
advisor: what questions ca-
me: it's named AvoGANo
advisor: ...
advisor: let's look into moving your graduation date up
(satire)
Ultimately, these papers are a symptom of a broken academic system. There is less value in spending time dissecting a system than there is in doing a surface-level analysis and moving on to the next thing, leaving a trail of bad tools that causes people to not trust anything.
Super excited to be joining the amazing team at
@JOSS_TheOJ
as a topic editor for bioinformatics and machine learning. If you wrote a great software tool that supported amazing research, write it up and send it my way! Good software deserves more recognition in research.
I'm shocked -- shocked! -- to find out that the department I interviewed at that emailed my advisor unsolicited critiques of my performance behind my back was unable to recruit this cycle.
@timrpeterson
The biggest problem I've seen in biotech is people who don't understand their data and lose years just learning bias. I'm not sure that getting rid of people with domain knowledge will solve this.
Glad to see that the
@numpy
review article is out! The package has had a massive effect on the adoption of Python and the development of the entire ecosystem.
pomegranate v0.9.0 released! The main focus was on adding missing value support for model fitting / structure learning / inference across all models. Read more about it here:
@uwescience
@uwcse
@NumFOCUS
As you increase the number of features or decrease the number of examples in your original data set, this problem becomes worse because there is a higher chance of seeing spurious correlations. Select features using your training set, not all your data!
When UMAP goes wrong. If you pass in similarities (where 1 means closest) rather than distances (where 0 means closest), you can get very artistic results as you smooth in the wrong direction.
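A quick sanity check and conversion before handing a precomputed matrix to UMAP (a minimal sketch, assuming similarities bounded by 1 like correlations): if the diagonal is all 1.0, you're holding a similarity, not a distance.

```python
import numpy as np

rng = np.random.default_rng(0)
sim = np.corrcoef(rng.normal(size=(50, 5)))  # similarity: diagonal is 1.0

# Distance conventions want a zero diagonal; for similarities bounded
# by 1, a simple flip fixes the direction of smoothing.
assert np.allclose(np.diag(sim), 1.0), "this looks like a similarity matrix"
dist = 1.0 - sim
print(dist.diagonal().max())  # ~0 up to floating point
```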
On Thursday (4:40am PST ugh) I'm giving a talk at
#ISMBEECCB2021
#MLCSB2021
on five pitfalls to avoid when applying ML to genomics data! Although conceptually simple, they can be extremely difficult to identify in practice if you don't know what to look for. 1/
After 6 years of challenges, setbacks, successes, and corgi viewings, I've scheduled my thesis defense. It always seemed so far away until suddenly it was here. I know that I wouldn't have made it without a support network.
@AcademicChatter
#AcademicChatter
How do I know this has to do with data preprocessing being outside the train/test split and not me actually secretly generating a data set with secret structure in it? Let's put the feature selection IN the CV. Performance plummets. 10/
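The fix, sketched with sklearn's Pipeline (sizes illustrative): make the selector a pipeline step, so it is refit on each training fold and never sees the held-out fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))  # random features
y = rng.integers(0, 2, size=100)    # random labels

# RIGHT: feature selection happens inside each CV fold
pipe = make_pipeline(SelectKBest(f_classif, k=25),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()
print(f"accuracy without leakage: {honest:.2f}")  # near the 0.5 chance level
```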
Once again I accidentally fed in a similarity matrix to UMAP instead of a distance matrix.
@leland_mcinnes
implemented the best warnings for when this happens---your plot looks like a creature whipping you for being wrong.
Proud to finally release Avocado! Avocado is a deep tensor factorization model that imputes epigenomic signal better than prev work, and the latent factors yield better ML models on genomics tasks than the data it was trained on.
@uwescience
@uwcse
Even "pull this docker container" is frustrating when I just want to test your approach. I get that for research work you want to ensure precise reproducibility and these tools might be the right choice, but it's more challenging to hack and learn when setup is an ordeal.
Regretting coming to
#RECOMB2022
. Most people not wearing masks, coughing and sneezing are near constant in the audience, someone I know already has gotten COVID. Who would feel safe sitting in the audience of this? Talks are good though.
very unfair how universities will reimburse conference expenses including food if you go "in-person" but won't reimburse this entire pizza i ate alone in bed while watching pre-recorded conference talks at midnight
The paper just dropped on
@biorxivpreprint
. Give it a read, and let us know what you think!
Our main point: doing genomics work correctly is HARD. Please don't just use data you find on the internet without knowing how it was processed. 18/
This also isn't a problem with supervised machine learning models. If you take your data, select down a smaller number of features (here going from 10k features to 200) even PCA will return distinct clusters. 9/
My
@SciPyConf
talk, "apricot: Submodular optimization for machine learning," is online! Learn about a principled way to reduce massive data sets down to representative subsets that are widely useful. Also,
#GossipGirl
.
Thanks
@uwescience
for support!
Excited and proud to receive the
@acm_bcb
2020 best paper award for my work on making zero-shot imputations across species! Like most work, this would not have been possible without my co-authors. Here is a thread summarizing the paper:
This paper proposed an approach for supplementing functional imputation models using human data when making imputations in other species, including making "zero-shot" imputations of assays performed in human but not in the other species. Here are four examples:
two months before deadline: i hate this paper
one month before deadline: i hate this paper
two weeks before deadline: i hate this paper
three days before deadline: this is actually really interesting lets come up with a thousand experiments we could have done
Roman and I just released a new
@bioinfochat
episode! () This time, we interview
@drklly
about
@calico
, Basenji, and how machine learning models can be used to help us understand the functional consequences of genetic variation.