Rapid protein evolution by few-shot learning with a protein language model
- EVOLVEpro, a PLM combined with active learning, enables rapid and efficient evolution of protein activities. Shows promising optimization of an antibody, a CRISPR nuclease, a prime editor, an integrase, and T7 RNA polymerase
If you're looking for labs working on protein design, we've put together a list based on our best knowledge. The initial list came from @Zuricho_zbzt, and with a bit of help from me, it's now up on GitHub. Feel free to suggest any labs we might have missed!
Deep learning guided design of dynamic proteins
- Design proteins with conformational changes, focusing on intra-domain reorientation of secondary structural elements
- Use systematic physics-based conformational sampling and Rosetta Design to create a library of alternative
CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments
Use an agent "to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing"
AlphaFold2 knows some protein folding principles
- Use AF2 without MSAs/templates, mimicking an ab initio approach. The iterations reveal AF2's energy landscape and a "local first, global later" folding mechanism.
- Folded intermediates of six small proteins (protein G, protein L and
Did AlphaFold solve the protein folding problem? ...Not yet! AF2 predicts static structures, usually the native state by default. However, we found AF2 can generate structures aligning well with known folding intermediates.
@Al__Perez
@UFChemistry
Simulating 500 million years of evolution with a language model |
@EvoscaleAI
- ESM3 (1.4B, 7B, and 98B), a multimodal protein language model on sequence, structure, and function tokens, using MLM for representation learning and generation.
- Uses an MLM objective with diverse
Training Compute-Optimal Protein Language Models
- Trained 300+ models with 3.5M to 10.7B parameters on 5 to 200B tokens, comparing CLM and MLM scaling behavior
- Compiled a dataset of 939M protein sequences (194B tokens) to address overfitting
- Observed a transfer phenomenon
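The CLM-vs-MLM comparison above boils down to fitting power laws to (compute, loss) points from the model sweep. A minimal numpy sketch of that fit — the data points here are made up for illustration, not from the paper:

```python
import numpy as np

# Hypothetical (training compute, validation loss) pairs from a model sweep.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.2, 2.9, 2.65, 2.45])

# Fit loss ≈ a * compute**b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
exponent = -b  # positive power-law exponent: loss falls as compute grows

predicted = a * compute ** b
```

Comparing the fitted exponents for CLM and MLM sweeps is one way to quantify their different scaling behaviors.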
"Deep Generative Models of Protein Structure Uncover Distant Relationships Across a Continuous Fold Space" has been revised
It "provides a sensitive approach to detect and thus explore distant protein relationships"
paper:
github:
A list of methods for tokenizing protein structures:
- Foldseek
- ProSST
- FoldToken 1 & 2
- Learning the Language of Protein Structure
Did I miss any works?
Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures
- Dynamic PDB contains 12.6K proteins subjected to 1 microsecond of molecular dynamics simulations with detailed physical properties such as
Antibody design using deep learning: from sequence and structure design to affinity maturation |
@BriefingBioinfo
"This survey highlights significant advancements in protein design and optimization, specifically focusing on antibodies. This includes various aspects such as
DeepEnzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3D-structures |
@BriefingBioinfo
- Combine both sequence and 3D structural features of proteins to improve enzyme turnover number (kcat) prediction accuracy
Accurate prediction of protein function using statistics-informed graph networks |
@NatureComms
- PhiGnet uses evolutionary couplings (EVCs) and residue communities (RCs) with dual-channel graph convolutional networks. Embeds sequences via ESM-1b
- Uses the Grad-CAM method to compute
ProteinCLIP: enhancing protein language models with natural language
- A CLIP-like model aligns embeddings from a protein language model with embeddings from a text language model describing protein functions
- Excels in PPI, Homology, and Mutation identification
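The CLIP-style alignment above trains both encoders so that matched (protein, text) pairs score higher than all in-batch mismatches. A minimal numpy sketch of the symmetric InfoNCE loss on precomputed embeddings — the embeddings here are random stand-ins, not real model outputs:

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(protein_emb, text_emb, temperature=0.07):
    # L2-normalize both sides, as in CLIP, so logits are scaled cosine similarities.
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = p @ t.T / temperature          # (B, B); matched pairs on the diagonal
    diag = np.arange(len(p))
    loss_p2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # protein -> text
    loss_t2p = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> protein
    return (loss_p2t + loss_t2p) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 64))
aligned = clip_loss(emb, emb + 0.01 * rng.normal(size=(8, 64)))  # near-matched pairs
random_pairs = clip_loss(emb, rng.normal(size=(8, 64)))          # unrelated pairs
```

Well-aligned pairs drive the loss toward zero; unrelated pairs sit near log(batch size).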
Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions
Interesting PPI virtual screening work based on AlphaFold-Multimer.
- SPOC (Structure Prediction and Omics informed Classifier) is a random forest-based classifier that accurately
RNA language models predict mutations that improve RNA function
A great work from Jennifer Doudna and Jamie Cate's labs!
- Use the Genome Taxonomy Database (GTDB) to build the GARNET (GTDB Acquired RNA with Environmental Temperatures) database
- Train a generative GNN model using a
Learning the Language of Protein Structure
From @instadeepai
- Maps protein backbone to continuous downsampled representations by MPNN
- Discretizes representations into tokens using Finite Scalar Quantization
- Reconstructs protein structures from tokens using a structure
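The Finite Scalar Quantization step above is simple enough to sketch: bound each latent channel, then round it onto a small fixed grid, so the token is just the tuple of per-channel level indices. A minimal numpy sketch (the level count is illustrative, not the paper's setting):

```python
import numpy as np

def fsq(z, levels=8):
    """Finite Scalar Quantization sketch: squash each channel into (-1, 1) with
    tanh, then round onto `levels` evenly spaced values per channel.
    The discrete token is the tuple of per-channel level indices."""
    bounded = np.tanh(z)                                          # (-1, 1)
    idx = np.round((bounded + 1) / 2 * (levels - 1)).astype(int)  # 0 .. levels-1
    quantized = idx / (levels - 1) * 2 - 1                        # back to [-1, 1]
    return quantized, idx

z = np.array([[-2.1, 0.3, 1.7]])
q, idx = fsq(z)
```

Unlike VQ-VAE codebooks, there is nothing to learn in the quantizer itself; the per-channel rounding error is bounded by half a grid step.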
FlowPacker: Protein side-chain packing with torsional flow matching
- FlowPacker, a fast and accurate model for predicting side-chain conformations using Torsional Flow Matching and Equivariant Graph Attention Networks
- Inference with an exponential schedule for the vector field
A bioactivity foundation model using pairwise meta-learning |
@NatMachIntell
- ActFound is trained on 1.6 million bioactivity data points across 35,644 assays to predict the bioactivity of compounds using pairwise meta-learning (predicting the relative difference)
- Encode
Structure prediction of alternative protein conformations |
@NatureComms
- Cfold, a structure prediction model designed to predict alternative protein conformations.
- Train AF2 without the Template Track and focus on MSAs and coevolutionary signals
- Predict different
Check out my new work with @FrankNoeBerlin, where we answer whether the structures of different protein conformations can really be predicted. Also try the Colab to see if your protein has different conformations:
Accurate Conformation Sampling via Protein Structural Diffusion
- Diffold, a diffusion-based model for robust sampling of diverse protein conformations using amino acid sequences
- Transforms AlphaFold2 into a diffusion model, and applies hierarchical reweighting based on
Learning to design protein-protein interactions with enhanced generalization ICLR24'
- PPIRef, the largest non-redundant dataset of 3D protein–protein interactions,
- PPIformer, a new SE(3)-equivariant model generalizing across diverse protein–binder variants.
- Finetune
Fine-tuning protein language models boosts predictions across diverse tasks |
@NatureComms
- Finetune pLMs (ESM2, ProtT5, Ankh) on different tasks (GB1, GFP, AAV, Location, Meltome, Stability, Disorder Prediction, and Secondary Structure Prediction)
- Explore various PEFT
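LoRA, one of the PEFT methods explored above, freezes the pretrained weight and learns only a low-rank additive update. A minimal numpy sketch of the forward pass — dimensions and scaling are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor (small init)
B = np.zeros((d_out, r))               # trainable; zero init -> no change at start

def lora_forward(x):
    # Frozen path plus scaled low-rank update (alpha/r is LoRA's usual scaling);
    # during fine-tuning only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
y0 = lora_forward(x)  # equals x @ W.T at init, since B is zero
```

The trainable parameter count is 2*r*d instead of d*d, which is why LoRA-style fine-tuning of large pLMs is cheap.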
Generative Modeling of Molecular Dynamics Trajectories | NeurIPS 24' |📸"molecular video"
- MDGEN, a flow-based model for MD trajectories, with various capabilities (forward simulation, interpolation, upsampling, and molecular design)
- Tokenize structure as Roto-Translation and
Toward De Novo Protein Design from Natural Language
- T2struct, encoder-decoder architecture, uses PubMedBERT for encoding text and GPT-2 for decoding structural tokens
- Retrain SaProt with projected text embeddings and structural tokens for sequence generation (Retrain ProGen
Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure
- CHEAP, a novel method for compressing protein sequence and structure latent space (ESMFold), achieves up to 128x channel and 8x length compression from sequence input alone
- Uses per-channel
Unsupervised learning of progress coordinates during weighted ensemble simulations: Application to millisecond protein folding
- Improve rare events in protein folding (e.g., state transitions) through weighted ensemble simulation and an unsupervised deep learning model.
- Use a
Large protein databases reveal structural complementarity and functional locality
- Cluster the AFDB with Foldseek, annotate with DeepFRI, generate embeddings with Geometricus, and use PaCMAP for dimensionality reduction
Preprint:
Unsupervised evolution of protein and antibody complexes with a structure-informed language model |
@ScienceMagazine
- Train on millions of nonredundant pairs of protein sequences and backbone
- Use autoregressive modeling to integrate sequence and structural information
PocketGen: Generating Full-Atom Ligand-Binding Protein Pockets
- Co-designs the residue sequence and full-atom structure of protein pockets for binding
- Uses a bilevel graph transformer to model multi-granularity (atom and residue/ligand level) and multi-aspect (intra-protein
De novo design of Ras isoform selective binders | Baker Lab
- Use several methods to generate backbone for disordered Ras C-terminus: Amino Acid Recognition Pocket-Based Design, Scaffolded RFDiffusion, and Sequence Input RFDiffusion
- For Pocket-Based Design: Build an initial
Geometric deep learning of protein–DNA binding specificity |
@naturemethods
- Represent DNA structures as symmetrized helices, and Proteins as atom-based graphs with (one-hot atom type, solvent-accessible surface feature, and Atchley factors etc.)
- Use spatial Graph
Excited to share: "Fine-tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design"
With my amazing coauthors Masatoshi Uehara, @yiyiyihe, @amywang01, @tbyanc, @lal_avantika, Tommi Jaakkola, @svlevine, @hcwww_, and Aviv Regev
De novo protein design with a denoising diffusion network independent of pretrained structure prediction models |
@naturemethods
- SCUBA-D uses a two-step denoising process. First, generate an initial low-resolution backbone, then perform multiple steps of denoising to generate
Multi-Scale Protein Language Model for Unified Molecular Modeling | ICML 2024
- ESM-AA, a multi-scale protein language model that enables unified modeling at both the residue and atom scales
- Pre-training on multi-scale code-switch protein sequences that randomly unzip residues
Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering |
@NatureComms
- MODIFY, optimizes diversity at the residue level using Pareto optimization (Stochastic Gradient Ascent) to balance both fitness and
A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
A new 5′ untranslated region (UTR) Language Model!
- Pretrain the model with mask prediction, 5′ UTR secondary structure prediction, and minimum free energy prediction
- Finetune [CLS]
Force-Guided Bridge Matching for Full-Atom Time-Coarsened Dynamics of Peptides
- Force-Guided Bridge Matching (FBM) learns the dynamics between two states using a Brownian bridge process with the integration of an intermediate force field as guidance for a Boltzmann-like
A catalog of small proteins from the global microbiome |
@NatureComms
- "construct a global microbial smORFs catalog (GMSC) derived from 63,410 publicly available metagenomes across 75 distinct habitats and 87,920 high-quality isolate genomes. GMSC contains 965 million
Improving AlphaFlow for Efficient Protein Ensembles Generation | ICML 24' Workshop
- AlphaFlow-Lit focuses on fine-tuning only the lightweight structure module for faster sampling
- Treat AlphaFold as a sequence-conditioned denoising model, focusing on precomputed single and pair
Reinforcement Learning for Sequence Design Leveraging Protein Language Models
- Investigate RL algorithms for protein sequence design using pLM as a reward function
- Use ESMFold as the oracle pLM, and Distill it into a smaller model to serve as the proxy reward model
- Train the
De novo design of ATPase based on the blueprint optimized for harboring the P-loop motif
- Use Rosetta to design a stable conformal backbone harboring the P-loop (phosphate-binding loop) for ATPase activity
- Conduct fragment assembly simulations with the β-(P-loop)-α-β motif,
Protein Language Models in Directed Evolution | ICML Workshop
- Use MSA Transformer for guided directed evolution on enzyme variants (PET degradation)
- Few-shot mode takes a small amount of experimental data to train a ridge regression model on top of the MSA Transformer
-
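The few-shot mode above — a ridge regression head on frozen embeddings — fits in closed form. A minimal numpy sketch, where the "embeddings" are random stand-ins for real MSA Transformer outputs and the fitness values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for frozen MSA Transformer embeddings of 24 assayed variants.
X = rng.normal(size=(24, 128))
# Hypothetical fitness measurements (synthetic linear signal plus noise).
y = (X @ rng.normal(size=128)) * 0.1 + rng.normal(size=24) * 0.01

lam = 1.0  # ridge penalty; keeps the few-shot fit from overfitting p >> n
# Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def predict_fitness(emb):
    """Score new variants from their frozen embeddings."""
    return emb @ w
```

With only a handful of labeled variants, the frozen pLM embedding does most of the work and the linear head stays data-efficient.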
Full-Atom Peptide Design based on Multi-modal Flow Matching | ICML 24'
- PepFlow, uses a conditional flow-matching framework to model peptide binder structures and sequences
- Position (Euclidean space), Orientation (SO(3)),
Angles (Toric space), Type (Categorical space)
InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions
- Use an adapter to connect the Structure Encoder (ProteinMPNN and others) with pLM (ProGen2)
- Outperforms ProteinMPNN in terms of Perplexity and Recovery Rate
- Validate the model by
A conditional protein diffusion model generates artificial programmable endonuclease sequences with enhanced activity |
@CellDiscovery
- Train conditional protein diffusion model, CPDiffusion, a Equivariant Graph Denoising Network for Argonaute (Ago) proteins under the DDPM
Recent Papers from Baker Lab
1. Designed endocytosis-inducing proteins degrade targets and amplify signals |
@Nature
2. Multistate and functional protein design using RoseTTAFold sequence space diffusion |
@NatureBiotech
ProteinGenerator Paper
ProteinBench: A Holistic Evaluation of Protein Foundation Models
- Benchmark on Inverse Folding, Backbone Design, Sequence Design, Structure-Sequence Co-Design, Motif Scaffolding, Antibody Design, Protein Conformation Prediction
Preprint:
Project Page:
Fast, sensitive detection of protein homologs using deep dense retrieval |
@NatureBiotech
- Dense Homolog Retriever (DHR) employs a bi-encoder architecture (ESM1b, first vector as a fixed-length embedding) and a CLIP-like approach to train on homologous pairs with in-batch negatives
-
Designing molecular RNA switches with Restricted Boltzmann machines
- Use Restricted Boltzmann machines to design artificial SAM-I riboswitches, focusing on their aptamer domain
- The designed sequences were validated through chemical probing, with approximately 30% demonstrating
RNAFlow: RNA Structure & Sequence Design via Inverse Folding-Based Flow Matching | ICML 24'
- RNAFlow, a flow matching (FM) model for RNA sequences and structures generation, conditioned on protein interactions
- Combines an RNA inverse folding model with a pretrained
Accurate structure prediction of biomolecular interactions with AlphaFold 3 | Nature
They use Diffusion with Transformer😲. Replaced invariant point attention with a relatively standard non-equivariant point-cloud diffusion model over all atoms
Accelerating protein engineering with fitness landscape modeling and reinforcement learning
- µFormer, pre-trained using a pairwise masked language model (next thread) on UniRef50.
- Fine-tune and evaluate the model on FLIP and ProteinGym (random split 🤔) using residue (capable
ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning
- Aligns sequence-structure, sequence-function, and structure-function pairs by ESM, BERT, and Foldseek
- Leverages max-inner product search for rapid retrieval
preprint:
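The max-inner-product search above reduces retrieval to one matrix-vector product over precomputed embeddings. A toy numpy sketch — database size, dimensions, and the unit-norm choice are illustrative assumptions, not ProTrek's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in database of 1000 precomputed embeddings, unit-normalized so that
# max inner product coincides with cosine similarity.
db = rng.normal(size=(1000, 256))
db /= np.linalg.norm(db, axis=1, keepdims=True)

def retrieve(query, k=5):
    """Return indices of the k database entries with the largest inner product."""
    scores = db @ (query / np.linalg.norm(query))
    return np.argsort(scores)[::-1][:k]

# A query close to database entry 42 should retrieve it first.
query = db[42] + 0.05 * rng.normal(size=256)
top = retrieve(query)
```

In practice an approximate-nearest-neighbor index replaces the brute-force dot product, but the scoring rule is the same.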
Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks
- Benchmark six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, ESM3) against ESM2-3B on six tasks (GB1, GFP, AAV, Location, Meltome, Stability)
- No tpLM
Preference optimization of protein language models as a multi-objective binder design paradigm
Use DPO (Direct Preference Optimization) for peptide binder design with ProtGPT2
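The DPO objective needs only sequence log-likelihoods under the policy and a frozen reference model — no explicit reward model. A minimal numpy sketch of the per-pair loss; the log-likelihood values and beta are illustrative:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair of sequences, given
    summed log-likelihoods under the policy and a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid(beta*margin)

# Policy favors the preferred binder more than the reference does -> low loss.
low = dpo_loss(-95.0, -110.0, -100.0, -100.0)
# Policy favors the rejected binder -> high loss.
high = dpo_loss(-110.0, -95.0, -100.0, -100.0)
```

For multi-objective binder design, preferences can encode any combination of properties (affinity, solubility, ...) without retraining a reward model.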
We’re presenting AlphaProteo: an AI system for designing novel proteins that bind more successfully to target molecules. 🧬
It could help scientists better understand how biological systems function, save time in research, advance drug design and more. 🧵
Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning |
@NatureComms
- Propose FSFP with model-agnostic meta-learning, learning to rank (ListMLE loss), and LoRA to enhance protein language models for few-shot learning of fitness
- In
PiNUI: A Dataset of Protein-Protein Interactions for Machine Learning
- PiNUI, Protein interactions with Nearly Uniform Imbalance
- Construct the negative set exclusively from positive sequence pairs, sampling two proteins that each interact with only one other protein in their
BulkRNABert: Cancer prognosis from bulk RNA-seq based language models
- Transform gene expression values into tokens by binning Transcripts Per Million (TPM) values
- Train Self-supervised MLM over expression data
- Finetune MLP head for cancer type classification and survival
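The tokenization step above just bins continuous TPM values into a discrete vocabulary. A minimal numpy sketch — the bin edges and vocabulary size here are hypothetical, not BulkRNABert's actual choices:

```python
import numpy as np

# Hypothetical bin edges, log-spaced over a plausible TPM range;
# 63 edges produce a 64-token expression vocabulary.
edges = np.logspace(-2, 4, num=63)

def tpm_to_tokens(tpm):
    """Map each gene's TPM value to a discrete token id by binning.
    Token 0 catches values below the lowest edge; 63 catches values above the highest."""
    return np.digitize(tpm, edges)

tokens = tpm_to_tokens(np.array([0.0, 0.5, 12.0, 30000.0]))
```

Once expression values are tokens, standard masked-language-model training applies unchanged.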
FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling
- BLIP-2 for Protein Annotation
- Combine ESM2 and Mistral-7B
- Outperforms DeepGo series
De novo design of miniprotein antagonists of cytokine storm inducers |
@NatureComms
- Use Rosetta-based binder design approach (Cao, et al., 2024) against IL-6R, GP130, and IL-1R1
- Dock 40k de novo protein scaffolds to hotspot residues (Patchdock and Rifdock), and design 2.5
What has AlphaFold3 learned about antibody and nanobody docking, and what remains unsolved?
- Evaluate models on Structural Antibody Database (SabDab)
- AF3 improves docking accuracy over AF2-M and AlphaRED, with a 38.4% success rate for antibodies and 36.1% for nanobodies
Adapting protein language models for structure-conditioned design
- proseLM, enhances ProGen2 with Structural Adapter Layers
- Causal Encoder uses message-passing (MPNN) and invariant-point message-passing (IPMP) layers to capture structural information of both protein and
Evolution-Inspired Loss Functions for Protein Representation Learning
Evolutionary Ranking (EvoRank) incorporates evolutionary dynamics from MSA-based Soft Labels to learn more diverse protein representations
Protein isoform-centric therapeutics: expanding targets and increasing specificity |
@NatRevDrugDisc
- "highlight three modes of action for protein isoform-centric drugs: isoform switching, isoform introduction or depletion, and modulation of isoform activity. In addition, we
ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding
Q: <Protein><Struct><Seq></Protein> <QuestionPrompts>
A: <Description>
- Align sequence and structure modalities (frozen ESM-2 and ESM-IF) via projection layers
- Use instruction tuning
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design
- Introduce BADGER, Binding Affinity Diffusion Guidance with Enhanced Refinement
- Uses an Equivariant Graph Neural Network (EGNN) to approximate AutoDock Vina’s non-differentiable energy
Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design
- Curated antibody dataset using SAbDab and the Walk Jump Sampler method, used a surrogate model (e.g., PyRosetta) to label binding energy, and split into 5 environment sets
- Benchmarked SeqCNN,
Diffusing protein binders to intrinsically disordered proteins
- Finetune RFdiffusion to accept secondary structure specifications along with sequence input. Add partially masked secondary structure and "block adjacency" information.
- Input Target protein sequence (optionally
Peptipedia v2.0: A peptide sequence database and user-friendly web platform. A major update
- Expand the collection by over 45%, with an improved functional biological activity tree (managed by PostgreSQL)
- Train over 90 binary classification models using protein language models
Contextual AI models for single-cell protein biology
- PINNACLE, a geometric deep learning model for generating context-specific protein representations via link prediction and cell type classification pretraining
- Construct context-sensitive protein interaction networks and a
EquiScore, a novel protein-ligand interaction scoring method integrating physical prior knowledge
@NatMachIntell
- Uses a heterogeneous graph neural network to evaluate interactions in equivariant geometric space
- Constructs the PDBscreen dataset by combining redocking,
Aligning protein generative models with experimental fitness via Direct Preference Optimization
@talaldotpdb
- ProteinDPO, aligns ESM-IF1 with experimental stability fitness using Direct Preference Optimization
- Capable of improved binding affinity prediction and stabilized
Proteus: pioneering protein structure generation for enhanced designability and efficiency | ICML '24
- Proteus, an unconditional protein backbone diffusion model without pre-training by utilizing graph-based triangle methods and a multi-track interaction network
- Achieved
MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
- Uses 2D Evolutionary Positional Encoding by RoPE, and flattens MSA for a 1D decoding problem
- MSA Generative Pre-Training, Rejective Fine-tuning, Reinforcement Learning from AlphaFold2
De novo Design of A Fusion Protein Tool for GPCR Research
- Design Fusion Protein to facilitate GPCR cryo-EM study with enhanced stability and rigidity
- Use RFdiffusion + AF2 workflow. Delete the third intracellular loop (ICL3) and generate 10-20 amino acids between ICL3 and
Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation
- Use ESM-2-650M embeddings from an unmasked sequence to predict masked one-at-a-time probability vectors (by MLP), reducing the need for multiple forward passes
- Combine OFS pseudo-perplexity technique within an
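Given per-position masked probability vectors — which the one-fell-swoop trick predicts in a single forward pass instead of L separate masked passes — pseudo-perplexity is just the exponentiated mean negative log-probability of the true residues. A toy numpy sketch with made-up probabilities:

```python
import numpy as np

def pseudo_perplexity(masked_probs, sequence_ids):
    """Pseudo-perplexity from per-position masked probability vectors:
    exp of the mean negative log-probability assigned to each true residue.
    `masked_probs` is (L, vocab); row i is the model's distribution for
    position i when that position is masked."""
    L = len(sequence_ids)
    nll = -np.log(masked_probs[np.arange(L), sequence_ids])
    return np.exp(nll.mean())

# Toy 3-residue protein over a 4-letter alphabet.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.60, 0.20, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
ppl = pseudo_perplexity(probs, np.array([0, 1, 3]))
```

Lower pseudo-perplexity means the model finds the sequence more plausible, which is the signal used for fitness estimation.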
De novo generation of multi-target compounds using deep generative chemistry
@NatureComms
- POLYGON, a VAE-based model with reinforcement learning for programmatic generation of new polypharmacology compounds that inhibit multiple protein targets
Protein Set Transformer: A protein-based genome language model to power high diversity viromics
- Protein Set Transformer (PST) represents genomes as graphs, with proteins as nodes using ESM2 embeddings
- Encoder contextualizes protein nodes with different attention weights;
Updated List of Recent Works on Tokenizing Protein Structures/Backbone using Deep Learning:
- Foldseek
- SWAMPNN
- ProTokens
- ProSST
- Learning the Language of Protein Structure
- FoldToken 1 & 2
- ESM3
Non DL methods before 🤔:
- Geometricus and TERMs and More
Thanks to everyone
Unsupervised domain classification of AlphaFold2-predicted protein structures
- Use Foldseek for all-against-all local alignment, followed by density-based clustering, and merge into metaclusters of protein domains
- Recover known folds from SCOP (94%) and CATH (86%) using a
ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction
- Use Protein Chain of Thought (ProCoT) to simulate signaling pathways with ProtTrans embeddings and step-by-step reasoning chains.
- Convert the Mol dataset into prompt-answer pairs for
Computational design of soluble and functional membrane protein analogues |
@Nature
- Uses an AlphaFold2-based pipeline coupled with ProteinMPNN for sequence optimization to design complex folds and soluble analogues
- Key proteins designed included Ig-like folds (IGFs),