![fajie yuan Profile](https://pbs.twimg.com/profile_images/519943687397064704/SJQbTshv_x96.jpeg)
fajie yuan
@duguyuan
Followers: 1K · Following: 835 · Statuses: 600
Assistant Prof at Westlake University
Hangzhou, China
Joined September 2014
We release our protein ChatGPT, Evola! 🌟 Evola comes in two versions: 10B & 80B. The 80B model has a 1.3B SaProt encoder & a 70B Llama3 decoder. Trained on 546 million protein question-text pairs with 150 billion word tokens! 💡🔬
How can we effectively decode and understand the complex molecular language of proteins to unlock their functional secrets at scale? @biorxivpreprint @Westlake_Uni "Decoding the Molecular Language of Proteins with Evola"
• Scientists have developed Evola, an 80-billion-parameter frontier protein-language generative model that combines information from protein sequences, structures, and user queries to decode protein function. It is trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, integrating Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG) to improve response quality.
• While protein structure prediction has seen major breakthroughs with tools like AlphaFold, a critical gap remains between structural determination and functional understanding: fewer than half a million proteins have expert-curated functional annotations, despite hundreds of millions of known sequences. Previous protein language models have been limited by training data of only 500,000 to 3 million protein-text pairs.
• The model architecture includes SaProt-650M or SaProt-1.3B as protein encoders, sequence compressors with 6-8 layers, and Llama3 decoders (8B or 70B parameters). Training used DeepSpeed on 32 NVIDIA A100 GPUs for the 10B model and FSDP on 64 H800 GPUs for the 80B model. The training data combined Swiss-Prot annotations (16 million triples) and ProTrek-derived data (530 million triples).
• Performance evaluation showed Evola-80B achieved a GPT score of 74.10, compared to 40.49 for DeepSeek-v3 and 37.07 for GPT-4. On EC number prediction for novel proteins, Evola achieved 41.2% accuracy for exact matches and 66.8% for three-digit matches, comparable to specialized classification models. The model demonstrated strong performance on both general test sets and more challenging hard subsets with low sequence similarity to the training data.
Authors: Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang et al. @duguyuan Link:
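The two variants described in the summary can be sketched as a small configuration object. This is purely illustrative: the `EvolaConfig` class and the pairing of SaProt-650M with the 8B decoder for the 10B model are assumptions inferred from the thread (only the 80B pairing is stated explicitly), and the paper's 6-8 compressor layers are not mapped to a specific variant here.

```python
# Hypothetical sketch of the two Evola variants described in the thread.
# Field values come from the tweet summary; the class itself is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvolaConfig:
    encoder: str       # structure-aware SaProt protein encoder
    decoder: str       # Llama3 language decoder
    total_params: str  # headline parameter count of the combined model
    # Note: the summary mentions a 6-8 layer sequence compressor between
    # encoder and decoder, without specifying the depth per variant.

# 80B pairing is stated in the thread; the 10B pairing is an assumption.
EVOLA_10B = EvolaConfig("SaProt-650M", "Llama3-8B", "10B")
EVOLA_80B = EvolaConfig("SaProt-1.3B", "Llama3-70B", "80B")
```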
@joe_fenrir @david_kochman @ATinyGreenCell Try generating more times and choose designs with relatively lower ProTrek scores. A high ProTrek score means high similarity between protein and function. If it is a new structure, is it still a PETase? Structure or shape is conserved.
@joe_fenrir @ATinyGreenCell Combine this sequence with the one above. "ASGLGIALALGELGADVTDADIGRHFDRYSSVASTSAGVELDANEIVVIGNSARPTGGLAIGHGLGADPDDIAGIAAALRALGVAGGAGPDAADLDRIVFVFVKAAASPNGTTPGVVVPLDDDDDDLSTHHARSAAGGVVAGATGDDVVVVSGSGEHQVPPGGGVVAVVARRR"
@joe_fenrir @ATinyGreenCell "MSPTRVRAYRVPMTGPADVSGLRALLAAGGIDPRSIVAVIGKTEGNGCVNDFTRAFATLALLELLGERLGCSPEEVAERVAFVMSGGTEGVLSPHLTVFTREEVDAAPAGAAGGRLAIGVARTPEFAPEEIGTPAQRDIVADAVRAAMADAGITDPRDVHFVQVKCPLLTQARIDAVRARGRSTATEDTYRSMGFSRG" I generated one using my phone. Add the sequence from the next tweet; there isn't enough space here.
@joe_fenrir @ATinyGreenCell ProTrek scores are important - if the score is less than 12, the design might not match our expectations. Trying different prompts can help improve results in some cases.
@joe_fenrir @ATinyGreenCell No, it generates structures rather than searching for them. As I recall, the paper shows it can generate structures that are not in the PDB. The model was trained on hundreds of billions of tokens, which gives it some generalization ability (for example, combining A + B).
@joe_fenrir @ATinyGreenCell Hoo, we don't make such a big assumption. The model just learns from a very large dataset and can then predict when your query is somehow related to one or more training examples.
@david_kochman @ATinyGreenCell A ProTrek score > 12 means the text and sequence align well; > 15 indicates high similarity. The top designs' ProTrek scores are almost 20 in this case.
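The score thresholds mentioned in this reply can be written as a small bucketing helper. This is a minimal sketch of the heuristic as stated in the thread; `classify_alignment` is a hypothetical name, and the example scores are illustrative placeholders, not real ProTrek output.

```python
# Hypothetical sketch of the ProTrek-score heuristic described in the thread:
# > 12 means good text-sequence alignment, > 15 means high similarity.

def classify_alignment(score: float) -> str:
    """Bucket a text-sequence ProTrek score using the thread's thresholds."""
    if score > 15:
        return "high similarity"
    if score > 12:
        return "good alignment"
    return "weak match"

# Illustrative placeholder scores for three generated designs.
designs = [("design_a", 19.8), ("design_b", 13.4), ("design_c", 10.1)]
for name, score in designs:
    print(name, "->", classify_alignment(score))
```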
@david_kochman @ATinyGreenCell I think we can select some designs with relatively lower ProTrek scores. A high score means high similarity between text and sequence, so it is reasonable that some high-scoring designs are very similar to existing proteins.
@david_kochman @ATinyGreenCell See this thread.
@duguyuan That seems to work well. BLAST in FPbase and UniProt matches fluorescent proteins, but with 30-35% sequence identity.
@david_kochman @ATinyGreenCell This is just one generation. You can use Pinal to generate 10,000 if you like (using the weights on Hugging Face); it should produce some new ones, I guess.
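The generate-many-then-filter workflow suggested across these replies could be sketched as follows. This is purely illustrative: Pinal's actual API is not shown in the thread, so `generate_design` and `protrek_score` are hypothetical stand-ins (here stubbed with random data), and the upper cutoff of 18 for favoring novelty is an assumption extrapolated from the "choose relatively lower scores" advice.

```python
# Illustrative sketch of "generate many designs, then filter by ProTrek score".
# generate_design() and protrek_score() are hypothetical stand-ins, stubbed
# with random data; they are NOT the real Pinal or ProTrek interfaces.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_design(rng: random.Random) -> str:
    """Stand-in for sampling one sequence from the Pinal weights."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(50))

def protrek_score(seq: str, rng: random.Random) -> float:
    """Stand-in for the real text-sequence alignment scorer."""
    return rng.uniform(8.0, 20.0)

rng = random.Random(0)
candidates = [generate_design(rng) for _ in range(100)]
scored = [(seq, protrek_score(seq, rng)) for seq in candidates]

# Keep designs that align with the prompt (score > 12) but avoid the very
# top scores (< 18, an assumed cutoff), which tend to be near-duplicates
# of existing proteins, per the advice in the thread.
novel = [seq for seq, score in scored if 12.0 < score < 18.0]
print(f"kept {len(novel)} of {len(candidates)} designs")
```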
@ATinyGreenCell @david_kochman Many designed proteins have only 20-30% sequence identity with existing proteins. I think it depends on the specific protein.
@ATinyGreenCell @david_kochman You may try more cases. Happy to see your comments. I think it is difficult to generate completely new structures, and even when we do, we cannot tell before wet-lab validation. You may compare with ESM3. :) Science advances step by step.