fajie yuan Profile
fajie yuan

@duguyuan

Followers
1K
Following
835
Statuses
600

Assistant Prof at Westlake University

Hangzhou,China
Joined September 2014
@duguyuan
fajie yuan
1 month
We release our protein ChatGPT, Evola! 🌟 Evola comes in two versions: 10B & 80B. The 80B model has a 1.3B SaProt encoder & a 70B Llama3 decoder. Trained on 546 million protein question-answer pairs with 150 billion word tokens! 💡🔬
@aipulserx
DailyHealthcareAI
1 month
How can we effectively decode and understand the complex molecular language of proteins to unlock their functional secrets at scale? @biorxivpreprint @Westlake_Uni "Decoding the Molecular Language of Proteins with Evola"
• Scientists have developed Evola, an 80 billion parameter frontier protein-language generative model that combines information from protein sequences, structures, and user queries to decode protein function. It's trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, integrating Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG) to improve response quality.
• While protein structure prediction has seen major breakthroughs with tools like AlphaFold, there remains a critical gap between structural determination and functional understanding, with fewer than half a million proteins having expert-curated functional annotations despite hundreds of millions of known sequences. Previous attempts at protein language models have been limited by training data of only 500,000 to 3 million protein-text pairs.
• The model architecture includes SaProt-650M or SaProt-1.3B as protein encoders, sequence compressors with 6-8 layers, and Llama3 decoders (8B or 70B parameters). Training used DeepSpeed on 32 NVIDIA A100 GPUs for the 10B model and FSDP on 64 H800 GPUs for the 80B model. The training data combined Swiss-Prot annotations (16 million triples) and ProTrek-derived data (530 million triples).
• Performance evaluation showed Evola-80B achieved a GPT score of 74.10 compared to 40.49 for DeepSeek-V3 and 37.07 for GPT-4. On EC number prediction for novel proteins, Evola achieved 41.2% accuracy for exact matches and 66.8% for three-digit matches, comparable to specialized classification models. The model demonstrated strong performance across both general test sets and more challenging hard subsets with low sequence similarity to training data.
Authors: Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang et al. @duguyuan Link:
22
138
609
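The exact-match and three-digit-match figures quoted in the summary above can be read as comparisons on the dot-separated EC number hierarchy. A minimal sketch, assuming "three-digit match" means agreement on the first three EC levels; the function names and the example EC numbers are illustrative and are not taken from the Evola paper:

```python
def ec_match_level(predicted: str, reference: str) -> int:
    """Count how many leading EC levels agree (0 to 4)."""
    pred_parts = predicted.strip().split(".")
    ref_parts = reference.strip().split(".")
    level = 0
    for pred, ref in zip(pred_parts, ref_parts):
        if pred != ref:
            break
        level += 1
    return level


def is_exact_match(predicted: str, reference: str) -> bool:
    """All four EC levels agree."""
    return ec_match_level(predicted, reference) == 4


def is_three_digit_match(predicted: str, reference: str) -> bool:
    """At least the first three EC levels agree (assumed definition)."""
    return ec_match_level(predicted, reference) >= 3


if __name__ == "__main__":
    # Hypothetical prediction/reference pair sharing the first three levels.
    print(ec_match_level("3.1.1.101", "3.1.1.74"))
    print(is_three_digit_match("3.1.1.101", "3.1.1.74"))
    print(is_exact_match("3.1.1.101", "3.1.1.74"))
```

Under this reading, every exact match is also a three-digit match, which is consistent with the three-digit accuracy being the higher of the two numbers.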
@duguyuan
fajie yuan
1 day
🧬 Just discovered something cool! Want to find & engineer proteins? It's super easy now: → Search billions of proteins with ProTrek → Edit your protein with ColabSaprot (mutation module) No coding. No install headaches. Just a few clicks ! ✨
0
1
8
@duguyuan
fajie yuan
7 days
0
0
2
@duguyuan
fajie yuan
8 days
@jeremydratcliff @hhlee substantial improvement?
1
0
0
@duguyuan
fajie yuan
14 days
RT @FengyDai: We conducted a preliminary comparison of Pinal with other text-to-protein models in a dry lab setting. Please note that this…
0
1
0
@duguyuan
fajie yuan
15 days
@joe_fenrir @david_kochman @ATinyGreenCell Try generating more times and choosing designs with relatively lower ProTrek scores. A high ProTrek score means high similarity between protein and function. If it is a new structure, is it still a PETase? Structure or shape is conserved.
0
0
1
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell Combine this sequence with the one above. "ASGLGIALALGELGADVTDADIGRHFDRYSSVASTSAGVELDANEIVVIGNSARPTGGLAIGHGLGADPDDIAGIAAALRALGVAGGAGPDAADLDRIVFVFVKAAASPNGTTPGVVVPLDDDDDDLSTHHARSAAGGVVAGATGDDVVVVSGSGEHQVPPGGGVVAVVARRR"
0
0
1
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell "MSPTRVRAYRVPMTGPADVSGLRALLAAGGIDPRSIVAVIGKTEGNGCVNDFTRAFATLALLELLGERLGCSPEEVAERVAFVMSGGTEGVLSPHLTVFTREEVDAAPAGAAGGRLAIGVARTPEFAPEEIGTPAQRDIVADAVRAAMADAGITDPRDVHFVQVKCPLLTQARIDAVRARGRSTATEDTYRSMGFSRG" I generated one using my phone. Append the sequence from the next tweet; there is not enough space here.
0
0
2
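The sequence in this thread is split across two tweets: the fragment starting with "MSPTRV" comes first, and the fragment starting with "ASGLGI" from the follow-up tweet is appended to it. A minimal sketch of joining them, assuming that order is what the author intended; the helper name `join_fragments` is illustrative:

```python
# First fragment, copied from the tweet that says it was generated on a phone.
first_half = (
    "MSPTRVRAYRVPMTGPADVSGLRALLAAGGIDPRSIVAVIGKTEGNGCVNDFTRAFATLALLELLGERLGCSPEEVAERVAFVMSGGTEGVLSPHLTVFTREEVDAAPAGAAGGRLAIGVARTPEFAPEEIGTPAQRDIVADAVRAAMADAGITDPRDVHFVQVKCPLLTQARIDAVRARGRSTATEDTYRSMGFSRG"
)

# Second fragment, copied from the follow-up tweet.
second_half = (
    "ASGLGIALALGELGADVTDADIGRHFDRYSSVASTSAGVELDANEIVVIGNSARPTGGLAIGHGLGADPDDIAGIAAALRALGVAGGAGPDAADLDRIVFVFVKAAASPNGTTPGVVVPLDDDDDDLSTHHARSAAGGVVAGATGDDVVVVSGSGEHQVPPGGGVVAVVARRR"
)


def join_fragments(*fragments: str) -> str:
    """Concatenate sequence fragments after removing whitespace and quotes."""
    cleaned = ["".join(fragment.split()).strip('"') for fragment in fragments]
    return "".join(cleaned)


full_sequence = join_fragments(first_half, second_half)

if __name__ == "__main__":
    print(len(full_sequence))
    print(full_sequence)
```

The joined string can then be pasted into a structure predictor or a BLAST search as a single sequence.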
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell ProTrek scores are important - if the score is less than 12, it might not match our expectations. Trying different prompts can help improve results for some cases.
0
0
2
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell You can try it. I am not sure about the enzyme you mentioned.
0
0
2
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell No, it generates structures rather than searching. As I remember, the paper shows it can generate structures that are not in the PDB. The model was trained on hundreds of billions of tokens, giving it some generalization ability (for example, A + B).
0
0
1
@duguyuan
fajie yuan
15 days
@joe_fenrir @ATinyGreenCell Hoo, we don't make such a big assumption. The model just learns from a very large dataset and is then able to predict when your query is somehow related to one or more training examples.
2
0
2
@duguyuan
fajie yuan
17 days
@david_kochman @ATinyGreenCell A ProTrek score > 12 means the text and sequence are well aligned; > 15 means high similarity. The top designs' ProTrek scores are almost 20 in this case.
0
0
0
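The score thresholds mentioned in this thread (above 12 for good alignment, above 15 for high similarity, and picking lower-scoring designs when novelty matters) can be sketched as a small filter. This is only an illustration of the rule of thumb from the tweets: the function names, the handling of a score of exactly 12, and the example scores are assumptions, and nothing here calls ProTrek itself.

```python
def describe_protrek_score(score: float) -> str:
    """Label a score using the thresholds quoted in the thread."""
    if score > 15:
        return "high similarity"
    if score > 12:
        return "good alignment"
    # Assumption: scores of 12 or below are treated as weak matches.
    return "may not match expectations"


def pick_less_similar_designs(scores: dict[str, float], k: int = 3,
                              min_score: float = 12.0) -> list[str]:
    """Keep designs above min_score and return the k lowest-scoring ones."""
    kept = [(name, score) for name, score in scores.items() if score > min_score]
    kept.sort(key=lambda item: item[1])
    return [name for name, _ in kept[:k]]


if __name__ == "__main__":
    # Hypothetical scores for four generated designs.
    example = {"design_a": 19.5, "design_b": 12.4, "design_c": 9.0, "design_d": 15.2}
    for name, score in example.items():
        print(name, score, describe_protrek_score(score))
    print(pick_less_similar_designs(example, k=2))
```

Sorting the retained designs in ascending order reflects the suggestion that designs with relatively lower scores are less likely to closely resemble existing proteins while still matching the prompt.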
@duguyuan
fajie yuan
17 days
@david_kochman @ATinyGreenCell I think we can select some designs with relatively lower ProTrek scores. A high score means high similarity between text and sequence, so it is reasonable that some high-score designs are highly similar to existing proteins.
0
0
0
@duguyuan
fajie yuan
17 days
@anthonygitter
Anthony Gitter
17 days
@duguyuan That seems to work well. BLAST in FPbase and UniProt matches fluorescent proteins, but with 30-35% sequence identity.
1
0
0
@duguyuan
fajie yuan
17 days
@david_kochman @ATinyGreenCell This is just one generation. You can use Pinal to generate 10,000 if you like (using the weights on Hugging Face); I guess some of them should be new.
0
0
0
@duguyuan
fajie yuan
17 days
@ATinyGreenCell @david_kochman Many designed proteins have only 20-30% sequence identity with existing proteins. I think it depends on the specific protein.
0
0
0
@duguyuan
fajie yuan
17 days
@ATinyGreenCell @david_kochman You may try more cases. Happy to see your comments. I think it is difficult to generate completely new structures. Even if it does, we cannot tell before the wet lab. You may compare with ESM3. :) Science advances step by step.
0
0
4