![fajie yuan Profile](https://pbs.twimg.com/profile_images/519943687397064704/SJQbTshv_x96.jpeg)
fajie yuan
@duguyuan
Followers: 1K · Following: 835 · Statuses: 600
Assistant Prof at Westlake University
Hangzhou, China
Joined September 2014
We release our protein ChatGPT, Evola! 🌟 Evola comes in two versions: 10B & 80B. The 80B model has a 1.3B SaProt encoder & a 70B Llama3 decoder. Trained on 546 million protein question-text pairs with 150 billion word tokens! 💡🔬
How can we effectively decode and understand the complex molecular language of proteins to unlock their functional secrets at scale? @biorxivpreprint @Westlake_Uni "Decoding the Molecular Language of Proteins with Evola"
• Scientists have developed Evola, an 80-billion-parameter frontier protein-language generative model that combines information from protein sequences, structures, and user queries to decode protein function. It is trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, integrating Direct Preference Optimization (DPO) and Retrieval-Augmented Generation (RAG) to improve response quality.
• While protein structure prediction has seen major breakthroughs with tools like AlphaFold, a critical gap remains between structural determination and functional understanding: fewer than half a million proteins have expert-curated functional annotations, despite hundreds of millions of known sequences. Previous protein language models have been limited by training data of only 500,000 to 3 million protein-text pairs.
• The model architecture includes SaProt-650M or SaProt-1.3B as protein encoders, sequence compressors with 6-8 layers, and Llama3 decoders (8B or 70B parameters). Training used DeepSpeed on 32 NVIDIA A100 GPUs for the 10B model and FSDP on 64 H800 GPUs for the 80B model. The training data combined Swiss-Prot annotations (16 million triples) and ProTrek-derived data (530 million triples).
• Performance evaluation showed Evola-80B achieved a GPT score of 74.10, compared to 40.49 for DeepSeek-v3 and 37.07 for GPT-4. On EC number prediction for novel proteins, Evola achieved 41.2% accuracy for exact matches and 66.8% for three-digit matches, comparable to specialized classification models. The model demonstrated strong performance on both general test sets and more challenging hard subsets with low sequence similarity to the training data.
Authors: Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang et al. @duguyuan Link:
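The two variants described in the summary can be sketched as a small configuration object. This is purely illustrative: the `EvolaConfig` class and the pairing of SaProt-650M with the 8B decoder for the 10B model are assumptions inferred from the thread (only the 80B pairing is stated explicitly), and the paper's 6-8 compressor layers are not mapped to a specific variant here.

```python
# Hypothetical sketch of the two Evola variants described in the thread.
# Field values come from the tweet summary; the class itself is illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvolaConfig:
    encoder: str       # structure-aware SaProt protein encoder
    decoder: str       # Llama3 language decoder
    total_params: str  # headline parameter count of the combined model
    # Note: the summary mentions a 6-8 layer sequence compressor between
    # encoder and decoder, without specifying the depth per variant.

# 80B pairing is stated in the thread; the 10B pairing is an assumption.
EVOLA_10B = EvolaConfig("SaProt-650M", "Llama3-8B", "10B")
EVOLA_80B = EvolaConfig("SaProt-1.3B", "Llama3-70B", "80B")
```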
@joe_fenrir @david_kochman @ATinyGreenCell Try generating more times and choose designs with relatively lower ProTrek scores. A high ProTrek score means high similarity between protein and function. If it is a new structure, is it still a PETase? Structure or shape is conserved.
@joe_fenrir @ATinyGreenCell Combine this sequence with the one above. "ASGLGIALALGELGADVTDADIGRHFDRYSSVASTSAGVELDANEIVVIGNSARPTGGLAIGHGLGADPDDIAGIAAALRALGVAGGAGPDAADLDRIVFVFVKAAASPNGTTPGVVVPLDDDDDDLSTHHARSAAGGVVAGATGDDVVVVSGSGEHQVPPGGGVVAVVARRR"
@joe_fenrir @ATinyGreenCell "MSPTRVRAYRVPMTGPADVSGLRALLAAGGIDPRSIVAVIGKTEGNGCVNDFTRAFATLALLELLGERLGCSPEEVAERVAFVMSGGTEGVLSPHLTVFTREEVDAAPAGAAGGRLAIGVARTPEFAPEEIGTPAQRDIVADAVRAAMADAGITDPRDVHFVQVKCPLLTQARIDAVRARGRSTATEDTYRSMGFSRG" I generated one using my phone. Add the sequence from the next tweet; there isn't enough space here.
@joe_fenrir @ATinyGreenCell ProTrek scores are important - if the score is less than 12, the design might not match our expectations. Trying different prompts can help improve results in some cases.
@joe_fenrir @ATinyGreenCell No, it generates structures rather than searching for them. As I recall, the paper shows it can generate structures that are not in the PDB. The model was trained on hundreds of billions of tokens, which gives it some generalization ability (for example, combining A + B).
@joe_fenrir @ATinyGreenCell Hoo, we don't make such a big assumption. The model just learns from a very large dataset and can then predict when your query is somehow related to one or more training examples.
@david_kochman @ATinyGreenCell A ProTrek score > 12 means the text and sequence align well; > 15 indicates high similarity. The top designs' ProTrek scores are almost 20 in this case.
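The score thresholds mentioned in this reply can be written as a small bucketing helper. This is a minimal sketch of the heuristic as stated in the thread; `classify_alignment` is a hypothetical name, and the example scores are illustrative placeholders, not real ProTrek output.

```python
# Hypothetical sketch of the ProTrek-score heuristic described in the thread:
# > 12 means good text-sequence alignment, > 15 means high similarity.

def classify_alignment(score: float) -> str:
    """Bucket a text-sequence ProTrek score using the thread's thresholds."""
    if score > 15:
        return "high similarity"
    if score > 12:
        return "good alignment"
    return "weak match"

# Illustrative placeholder scores for three generated designs.
designs = [("design_a", 19.8), ("design_b", 13.4), ("design_c", 10.1)]
for name, score in designs:
    print(name, "->", classify_alignment(score))
```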
@david_kochman @ATinyGreenCell I think we can select some designs with relatively lower ProTrek scores. A high score means high similarity between text and sequence, so it is reasonable that some high-scoring designs are very similar to existing proteins.
@david_kochman @ATinyGreenCell See this thread.
@duguyuan That seems to work well. BLAST in FPbase and UniProt matches fluorescent proteins, but with 30-35% sequence identity.
@david_kochman @ATinyGreenCell This is just one generation. You can use Pinal to generate 10,000 if you like (using the weights on Hugging Face); it should produce some new ones, I guess.
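The generate-many-then-filter workflow suggested across these replies could be sketched as follows. This is purely illustrative: Pinal's actual API is not shown in the thread, so `generate_design` and `protrek_score` are hypothetical stand-ins (here stubbed with random data), and the upper cutoff of 18 for favoring novelty is an assumption extrapolated from the "choose relatively lower scores" advice.

```python
# Illustrative sketch of "generate many designs, then filter by ProTrek score".
# generate_design() and protrek_score() are hypothetical stand-ins, stubbed
# with random data; they are NOT the real Pinal or ProTrek interfaces.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_design(rng: random.Random) -> str:
    """Stand-in for sampling one sequence from the Pinal weights."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(50))

def protrek_score(seq: str, rng: random.Random) -> float:
    """Stand-in for the real text-sequence alignment scorer."""
    return rng.uniform(8.0, 20.0)

rng = random.Random(0)
candidates = [generate_design(rng) for _ in range(100)]
scored = [(seq, protrek_score(seq, rng)) for seq in candidates]

# Keep designs that align with the prompt (score > 12) but avoid the very
# top scores (< 18, an assumed cutoff), which tend to be near-duplicates
# of existing proteins, per the advice in the thread.
novel = [seq for seq, score in scored if 12.0 < score < 18.0]
print(f"kept {len(novel)} of {len(candidates)} designs")
```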
@ATinyGreenCell @david_kochman Many designed proteins have only 20-30% sequence identity with existing proteins. I think it depends on the specific protein.
@ATinyGreenCell @david_kochman You may try more cases. Happy to see your comments. I think it is difficult to generate completely new structures, and even when we do, we cannot tell before wet-lab validation. You may compare with ESM3. :) Science advances step by step.