Episode 215: Protein Set Transformer for high-diversity viromics
In this episode of PaperCast Base by Base, we explore the Protein Set Transformer (PST), a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets.
Study Highlights:
PST embeds proteins with ESM2, concatenates positional and strand vectors, contextualizes proteins with a multi-head attention encoder, and produces genome embeddings via learnable weighted pooling in the decoder. The foundation PST-TL models were pretrained on >100k dereplicated viral genomes encoding >6M proteins using a triplet-loss objective with PointSwap augmentation, and were evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote relationships, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST improved annotation transfer for hypothetical proteins via embedding and structure-aware clustering and boosted viral host-species prediction when used in a graph link-prediction framework.
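For listeners who want a feel for two of the ideas above, here is a minimal sketch in plain Python of weighted pooling over a set of protein vectors and of a triplet loss on the resulting genome vectors. It is an illustration only, not the authors' implementation: the function names, the single scoring vector used to weight proteins, the Euclidean distance, and the margin value are all assumptions made for this example, and the ESM2 embedding, attention encoder, and PointSwap augmentation steps are left out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(protein_vecs, scoring_weights):
    """Pool a set of protein vectors into one genome vector.

    scoring_weights is a hypothetical learnable vector: each protein is
    scored by a dot product, and the scores become pooling weights.
    """
    scores = [
        sum(w * x for w, x in zip(scoring_weights, vec))
        for vec in protein_vecs
    ]
    weights = softmax(scores)
    dim = len(protein_vecs[0])
    return [
        sum(wt * vec[d] for wt, vec in zip(weights, protein_vecs))
        for d in range(dim)
    ]


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive than to the
    negative by at least `margin` (an assumed value); positive otherwise."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy genome with two 2-dimensional "protein embeddings".
genome = weighted_pool([[1.0, 0.0], [3.0, 4.0]], scoring_weights=[0.0, 0.0])
```

With an all-zero scoring vector every protein gets the same weight, so the pooled genome vector is simply the mean of the protein vectors. Training would adjust the scoring vector so that more informative proteins contribute more, while the triplet loss pulls related genomes together and pushes unrelated ones apart.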
Conclusion:
PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.
Music:
Enjoy the music based on this article at the end of the episode.
Reference:
Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4
License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00
Official website: https://basebybase.com
Castos player: https://basebybase.castos.com
On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.
Episode link: https://basebybase.castos.com/episodes/protein-set-transformer