Content provided by Gustavo Barra.

Episode 215: Protein Set Transformer for high-diversity viromics

In this episode of PaperCast Base by Base, we explore the Protein Set Transformer (PST), a protein-based genome language model that represents genomes as sets of proteins to improve genome and protein representations across diverse viral datasets.

Study Highlights:
PST embeds each protein with ESM2, concatenates positional and strand vectors onto those embeddings, contextualizes the proteins with a multi-head attention encoder, and produces genome embeddings through learnable weighted pooling in the decoder. The foundation PST-TL models were pretrained on more than 100k dereplicated viral genomes encoding more than 6M proteins, using a triplet-loss objective with PointSwap augmentation, and were evaluated on IMG/VR v4 and MGnify soil virus test sets. PST-TL outperformed other protein- and nucleotide-based methods at recovering genome–genome relationships, including remote ones, and its protein embeddings clustered structural capsid folds and late-gene functional modules. PST also improved annotation transfer for hypothetical proteins through embedding-based and structure-aware clustering, and it boosted viral host-species prediction when used in a graph link-prediction framework.
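To make the pipeline above more concrete, here is a minimal sketch in plain Python of three of the ideas it mentions: appending position and strand features to a protein vector, pooling proteins into a genome vector with softmax weights, and a margin-based triplet loss. It is an illustration under stated assumptions, not the authors' implementation. The function names (featurize, weighted_pool, triplet_loss), the scalar position and strand features, the dot-product scoring vector, and the margin value are all hypothetical. The real model uses ESM2 embeddings, learned positional and strand vectors, and a multi-head attention encoder, none of which are reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def featurize(protein_vec, position, strand, n_proteins):
    """Append position and strand features to a protein vector.

    Assumption: two scalars stand in for the learned positional and
    strand vectors described in the episode notes.
    """
    rel_pos = position / max(n_proteins - 1, 1)  # 0.0 for first, 1.0 for last
    return list(protein_vec) + [rel_pos, float(strand)]


def weighted_pool(protein_vecs, scoring_weights):
    """Pool protein vectors into one genome vector.

    Each protein gets a score (dot product with a scoring vector that
    would be learned in a real model); softmax turns the scores into
    weights, and the genome vector is the weighted average.
    """
    scores = [sum(w * x for w, x in zip(scoring_weights, vec))
              for vec in protein_vecs]
    weights = softmax(scores)
    dim = len(protein_vecs[0])
    return [sum(a * vec[i] for a, vec in zip(weights, protein_vecs))
            for i in range(dim)]


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on Euclidean distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy genome with two proteins (2-dimensional stand-ins for ESM2 vectors).
genome = [
    featurize([1.0, 0.0], position=0, strand=+1, n_proteins=2),
    featurize([0.0, 1.0], position=1, strand=-1, n_proteins=2),
]
genome_vec = weighted_pool(genome, scoring_weights=[0.0, 0.0, 0.0, 0.0])
```

With an all-zero scoring vector every protein receives the same weight, so the pooled genome vector is just the mean of the protein vectors; training would adjust the scoring weights so that more informative proteins contribute more.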

Conclusion:
PST provides transferable genome- and protein-level embeddings that strengthen representation, annotation, and host-prediction tasks for diverse viral and microbial genomics applications.

Music:
Enjoy the music based on this article at the end of the episode.

Reference:
Martin, C., Gitter, A., Anantharaman, K. Protein Set Transformer: a protein-based genome language model to power high-diversity viromics. Nat Commun (2025). https://doi.org/10.1038/s41467-025-66049-4

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website https://basebybase.com

Castos player https://basebybase.castos.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.castos.com/episodes/protein-set-transformer
