Learning about protein language models; it's pretty mind-blowing.
For example, the UMAP cluster indicated in the image contains both human (blue) and worm (red) ribosomal proteins: not sequence homologs, but proteins with common functions.
To make this, I downloaded the per-protein embeddings from UniProt and did a simple UMAP fit of the worm proteins, then mapped the human proteins using the same fit. They line up nicely, though there are lots of species-specific regions too.
@pmcarlton What’s UMAP? I’m not sure I understood the X and Y axes, or what it means for two proteins to be at the same coordinates 😅😅
@pmcarlton Got it! Where do you generate these maps?
@pmcarlton you are very generous!!!
@Chl0e_Girard
So far it's more a toy for exploration than anything else, but I hope to develop it into something useful. I need to find a student to help with it.
@Chl0e_Girard
I just wrote a little walkthrough here: https://github.com/pmcarlton/umap-sequence-plotting
TL;DR: download data files from UniProt, install some Python modules, run a small script, open the interactive plot in your browser.