**Pete** @pmcarlton@mstdn.science · Nov 10, 2022, 06:34

Learning about protein language models; it's pretty mind-blowing.
For example, the UMAP cluster indicated in the image contains both human (blue) and worm (red) ribosomal proteins: not sequence homologs, but proteins with common functions.

To make this I downloaded the per-protein embeddings from UniProt and did a simple UMAP fit of the worm proteins, then mapped the human proteins using the same fit. They line up nicely, though there are lots of species-specific regions too.

[Image: UMAP plot with human proteins in blue and worm proteins in red (59ba6b71065db22f.png)]

**Chloe Girard, PhD** @Chl0e_Girard@qoto.org · Nov 10, 2022, 08:39

@pmcarlton What’s UMAP? I’m not sure I understood the X and Y axes, or what it implies when two proteins are at the same coordinates 😅😅

**Pete** @pmcarlton@mstdn.science · Nov 10, 2022, 09:41

@Chl0e_Girard That's because I gave a crap explanation! =) Each protein is represented by a vector of 1024 numbers derived from a language model (the part I do not understand yet), and UMAP is a method to project each vector into a 2-dimensional space (so X and Y axes are just arbitrary "positions") while trying to preserve the original spatial relationships. All the proteins that "look" similar in the vector representation should cluster nearby. Sadly I could not find a worm Iho1 this way!
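One way to look for proteins that "look" similar, without going through the 2-D plot at all, is to rank candidates by cosine similarity in the original vector space. The sketch below is an assumption about how such a search could be done, not the method described in the thread; it uses only NumPy, toy random vectors, and a hypothetical helper `nearest_by_cosine`.

```python
import numpy as np


def nearest_by_cosine(query, candidates, ids, k=5):
    """Return the k (id, similarity) pairs closest to `query` by cosine similarity."""
    query = np.asarray(query, dtype=np.float64)
    candidates = np.asarray(candidates, dtype=np.float64)
    # Normalise to unit length so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    # Indices of the k largest similarities, best first.
    order = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in order]


# Toy data: 200 random "worm" vectors with made-up labels.
rng = np.random.default_rng(0)
worm = rng.normal(size=(200, 1024))
worm_ids = [f"worm_{i}" for i in range(200)]

# A "human" query built as a slightly noisy copy of worm_42.
human_query = worm[42] + 0.1 * rng.normal(size=1024)

hits = nearest_by_cosine(human_query, worm, worm_ids, k=3)
print(hits[0][0])
```

A high similarity here only says the two embeddings are close; whether that reflects shared function is a separate question that the numbers alone do not settle.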

**Chloe Girard, PhD** @Chl0e_Girard@qoto.org · Nov 13, 2022, 08:40

@pmcarlton Got it! Where do you generate these maps?

**Pete** @pmcarlton@mstdn.science · Nov 14, 2022, 04:15

@Chl0e_Girard
I just wrote a little walkthrough here: https://github.com/pmcarlton/umap-sequence-plotting

tl;dr: download data files from UniProt, install some Python modules, run a small script, open the interactive plot in your browser.
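For the "download data files" step, here is a rough sketch of reading an embeddings file into a matrix. It is not the script from the walkthrough. It assumes the download is an HDF5 file holding one dataset per protein accession, read with the `h5py` package. The helper `load_embeddings`, the demo file name, and the `ACC_A`/`ACC_B` keys are hypothetical placeholders.

```python
import os
import tempfile

import h5py
import numpy as np


def load_embeddings(path):
    """Read an HDF5 file with one embedding per key into (ids, matrix)."""
    with h5py.File(path, "r") as f:
        ids = list(f.keys())
        # Each dataset is assumed to be a 1-D vector of equal length.
        matrix = np.stack([f[key][()] for key in ids])
    return ids, matrix


# Build a tiny stand-in file so the example runs without any download.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("ACC_A", data=np.zeros(1024, dtype=np.float32))
    f.create_dataset("ACC_B", data=np.ones(1024, dtype=np.float32))

ids, matrix = load_embeddings(demo_path)
print(len(ids), matrix.shape)
```

With a real download, `demo_path` would be replaced by the path to the file obtained from UniProt, and the resulting matrix could then be passed to a UMAP fit.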

**Chloe Girard, PhD** @Chl0e_Girard@qoto.org · Nov 14, 2022, 11:59

@pmcarlton you are very generous!!!

**Pete** @pmcarlton@mstdn.science · Nov 15, 2022, 00:30

@Chl0e_Girard
So far it's more a toy for exploration than anything else, but I hope to develop it into something useful. I need to find a student to help with it.
