Enhancing Protein Language Models with Structure-based Encoder and Pre-training

Protein language models (PLMs) pre-trained on large-scale protein sequence
corpora have achieved impressive performance on various downstream protein
understanding tasks. Although they implicitly capture inter-residue contact
information, transformer-based PLMs cannot explicitly encode protein structures
to yield better structure-aware protein representations. Moreover, the power of
pre-training on available protein structures has not been explored for improving
these PLMs, even though structure is a key determinant of protein function.
To tackle these limitations, in this work, we enhance PLMs with a
structure-based encoder and structure-based pre-training. We first explore
feasible model architectures that combine the advantages of a state-of-the-art
PLM (i.e., ESM-1b) and a state-of-the-art protein structure encoder (i.e.,
GearNet), and empirically verify that ESM-GearNet, which connects the two
encoders in series, is the most effective combination.
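
For intuition, the serial combination can be sketched in a few lines of PyTorch-style code. This is a hypothetical, self-contained toy and not the authors' implementation (see the repository linked below): TinySeqEncoder, TinyStructEncoder, and SerialESMGearNet are stand-ins for ESM-1b, GearNet, and their serial fusion, in which per-residue embeddings from the sequence model become the node features of the structure encoder.

    import torch
    from torch import nn

    class TinySeqEncoder(nn.Module):
        # Stand-in for a sequence PLM such as ESM-1b (untrained here, for illustration).
        def __init__(self, vocab_size=33, dim=128, n_layers=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, tokens):                   # tokens: [B, L] residue ids
            return self.encoder(self.embed(tokens))  # [B, L, dim] residue embeddings

    class TinyStructEncoder(nn.Module):
        # Stand-in for a structure encoder such as GearNet: message passing on a contact graph.
        def __init__(self, dim=128, n_layers=3):
            super().__init__()
            self.layers = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(n_layers)])

        def forward(self, x, adj):                   # x: [B, L, dim], adj: [B, L, L] contacts
            deg = adj.sum(-1, keepdim=True).clamp(min=1)
            for layer in self.layers:
                neigh = adj @ x / deg                # average over structural neighbors
                x = torch.relu(layer(torch.cat([x, neigh], dim=-1)))
            return x

    class SerialESMGearNet(nn.Module):
        # Serial fusion: the PLM's residue embeddings are the GNN's input node features.
        def __init__(self, seq_encoder, struct_encoder):
            super().__init__()
            self.seq_encoder, self.struct_encoder = seq_encoder, struct_encoder

        def forward(self, tokens, adj):
            h = self.seq_encoder(tokens)             # sequence-aware residue features
            h = self.struct_encoder(h, adj)          # refined with structural context
            return h.mean(dim=1)                     # protein-level representation

    model = SerialESMGearNet(TinySeqEncoder(), TinyStructEncoder())
    tokens = torch.randint(0, 33, (2, 50))           # two toy proteins, 50 residues each
    adj = (torch.rand(2, 50, 50) < 0.1).float()      # toy residue contact graphs
    print(model(tokens, adj).shape)                  # torch.Size([2, 128])

Alternative combinations (e.g., fusing the two encoders' outputs in parallel) are possible; the serial variant lets structural message passing refine sequence-aware features.
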
To further improve the effectiveness of ESM-GearNet, we pre-train it on massive
unlabeled protein structures with contrastive learning, which aligns
representations of co-occurring subsequences so as to capture their biological
correlation.
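
This kind of objective can be illustrated with a minimal, hypothetical sketch of a symmetric InfoNCE-style loss: z1[i] and z2[i] are the encoded representations of two co-occurring views of protein i (e.g., cropped substructures), and the other proteins in the batch serve as negatives. The objective actually used in the paper may differ in details such as view construction and the similarity function.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.07):
        # Symmetric InfoNCE: matching views sit on the diagonal of the similarity matrix.
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature               # [B, B] cosine similarities
        targets = torch.arange(z1.size(0), device=z1.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage with the SerialESMGearNet sketch above: encode two views of each protein,
    # pull matching views together, and push apart views of different proteins.
    # z1 = model(tokens_view1, adj_view1); z2 = model(tokens_view2, adj_view2)
    # loss = info_nce(z1, z2); loss.backward()
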
Extensive experiments on EC and GO protein function prediction benchmarks
demonstrate the superiority of ESM-GearNet over previous PLMs and structure
encoders, and structure-based pre-training on top of ESM-GearNet yields further
clear performance gains.
Our implementation is available at
https://github.com/DeepGraphLearning/GearNet.