PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer

Motivation: As viruses that mainly infect bacteria, phages are key players
across a wide range of ecosystems. Analyzing phage proteins is indispensable
for understanding phages' functions and roles in microbiomes. High-throughput
sequencing enables us to obtain phages from different microbiomes at low cost.
However, compared to the fast accumulation of newly identified phages, phage
protein classification remains difficult. In particular, a fundamental need is
to annotate virion proteins, i.e., structural proteins such as the major tail
and baseplate proteins. Although there are experimental methods for virion protein
identification, they are too expensive or time-consuming, leaving a large
number of proteins unclassified. Thus, there is a great demand to develop a
computational method for fast and accurate phage virion protein classification.
Results: In this work, we adapted the state-of-the-art image classification
model, Vision Transformer, to conduct virion protein classification. By
encoding protein sequences into unique images using chaos game
representation, we can leverage Vision Transformer to learn both local and
global features from sequence ``images''. Our method, PhaVIP, has two main
functions: classifying sequences as phage virion proteins (PVPs) or non-PVPs
and annotating PVP types, such as capsid and tail. We tested PhaVIP on several datasets with
increasing difficulty and benchmarked it against alternative tools. The
experimental results show that PhaVIP has superior performance. After
validating the performance of PhaVIP, we investigated two applications that can
use the output of PhaVIP: phage taxonomy classification and phage host
prediction. The results show the benefit of using classified proteins rather
than all proteins.
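
To make the encoding step concrete, the following is a minimal sketch of a chaos game representation for a protein sequence. It is an illustration under stated assumptions, not the PhaVIP implementation: the function name `cgr_image`, the placement of the 20 amino acids on a regular 20-gon, the contraction ratio of 0.5, and the 32 x 32 grid are all hypothetical choices, and the abstract does not specify the layout PhaVIP actually uses.

```python
import math

# 20 standard residues; this ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cgr_image(seq, size=32, ratio=0.5):
    """Return a size x size grid of visit counts from a chaos-game walk.

    Hypothetical helper for illustration only.
    """
    # Place one vertex per amino acid on the unit circle (assumed layout).
    n = len(AMINO_ACIDS)
    vertices = {
        aa: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, aa in enumerate(AMINO_ACIDS)
    }
    grid = [[0] * size for _ in range(size)]
    x, y = 0.0, 0.0  # start at the centre of the polygon
    for aa in seq.upper():
        if aa not in vertices:
            continue  # skip ambiguous residues such as X
        vx, vy = vertices[aa]
        # Move a fixed fraction of the way toward the residue's vertex.
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        # Map coordinates in [-1, 1] to pixel indices, clamping the upper edge.
        col = min(int((x + 1) / 2 * size), size - 1)
        row = min(int((y + 1) / 2 * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    image = cgr_image("MKTAYIAKQR")
    print(sum(map(sum, image)))  # number of residues plotted
```

Each residue moves the current point toward its vertex, so a pixel's position reflects the recent sequence context. A grid of this kind could then be passed to an image classifier such as a Vision Transformer.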
arxiv.org