Implementation
Components of the PhaBOX
The workflow of PhaBOX is presented in terms of data collection and curation, feature encoding, and model prediction, as shown in the figure below:
Construction of protein cluster database
The protein clusters (PCs) are constructed from the phage sequences using the following procedure:
- In order to be consistent with the gene prediction process of the query sequences, we apply gene finding and protein translation for the DNA genomes using Prodigal.
- We run all-against-all DIAMOND BLASTP on the predicted proteins. Protein pairs with alignment E-value less than 1e-3 are used to create a protein similarity network, where the nodes represent proteins, and the edges represent the recorded alignments. The edge weight encodes the corresponding alignment's E-value.
- The Markov clustering algorithm (MCL) is employed with default parameters to group similar proteins into clusters. All clusters that contain fewer than two proteins are removed.
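The network-building and filtering steps above can be sketched as follows. This is a minimal illustration, not the PhaBOX code: it assumes the DIAMOND hits are in the default 12-column BLAST tabular format (E-value in the 11th column), and the function names `build_similarity_edges` and `drop_singleton_clusters` are hypothetical. The MCL step itself is left to the external tool.

```python
def build_similarity_edges(diamond_lines, max_evalue=1e-3):
    """Turn BLAST-style tabular hits into undirected edges weighted by E-value.

    Assumes 12 tab-separated columns per line, with the query ID, subject ID,
    and E-value in columns 1, 2, and 11 respectively.
    """
    edges = {}
    for line in diamond_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject or evalue >= max_evalue:
            continue  # ignore self-hits and alignments that miss the cutoff
        pair = tuple(sorted((query, subject)))
        # Keep the most significant alignment recorded for each protein pair.
        if pair not in edges or evalue < edges[pair]:
            edges[pair] = evalue
    return edges


def drop_singleton_clusters(clusters):
    """Remove clusters that contain fewer than two proteins."""
    return [cluster for cluster in clusters if len(cluster) >= 2]
```

The resulting edge dictionary can be written out as a three-column edge list and passed to a clustering tool; the clusters it returns are then filtered with `drop_singleton_clusters`.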
Feature encoding
The feature encoding of our tools is based on the alignment results against the protein cluster database. Two major features are derived from the PC alignments:
- Protein cluster sentences
- Protein cluster sharing network
For each query contig, PhaBOX will run Prodigal for protein translation. Then, DIAMOND BLASTP is employed to search against the protein cluster database. Then:
Protein cluster sentences: PhaBOX will identify the matched protein clusters for the translated proteins by conducting a similarity search. To be specific, it will identify the reference protein with the smallest E-value and assign the query protein to this reference protein's cluster. PhaBOX will record both the ID of the PC and the position of the protein in the query contig to construct the PC sentences.
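As a rough sketch of this best-hit assignment, the snippet below builds a PC sentence for one contig. The function name `build_pc_sentence` and the input layout (hit tuples, a protein-to-cluster dictionary, and the proteins listed in contig order) are assumptions made for illustration only.

```python
def build_pc_sentence(hits, protein_to_cluster, protein_order):
    """Return [(position, pc_id), ...] for one query contig.

    hits: iterable of (query_protein, reference_protein, evalue) tuples.
    protein_to_cluster: maps a reference protein ID to its PC ID.
    protein_order: query protein IDs in their order along the contig.
    """
    best = {}
    for query, reference, evalue in hits:
        if reference not in protein_to_cluster:
            continue  # reference protein is not part of any retained cluster
        # Keep the reference protein with the smallest E-value per query.
        if query not in best or evalue < best[query][0]:
            best[query] = (evalue, reference)

    sentence = []
    for position, protein in enumerate(protein_order):
        if protein in best:
            reference = best[protein][1]
            sentence.append((position, protein_to_cluster[reference]))
    return sentence
```

Proteins without any hit are simply absent from the sentence, while the recorded positions preserve the gene order of the remaining tokens.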
Protein cluster sharing network: PhaBOX will estimate how likely it is that a query contig and a reference genome share at least the observed number of protein clusters by chance. To be specific, PhaBOX computes the probability that any two sequences containing a and b protein clusters share at least c clusters. PhaBOX then connects contigs and reference genomes whose similarity is significant.
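One common way to compute such a probability is the upper tail of a hypergeometric distribution. The sketch below assumes that model and introduces n, the total number of protein clusters in the database, which the text above does not name; the exact formula used by PhaBOX may differ.

```python
from math import comb


def prob_share_at_least(a, b, c, n):
    """P(X >= c) for X ~ Hypergeometric(n, a, b).

    a, b: number of protein clusters in each of the two sequences.
    c: observed number of shared clusters.
    n: total number of clusters in the database (an assumed parameter).
    """
    if not (0 <= a <= n and 0 <= b <= n):
        raise ValueError("a and b must lie between 0 and n")
    lower = max(c, 0)
    upper = min(a, b)
    # Sum the probabilities of sharing exactly i clusters for i = c..min(a, b).
    favourable = sum(comb(a, i) * comb(n - a, b - i) for i in range(lower, upper + 1))
    return favourable / comb(n, b)
```

A small probability means that the observed overlap is unlikely under random sharing, which is the kind of evidence used to place an edge between two sequences.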
Model construction
Each subprogram incorporates different feature encoding methods and utilizes a task-specific deep learning model for prediction. For example, PhaMer and PhaTYP are NLP models suited to sequence analysis problems (phage identification and lifestyle prediction), while PhaGCN and CHERRY employ graph-based models that help to investigate sequence-sequence interactions/associations (taxonomic classification and host prediction). Detailed information on each model can be found in the following sections.
PhaMer employs a contextualized embedding model from natural language processing (NLP) to learn protein-associated patterns in phages. Specifically, by converting a sequence into a sentence composed of protein-based tokens, we employ the embedding model to learn both the protein composition and also their associations in phage sequences.
(Fig. 1A) First, we construct the vocabulary containing protein-based tokens, which are essentially protein clusters with high similarities. Then, we apply DIAMOND BLASTP to record the presence of tokens in training phage sequences.
(Fig. 1B) Then, the tokens and their positions will be fed into Transformer for contextual-aware embedding. The embedding layer and the self-attention mechanism in Transformer enable the model to learn the importance of each protein cluster and the protein-protein associations. In addition, by using the phages' host genomes as the negative samples in the training data, the model can learn from the hard cases and thus is more likely to achieve high precision in real data.
PhaMer can directly use the whole sequences for training, avoiding the bias of segmentation. We rigorously tested PhaMer on multiple independent datasets covering different scenarios including the RefSeq dataset, short contigs, simulated metagenomic data, mock metagenomic data, and the public IMG/VR dataset. We compared PhaMer with four competitive learning-based tools and one alignment-based tool (VirSorter) based on a third-party review [2]. The results show that PhaMer competes favorably against the existing tools.
Given the enormous diversity of phages and the sheer amount of unlabeled phages, we formulate the phage classification problem as a semi-supervised learning problem. We choose the GCN as the learning model and combine the strength of both the alignment-based and the learning-based methods.
The input to PhaGCN is a knowledge graph. There are two key components in the knowledge graph: node encoding and edge construction. The node is a numerical vector learned from contigs using a CNN. The edge encodes features from both the sequence similarity and the organization of genes.
Fig. 1 contains the major components for node and edge construction.
- (Fig. 1 A1-A3) To encode a sequence using a node, a pre-trained convolutional neural network (CNN) is adopted to capture features from the input DNA sequence. The CNN model is trained to convert proximate substrings into vectors of high similarity.
- (Fig. 1 B1-B4) The edge construction consists of several steps. We employ a greedy search algorithm to find the best BLASTP results (E-value less than 1e-5) between the translated proteins from the contigs and the database.
- (Fig. 1 B5) Then the Markov clustering algorithm (MCL) is applied to generate protein clusters from the BLASTP result.
- (Fig. 1 B6-B7) Based on the results of BLASTP (sequence similarity) and MCL (shared proteins), we define the edges between sequences (contigs and reference genomes) using two metrics: P_weight and E_weight.
- (Fig. 1 C1) By combining the node’s features and edges, we construct the knowledge graph and feed it to the GCN to classify new phage contigs.
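To make the final step more concrete, the snippet below sketches one generic graph convolution step in plain Python: node features are averaged over neighbours using a symmetrically normalized adjacency matrix with self-loops, then passed through a weight matrix and a ReLU. It illustrates the standard GCN propagation rule only; the layer sizes, edge weights, and training procedure of PhaGCN are not reproduced here, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weights):
    """Compute ReLU(D^-1/2 (A + I) D^-1/2 X W) for a small dense graph.

    adjacency: n x n matrix of non-negative edge weights.
    features: n x d matrix of node features.
    weights: d x h weight matrix.
    """
    n = len(adjacency)
    # Add self-loops so that each node keeps part of its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_hat]
    normalized = [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    aggregated = matmul(normalized, features)  # mix features along edges
    projected = matmul(aggregated, weights)    # linear transformation
    return [[max(0.0, value) for value in row] for row in projected]
```

Because labeled reference genomes and unlabeled contigs sit in the same graph, this kind of propagation lets information flow from labeled to unlabeled nodes, which is what makes the semi-supervised setting workable.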
We compared PhaGCN with three state-of-the-art models specifically designed for phage classification: Phage Orthologous Groups (POG), vConTACT 2.0, and ClassiPhage. The experimental results demonstrated that PhaGCN outperforms other popular methods in classifying new phage contigs.
PhaTYP is a BERT-based model that learns the protein composition and associations from phage genomes to classify the lifestyles of phages.
To address the difficulties of classifying incomplete genomes with limited training data, we divide the lifestyle classification into two tasks: a self-supervised learning task (Fig. 1 A) and a fine-tuning task (Fig. 1 B).
- (Fig. 1A) To circumvent the problem that only a limited number of phages have lifestyle annotations, we applied self-supervised learning to learn protein association features from all the phage genomes using Masked Language Model (Masked LM), aiming to recover the original protein from the masked protein sentences. This task allows us to utilize all the phage genomes for training regardless of available lifestyle annotations.
- (Fig. 1B) In the second task, we will fine-tune the Masked LM on phages with known lifestyle annotations for classification. To ensure that the model can handle short contigs, we apply data augmentation by generating fragments ranging from 100bp to 10,000bp for training.
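The two data preparation ideas above, masking tokens for self-supervised learning and cutting random fragments for augmentation, can be sketched as follows. The 15% mask rate is the conventional BERT default and is an assumption here, as are the function names and the `[MASK]` token string; only the 100bp to 10,000bp fragment range comes from the text.

```python
import random

MASK = "[MASK]"


def mask_sentence(tokens, mask_rate=0.15, seed=None):
    """Replace a random subset of PC tokens with MASK.

    Returns the masked sentence and a dict {position: original_token}
    that a Masked LM would be trained to recover.
    """
    rng = random.Random(seed)
    if not tokens:
        return [], {}
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for position in positions:
        labels[position] = tokens[position]
        masked[position] = MASK
    return masked, labels


def random_fragment(genome, min_len=100, max_len=10_000, seed=None):
    """Cut one random fragment of min_len..max_len bases from a genome."""
    rng = random.Random(seed)
    # Clamp the length range so that short genomes are still handled.
    low = min(min_len, len(genome))
    high = min(max_len, len(genome))
    length = rng.randint(low, high)
    start = rng.randint(0, len(genome) - length)
    return genome[start:start + length]
```

In a full pipeline the fragments would be translated and converted into PC sentences before being fed to the fine-tuning task, so that training inputs resemble the short contigs seen at prediction time.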
We evaluated PhaTYP on contigs of different lengths and contigs assembled from real metagenomic data. The benchmark results against the state-of-the-art methods show that PhaTYP not only achieves the highest performance on complete genomes but also improves the accuracy on short contigs by over 10%.
CHERRY can predict the hosts' taxa (phylum to species) for newly identified viruses based on a multimodal graph.
(Fig. 1A) The multimodal graph incorporates multiple types of interactions, including protein organization information between viruses, the sequence similarity between viruses and prokaryotes, and the CRISPR signals. In addition, we use k-mer frequency as the node features to enhance the learning ability.
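A k-mer frequency vector of the kind used as node features can be computed as in the sketch below. The value k = 4 and the choice to skip k-mers containing ambiguous bases are assumptions made for illustration; the text does not specify them.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries ordered lexicographically over ACGT.
    k-mers containing other characters (for example N) are ignored.
    """
    index = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]
```

Normalizing by the total count keeps the feature comparable between a short contig and a complete genome.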
(Fig. 1B) Rather than directly using these features for prediction, we design an encoder-decoder structure to learn the best embedding for input sequences and predict the interactions between viruses and prokaryotes. The graph convolutional encoder utilizes the topological structure of the multimodal graph and thus, features from both training and testing sequences can be incorporated to embed new node features.
(Fig. 1C) Then, a link prediction decoder is adopted to estimate how likely a given virus-prokaryote pair forms a real infection.
Another feature behind the high accuracy of CHERRY is the construction of the negative training set. The dataset for training is highly imbalanced, with the real host as the positive data and all other prokaryotes as negative data. We carefully addressed this issue using negative sampling. Instead of using a random subset of the negative set for training the model, we apply end-to-end optimization and negative sampling to automatically learn the hard cases during training.
To demonstrate the reliability of our method, we rigorously tested CHERRY on multiple independent datasets including the RefSeq dataset, simulated short contigs, and metagenomic datasets. We compared CHERRY with WIsH, PHP, HoPhage, VPF-Class, RaFAH, HostG, vHULK, PHIST, DeepHost, PHIAF, and VHM-net. The results show that CHERRY competes favorably against the state-of-the-art tools.
PhaVIP is a Python library for phage protein annotation. It has two functions. First, it can classify a protein as either a phage virion protein (PVP) or a non-PVP (binary classification task). Second, it can assign a more detailed annotation to predicted PVPs, such as major capsid, major tail, and portal (multi-class classification task).
We adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification for phages. By encoding protein sequences into unique images using chaos game representation, significant patterns of different proteins can be visualized. Vision Transformer can then capture and learn the patterns from the images and predict the labels of query proteins. Our rigorous tests on several datasets show that PhaVIP has robust performance on low-similarity data and outperforms the existing methods. The applications of predicted virion proteins in phage taxonomy classification and host prediction show that PhaVIP can provide valuable features to improve the performance of downstream phage analysis tasks.
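The idea of turning a protein into an image with a chaos game can be illustrated with the generic sketch below. It places the 20 standard amino acids on the vertices of a regular polygon and moves a point part of the way towards the vertex of each successive residue; the visited points are then counted on a pixel grid. The polygon layout, the contraction ratio of 0.5, and the grid size are all assumptions, and the specific chaos game variant used by PhaVIP may differ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def chaos_game_points(protein, ratio=0.5):
    """Trace a protein through a 20-vertex chaos game on the unit circle."""
    vertices = {
        aa: (
            math.cos(2 * math.pi * i / len(AMINO_ACIDS)),
            math.sin(2 * math.pi * i / len(AMINO_ACIDS)),
        )
        for i, aa in enumerate(AMINO_ACIDS)
    }
    x, y = 0.0, 0.0  # start at the centre of the polygon
    points = []
    for residue in protein.upper():
        if residue not in vertices:
            continue  # skip non-standard residues
        vx, vy = vertices[residue]
        # Move a fixed fraction of the way towards the residue's vertex.
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        points.append((x, y))
    return points


def rasterize(points, size=8):
    """Count the points falling in each cell of a size x size grid."""
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        col = min(size - 1, int((x + 1) / 2 * size))
        row = min(size - 1, int((y + 1) / 2 * size))
        grid[row][col] += 1
    return grid
```

The resulting grid of counts plays the role of a grayscale image, which is the kind of input an image classifier such as Vision Transformer expects.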
The pipeline of PhaVIP is presented in the figure below:
The webserver construction
The architecture of the PhaBOX server consists of two major components: a client web interface and a server backend. The client web interface is responsible for submitting the tasks and displaying the output. It was implemented with JavaScript, CSS, jQuery, Bootstrap, and their extension packages. Specifically, the sequence similarity is visualized by BlasterJS, the protein sequence viewer is presented using pViz, and the topological graph structure is drawn using Plotly in R. The server backend is responsible for interacting with users through the web interface, handling users’ input, and executing the whole prediction process. The interaction with users was implemented with the fast and lightweight Python-based Flask framework and its extension packages. The server backend puts each user’s submission into a queueing system, in which Python maintains a thread pool of customizable size. A child thread is then created to execute the job asynchronously. During the process, a lightweight SQL database stores and updates the job information and status. The architecture thus improves the user experience by decoupling the client web interface, which requires a prompt response, from the server backend, which handles time-consuming jobs. The scheduling method also makes the architecture easy to expand with new computational facilities to meet the increasing demand for analyzing ever-accumulating genome-scale data.
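The queueing pattern described above can be sketched with the Python standard library. This is an illustrative skeleton rather than the server's code: the class name `JobQueue`, the table layout, and the use of an in-memory SQLite database are assumptions, and the Flask routes that would call `submit` and `status` are omitted.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Run jobs in a thread pool and track their status in SQLite."""

    def __init__(self, max_workers=2):
        # One shared connection, guarded by a lock because worker threads
        # and the caller all read and write the same database.
        self._db = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        self._db.execute(
            "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, result TEXT)"
        )
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _set(self, job_id, status, result=None):
        with self._lock:
            self._db.execute(
                "UPDATE jobs SET status = ?, result = ? WHERE id = ?",
                (status, result, job_id),
            )
            self._db.commit()

    def _run(self, job_id, func, args):
        self._set(job_id, "running")
        try:
            self._set(job_id, "finished", str(func(*args)))
        except Exception as exc:  # record the failure instead of losing it
            self._set(job_id, "failed", str(exc))

    def submit(self, func, *args):
        """Register a job, hand it to the pool, and return its ID at once."""
        with self._lock:
            cursor = self._db.execute("INSERT INTO jobs (status) VALUES ('queued')")
            self._db.commit()
            job_id = cursor.lastrowid
        self._pool.submit(self._run, job_id, func, args)
        return job_id

    def status(self, job_id):
        """Return (status, result) for a job, or None if the ID is unknown."""
        with self._lock:
            return self._db.execute(
                "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
            ).fetchone()

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Because `submit` returns as soon as the job is recorded, a web handler can respond immediately and let the client poll `status`, which is the decoupling between the interface and the long-running backend that the paragraph describes.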