Wiki

Table of Contents

📚  General options

The following parameters are common when running phabox2:

--task
    Select a program to run:
    end_to_end    || Run phamer, phagcn, phatyp, and cherry once (default)
    phamer        || Virus identification
    phagcn        || Taxonomy classification
    phatyp        || Lifestyle prediction
    cherry        || Host prediction
    phavip        || Protein annotation
    contamination || Contamination/proviurs detection
    votu          || vOTU grouping (ANI-based or AAI-based)
    tree          || Build phylogenetic trees based on marker genes

--contigs
    Path of the input FASTA file (required)

--proteins
    FASTA file of predicted proteins. (optional)

--len
    Filter the length of contigs || default: 3000
    Contigs with length smaller than this value will not proceed

📚  Special options

Please note that end_to_end task will run phamer, phagcn, cherry, phatyp, and phavip together. Thus, each task's options can also be used for the end_to_end task.

In addition, prediction with non-virus and low-confidence will not be used in the following taxonomy, host, and lifestyle prediction tasks.

The following parameters will be used in specific tasks:

📕  PhaMer (Virus identification)

In-task options:

--reject
    Reject sequences in which the percent proteins aligned to known phages is smaller than the value.
    Default: 10
    Range from 0 to 20

If the proportion is too low, the prediction for downstream analysis will be unreliable.

📕  PhaGCN (Taxonomy)

In-task options:

The options below are used to generate a network for virus-virus connections. The current parameters are optimized for the ICTV 2024 and are highly accurate for grouping genus-level vOTUs. When making changes, make sure you understand 100% what they are.

--aai
    Average amino acids identity  || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw (local version only)
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw is used to plot sub-networks containing the query virus. We use it to generate visualization for our web server. However, it will only print the top 10 largest sub-networks, so we do not recommend that users use it. We have provided the complete network for visualization (network_edges.tsv and network_nodes.tsv file) please check it out via: here

📕  CHERRY (Host)

In-task options:

The options below are used to generate a network for virus-virus connections. The current parameters are optimized for the ICTV 2024 and are highly accurate for grouping genus-level vOTUs. When making changes, make sure you understand 100% what they are.

--aai
    Average amino acids identity  || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw (local version only)
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw is used to plot sub-networks containing the query virus. We use it to generate visualization for our web server. However, it will only print the top 10 largest sub-networks, so we do not recommend that users use it. We have provided the complete network for visualization (network_edges.tsv and network_nodes.tsv file) please check it out via: here

The options below are used to align contigs to CRISPRs.

--cpident
    Alignment identity for CRISPRs || default: 90 || range from 90 to 100

--ccov
    Alignment coverage for CRISPRs || default: 90 || range from 0 to 100

--blast (local version only)
    BLAST program for CRISPRs || default: blastn || blastn or blastn-short
    blastn-short will lead to more sensitive results but require more time to execute the program

The default parameters are optimized for predicting prokaryotic hosts for the virus with 98% accuracy (data from the NCBI RefSeq database). When making changes, make sure you understand 100% what they are.

📕  PhaTYP (Lifestyle)

In-task options:

There are no additional options for lifestyle prediction. Only need to follow the general options.

📕  PhaVIP (annotation)

Please note that running task end_to_end, phamer, phagcn, phatyp, and cherry, will automatically run phavip. The output files are the same.

In-task options:

--sensitive (local version only)
    Sensitive when search for the prokaryotic genes || default: N ||  Y or N
    Y will lead to more sensitive results but require more time to execute the program

📗  Contamination

In-task options:

There are no additional options for lifestyle prediction. Only need to follow the general options.

📘  vOTU grouping

In-task options:

--mode
    Mode for clustering ANI based or AAI based || default: ANI || ANI or AAI

AAI-based options:

--aai
    Average amino acids identity for AAI based genus grouping || default: 75 || range from 0 to 100

--pcov
    Protein-level coverage for AAI based genus grouping || default: 80 || range from 0 to 100

--share
    Minimum shared number of proteins for AAI based genus grouping || default: 15 || range from 0 to 100

ANI-based options:

--ani
    Alignment identity for ANI-based clustering  || default: 95 || range from 0 to 100

--tcov
    Alignment coverage for ANI-based clustering || default: 85 || range from 0 to 100

📙  Pylogenetic tree

In-task options:

--marker
    A list of markers used to generate tree || default: terl portal
    You can choose more than one marker to generate the tree from below:

    The marker genes were obtained from the RefSeq 2024:
        endolysin      || 91% prokaryotic virus have endolysin
        holin          || 75% prokaryotic virus have holin
        head           || 77% prokaryotic virus have marjor head
        portal protein || 84% prokaryotic viruses have portal
        terl           || 92% prokaryotic viruses have terminase large subunit

        Using combinations of these markers can improve the accuracy of the tree
        But will decrease the number of sequences in the tree.

--mcov
    Alignment coverage for matching marker genes || default: 50 || range from 0 to 100

--mpident
    Alignment identity for matching marker genes || default: 25 || range from 0 to 100

--msa (local version only)
    Whether run msa || default: N || Y or N
    Y will run msa for the marker genes using mafft
    But this will require more time to execute the program

--tree (local version only)
    Whether build a tree || default: N || Y or N
    Y will generate the tree based on the marker genes using FastTree
    But this will require more time to execute the program

Table of Contents

All the outputs can be found in the PATH_TO_OUT/final_prediction/ folder. According to the --task you run, the total number of files may differ.

📕  PhaMer (Virus identification)

final_prediction
├── phamer_prediction.tsv
└── phamer_supplementary
     ├── all_predicted_contigs.fa    || DNA sequences > --length
     ├── all_predicted_protein.fa    || Proteins predicted by prodigal-gv
     ├── gene_annotation.tsv         || protein annotation based on blastp
     ├── predicted_virus.fa          || Vrial DNA sequences
     ├── predicted_virus_protein.fa  || Vrial proteins
     ├── alignment_results.tab       || blastp results against db
     └── uncertain_sequences_for_contamination_task.fa      || please run contamination task

The main output phamer_prediction.tsv is generated in tabular-separated (TSV) format composed of six fields:

Accession       Length  Pred    Proportion      PhaMerScore     PhaMerConfidence
example_0       29445   virus   0.1             1.0             lower than reject threshold
example_2       5971    virus   0.86            1.0             high-confidence
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Pred: virus or non-virus.
  4. Proportion: the proportion of the proteins that can be aligned to the virus database (from 0 to 1).
  5. PhaMerScore: the prediction score given by the deep learning model.
  6. PhaMerConfidence: the confidence of prediction, determined by both Proportion and PhaMerScore.
    • high-confidence
    • medium-confidence
    • low-confidence
    • lower than viral score threshold.
    • lower than reject threshold (according to the --reject parameter, default: 0.1).

For the virus with low-confidence or lower than viral score threshold, we recommend you to run the --task contamination to check their sequence quality.

📕  PhaGCN (Taxonomy)

final_prediction
├── phagcn_prediction.tsv
└── phagcn_supplementary
   ├── all_predicted_contigs.fa   || DNA sequences > --length
   ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
   ├── alignment_results.tab      || blastp results against db
   ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── phagcn_network_edges.tsv   || network file for cytoscape
    └── phagcn_network_nodes.tsv   || network file for cytoscape

The main output phagcn_prediction.tsv is generated in tabular-separated (TSV) format composed of six fields:

Accession       Length  Lineage PhaGCNScore     Genus   GenusCluster
example_0       29445   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes 1.00;1.00;1.00;1.00;1.00;1.00;0.58;0.58 -       singleton
example_103     11376   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus      1.00;1.00;1.00;1.00;1.00;1.00;1.00      Jasminevirus      known_genus
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Lineage: the predicted taxonomy lineage (NCBI version) of the contigs. Each rank is separated by the ';'.
  4. PhaGCNScore: the predicted score for each rank in the lineage. Each rank is separated by the ';'.
  5. Genus: whether the contig has a genus level name ('-' means unknown).
  6. GenusCluster: if the Genus is '-', the program will assign a genus-level grouping result: group_idx (idx = 1, 2, 3, ...) or singleton. This can be viewed as genus-level OTUs based on the average shared protein identities between sequences.

📕  CHERRY (Host)

final_prediction
├── cherry_prediction.tsv
└── cherry_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── cherry_network_edges.tsv   || network file for cytoscape
    └── cherry_network_nodes.tsv   || network file for cytoscape

The main output cherry_prediction.tsv is generated in tabular-separated (TSV) format composed of five fields:

Accession       Length  Host                                            CHERRYScore     Method
example_99      6612    species:Halobiforma_lacisalsi                   1.00            CRISPR-based (DB)
example_98      13996   species:Salinispora_arenicola                   1.00            CRISPR-based (DB)
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Host: the predicted host (NCBI taxonomy) of the contigs. '-' means unknown host.
  4. CHERRYScore: the predicted score from the model.
  5. Method:
    • CRISPR-based(MAG): CRISPRs alignment results from provided MAG (if any)
    • CRISPR-based(DB): CRISPRs alignment results from database.
    • AAI-based: predicting host based on virus-similarity

📕  PhaTYP (Lifestyle)

final_prediction
├── phatyp_prediction.tsv
└── phatyp_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main output phatyp_prediction.tsv is generated in tabular-separated (TSV) format composed of four fields:

Accession       Length  TYPE            PhaTYPScore
example_0       29445   virulent        1.0
example_2       5971    temperate       1.0
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. TYPE: virulent or temperate (virus).
  4. PhaTYPScore: the prediction score given by the deep learning model.

📕  PhaVIP (Protein annotation)

Please note that running task end_to_end, phamer, phagcn, phatyp, and cherry, will automatically run phavip. The output files are the same.

final_prediction
└── phavip_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main output gene_annotation.tsv is generated in tabular-separated (TSV) format composed of four fields:

Genome  ORF     Start   End     Strand  GC      Annotation
example_0 example_0_1       347     790     1       0.619   hypothetical protein
example_0 example_0_2       790     1944    1       0.591   hypothetical protein
example_0 example_0_3       2454    2627    1       0.58    capsid accessory protein
  1. Genome: the accession or the name of the input contigs.
  2. ORF: the ID of the translated protein.
  3. Start: start position on the genome./li>
  4. End: end position on the genome.
  5. Strand: forward (1) or backward(-1).
  6. GC: GC content.
  7. Annotation: protein annotation.

📚  End to end task

final_prediction
├── final_prediction_summary.tsv
├── phamer_supplementary
│   ├── all_predicted_contigs.fa
│   ├── all_predicted_protein.fa
│   ├── gene_annotation.tsv         || outputs of phavip
│   ├── predicted_virus.fa
│   ├── predicted_virus_protein.fa
│   ├── alignment_results.tab
│   └── uncertain_sequences_for_contamination_task.fa      || please run contamination task
├── phagcn_supplementary
│   ├── phagcn_network_edges.tsv
│   └── phagcn_network_nodes.tsv
├── cherry_supplementary
│   ├── cherry_network_edges.tsv
│   └── cherry_network_nodes.tsv
└── phatyp_supplementary

In the end-to-end mode, except for the aforementioned xxx_prediction.tsv files, a final_prediction_summary.tsv is generated by merging the outputs of all subprograms.

In addition, prediction with non-virus and low-confidence will not be used in the following taxonomy, host, and lifestyle prediction tasks.

📗  Contamination

final_prediction
├── contamination_prediction.tsv
└── contamination_supplementary
    ├── proviruses.fa                              || proteinal provirues
    ├── candidate_provirus.tsv                     || information of the provirus
    └── marker_gene_from_contamination_search.tsv  || marker gene annotation

The main output contamination_prediction.tsv is generated in tabular-separated (TSV) format composed of nine fields:

Accession      Length   Total_genes  Viral_genes  Prokaryotic_genes  Kmer_freq  Contamination  Provirus   Pure_viral
example_270    6617            6            2            0           1.0        0              No         High quality
example_271    17630           28           9            0           1.0        0              No         High quality
  1. Accession: the accession or the name of the input contigs.
  2. Length: the length of input contigs.
  3. Total_genes: number of genes in the contigs (predicted by prodigal-gv)
  4. Viral_genes: number of viral marker genes
  5. Prokaryotic_genes: number of prokaryotic marker genes
  6. Kmer_freq: average frequency of 20-mer.
    • This is a value to estimate the copy number of the genes; usually, the Kmer_freq of 99.9% virus is less than 1.25.
  7. Contamination:
  8. Provirus: Whether the sequence is a provirus
  9. Pure_viral: High quality or Medium quality or Low quality

📘  vOTU grouping

final_prediction
├── ANI_based_vOTU.tsv (ANI-based)
└── AAI_based_vOTU.tsv (AAI-based)

The main output xxx_based_vOTU.tsv is generated in tabular-separated (TSV) format composed of four fields:

Sequence        vOTU            Representative  Length
contig_33       group_19        contig_33       49448
contig_34       group_19        contig_33       4484
  1. Accession: the accession or the name of the input contigs.
  2. vOTU: the cluster ID.
  3. Representative: the representative genome.
  4. Length: the length of input contigs.

📙  Phylogenetic tree

final_prediction
├── combined_marker.msa (if msa =='Y')    || concatenate the MSA between different marker (local version only)
├── combined.tree (if tree == 'Y')        || phylogenetic tree based on FastTree (local version only)
└── tree_supplementary
    └── finded_marker_xxx_combined_db.fa  || the fined marker and database marker

Table of Contents

Some outputs from PhaBOX can help you to draw figures for your research. We will show some examples below. Hope they will help

📘  Portein visualization

The protein annotation file gene_annotation.tsv can be used to generate the protein organization using PyGenomeViz.

# Make sure you have installed the PyGenomeViz
# pip install pygenomeviz

#Load the data
import pandas as pd
data = pd.read_csv('gene_annotation.tsv', sep='\t')

# Convert your data into a format suitable for pygenomeviz. You need to extract relevant information:
# Extract the relevant columns
annotations = []
for index, row in data.iterrows():
    annotations.append({
        'seq_id': row['Genome'],
        'start': row['Start'],
        'end': row['End'],
        'strand': row['Strand'],
        'annotation': row['Annotation'],
    })

from pygenomeviz import GenomeViz
# Initialize GenomeViz
gv = GenomeViz()
# Add each annotation as a track
for ann in annotations:
    gv.add_feature(
        seq_id=ann['seq_id'],
        start=ann['start'],
        end=ann['end'],
        strand=ann['strand'],
        label=ann['annotation']
    )

# Render the visualization
gv.render()
# Save to a file
gv.savefig('genome_viz.png')
# Or display it
gv.show()

An example is:

Genome Annotation
Genome Annotation


📘  Network visualization

The network files xxx_edges.csv and xxx_nodes.csv can be used to input into Cytoscape.

Step1 Import the network: 
  - Go to File > Import > Table from File
  - select xxx_edges.csv
Step2 Import the network: 
  - Go to File > Import > Table from File
  - select xxx_nodes.csv
Setp3 Adjust the visualization:
  - choose a layout
  - color the nodes

An example is:

Network Visualization
Network Visualization


📘  Tree visualization

The tree file combined.tree can be used as input to iTOL. The metadata of the reference genomes can be found in PhaBOX2's database RefVirus.csv.

An example is:

Tree Visualization
Tree Visualization