Wiki-PhaBOX

--task
    Select a program to run:
    end_to_end    || Run phamer, phagcn, phatyp, and cherry once (default)
    phamer        || Virus identification
    phagcn        || Taxonomy classification
    phatyp        || Lifestyle prediction
    cherry        || Host prediction
    phavip        || Protein annotation
    contamination || Contamination/proviurs detection
    votu          || vOTU grouping (ANI-based or AAI-based)
    tree          || Build phylogenetic trees based on marker genes

--contigs
    Path of the input FASTA file (required)

--proteins
    FASTA file of predicted proteins. (optional)

--len
    Filter the length of contigs || default: 3000
    Contigs with length smaller than this value will not proceed

📚 Special options

Please note that end_to_end task will run phamer, phagcn, cherry, phatyp, and phavip together. Thus, each task's options can also be used for the end_to_end task.

In addition, prediction with non-virus and low-confidence will not be used in the following taxonomy, host, and lifestyle prediction tasks.

The following parameters will be used in specific tasks:

📕 PhaMer (Virus identification)

In-task options:

--reject
    Reject sequences in which the percent proteins aligned to known phages is smaller than the value.
    Default: 10
    Range from 0 to 20

If the proportion is too low, the prediction for downstream analysis will be unreliable.

📕 PhaGCN (Taxonomy)

In-task options:

The options below are used to generate a network for virus-virus connections. The current parameters are optimized for the ICTV 2024 and are highly accurate for grouping genus-level vOTUs. When making changes, make sure you understand 100% what they are.

--aai
    Average amino acids identity  || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw (local version only)
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw is used to plot sub-networks containing the query virus. We use it to generate visualization for our web server. However, it will only print the top 10 largest sub-networks, so we do not recommend that users use it. We have provided the complete network for visualization (network_edges.tsv and network_nodes.tsv file) please check it out via: here

📕 CHERRY (Host)

In-task options:

The options below are used to generate a network for virus-virus connections. The current parameters are optimized for the ICTV 2024 and are highly accurate for grouping genus-level vOTUs. When making changes, make sure you understand 100% what they are.

--aai
    Average amino acids identity  || default: 75 || range from 0 to 100

--share
    Minimum shared number of proteins || default: 15 || range from 0 to 100

--pcov
    Protein-based coverage || default: 80 || range from 0 to 100

--draw (local version only)
    Draw network examples for the query virus relationship. || default: N || Y or N

--draw is used to plot sub-networks containing the query virus. We use it to generate visualization for our web server. However, it will only print the top 10 largest sub-networks, so we do not recommend that users use it. We have provided the complete network for visualization (network_edges.tsv and network_nodes.tsv file) please check it out via: here

The options below are used to align contigs to CRISPRs.

--cpident
    Alignment identity for CRISPRs || default: 90 || range from 90 to 100

--ccov
    Alignment coverage for CRISPRs || default: 90 || range from 0 to 100

--blast (local version only)
    BLAST program for CRISPRs || default: blastn || blastn or blastn-short
    blastn-short will lead to more sensitive results but require more time to execute the program

The default parameters are optimized for predicting prokaryotic hosts for the virus with 98% accuracy (data from the NCBI RefSeq database). When making changes, make sure you understand 100% what they are.

📕 PhaTYP (Lifestyle)

In-task options:

There are no additional options for lifestyle prediction. Only need to follow the general options.

📕 PhaVIP (annotation)

Please note that running task end_to_end, phamer, phagcn, phatyp, and cherry, will automatically run phavip. The output files are the same.

In-task options:

--sensitive (local version only)
    Sensitive when search for the prokaryotic genes || default: N ||  Y or N
    Y will lead to more sensitive results but require more time to execute the program

📗 Contamination

In-task options:

There are no additional options for lifestyle prediction. Only need to follow the general options.

📘 vOTU grouping

In-task options:

--mode
    Mode for clustering ANI based or AAI based || default: ANI || ANI or AAI

AAI-based options:

--aai
    Average amino acids identity for AAI based genus grouping || default: 75 || range from 0 to 100

--pcov
    Protein-level coverage for AAI based genus grouping || default: 80 || range from 0 to 100

--share
    Minimum shared number of proteins for AAI based genus grouping || default: 15 || range from 0 to 100

ANI-based options:

--ani
    Alignment identity for ANI-based clustering  || default: 95 || range from 0 to 100

--tcov
    Alignment coverage for ANI-based clustering || default: 85 || range from 0 to 100

📙 Pylogenetic tree

In-task options:

--marker
    A list of markers used to generate tree || default: terl portal
    You can choose more than one marker to generate the tree from below:

    The marker genes were obtained from the RefSeq 2024:
        endolysin      || 91% prokaryotic virus have endolysin
        holin          || 75% prokaryotic virus have holin
        head           || 77% prokaryotic virus have marjor head
        portal protein || 84% prokaryotic viruses have portal
        terl           || 92% prokaryotic viruses have terminase large subunit

        Using combinations of these markers can improve the accuracy of the tree
        But will decrease the number of sequences in the tree.

--mcov
    Alignment coverage for matching marker genes || default: 50 || range from 0 to 100

--mpident
    Alignment identity for matching marker genes || default: 25 || range from 0 to 100

--msa (local version only)
    Whether run msa || default: N || Y or N
    Y will run msa for the marker genes using mafft
    But this will require more time to execute the program

--tree (local version only)
    Whether build a tree || default: N || Y or N
    Y will generate the tree based on the marker genes using FastTree
    But this will require more time to execute the program

final_prediction
├── phamer_prediction.tsv
└── phamer_supplementary
     ├── all_predicted_contigs.fa    || DNA sequences > --length
     ├── all_predicted_protein.fa    || Proteins predicted by prodigal-gv
     ├── gene_annotation.tsv         || protein annotation based on blastp
     ├── predicted_virus.fa          || Vrial DNA sequences
     ├── predicted_virus_protein.fa  || Vrial proteins
     ├── alignment_results.tab       || blastp results against db
     └── uncertain_sequences_for_contamination_task.fa      || please run contamination task

The main output phamer_prediction.tsv is generated in tabular-separated (TSV) format composed of six fields:

Accession       Length  Pred    Proportion      PhaMerScore     PhaMerConfidence
example_0       29445   virus   0.1             1.0             lower than reject threshold
example_2       5971    virus   0.86            1.0             high-confidence

Accession: the accession or the name of the input contigs.
Length: the length of input contigs.
Pred: virus or non-virus.
Proportion: the proportion of the proteins that can be aligned to the virus database (from 0 to 1).
PhaMerScore: the prediction score given by the deep learning model.
PhaMerConfidence: the confidence of prediction, determined by both Proportion and PhaMerScore.
- high-confidence
- medium-confidence
- low-confidence
- lower than viral score threshold.
- lower than reject threshold (according to the --reject parameter, default: 0.1).

For the virus with low-confidence or lower than viral score threshold, we recommend you to run the --task contamination to check their sequence quality.

📕 PhaGCN (Taxonomy)

final_prediction
├── phagcn_prediction.tsv
└── phagcn_supplementary
   ├── all_predicted_contigs.fa   || DNA sequences > --length
   ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
   ├── alignment_results.tab      || blastp results against db
   ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── phagcn_network_edges.tsv   || network file for cytoscape
    └── phagcn_network_nodes.tsv   || network file for cytoscape

The main output phagcn_prediction.tsv is generated in tabular-separated (TSV) format composed of six fields:

Accession       Length  Lineage PhaGCNScore     Genus   GenusCluster
example_0       29445   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes 1.00;1.00;1.00;1.00;1.00;1.00;0.58;0.58 -       singleton
example_103     11376   superkingdom:Viruses;clade:Duplodnaviria;kingdom:Heunggongvirae;phylum:Uroviricota;class:Caudoviricetes;genus:Jasminevirus      1.00;1.00;1.00;1.00;1.00;1.00;1.00      Jasminevirus      known_genus

Accession: the accession or the name of the input contigs.
Length: the length of input contigs.
Lineage: the predicted taxonomy lineage (NCBI version) of the contigs. Each rank is separated by the ';'.
PhaGCNScore: the predicted score for each rank in the lineage. Each rank is separated by the ';'.
Genus: whether the contig has a genus level name ('-' means unknown).
GenusCluster: if the Genus is '-', the program will assign a genus-level grouping result: group_idx (idx = 1, 2, 3, ...) or singleton. This can be viewed as genus-level OTUs based on the average shared protein identities between sequences.

📕 CHERRY (Host)

final_prediction
├── cherry_prediction.tsv
└── cherry_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    ├── gene_annotation.tsv        || protein annotation based on blastp
    ├── cherry_network_edges.tsv   || network file for cytoscape
    └── cherry_network_nodes.tsv   || network file for cytoscape

The main output cherry_prediction.tsv is generated in tabular-separated (TSV) format composed of five fields:

Accession       Length  Host                                            CHERRYScore     Method
example_99      6612    species:Halobiforma_lacisalsi                   1.00            CRISPR-based (DB)
example_98      13996   species:Salinispora_arenicola                   1.00            CRISPR-based (DB)

Accession: the accession or the name of the input contigs.
Length: the length of input contigs.
Host: the predicted host (NCBI taxonomy) of the contigs. '-' means unknown host.
CHERRYScore: the predicted score from the model.
Method:
- CRISPR-based(MAG): CRISPRs alignment results from provided MAG (if any)
- CRISPR-based(DB): CRISPRs alignment results from database.
- AAI-based: predicting host based on virus-similarity

📕 PhaTYP (Lifestyle)

final_prediction
├── phatyp_prediction.tsv
└── phatyp_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main output phatyp_prediction.tsv is generated in tabular-separated (TSV) format composed of four fields:

Accession       Length  TYPE            PhaTYPScore
example_0       29445   virulent        1.0
example_2       5971    temperate       1.0

Accession: the accession or the name of the input contigs.
Length: the length of input contigs.
TYPE: virulent or temperate (virus).
PhaTYPScore: the prediction score given by the deep learning model.

📕 PhaVIP (Protein annotation)

Please note that running task end_to_end, phamer, phagcn, phatyp, and cherry, will automatically run phavip. The output files are the same.

final_prediction
└── phavip_supplementary
    ├── all_predicted_contigs.fa   || DNA sequences > --length
    ├── all_predicted_protein.fa   || Proteins predicted by prodigal-gv
    ├── alignment_results.tab      || blastp results against db
    └── gene_annotation.tsv        || protein annotation based on blastp

The main output gene_annotation.tsv is generated in tabular-separated (TSV) format composed of four fields:

Genome  ORF     Start   End     Strand  GC      Annotation
example_0 example_0_1       347     790     1       0.619   hypothetical protein
example_0 example_0_2       790     1944    1       0.591   hypothetical protein
example_0 example_0_3       2454    2627    1       0.58    capsid accessory protein

Genome: the accession or the name of the input contigs.
ORF: the ID of the translated protein.
Start: start position on the genome./li>
End: end position on the genome.
Strand: forward (1) or backward(-1).
GC: GC content.
Annotation: protein annotation.

📚 End to end task

final_prediction
├── final_prediction_summary.tsv
├── phamer_supplementary
│   ├── all_predicted_contigs.fa
│   ├── all_predicted_protein.fa
│   ├── gene_annotation.tsv         || outputs of phavip
│   ├── predicted_virus.fa
│   ├── predicted_virus_protein.fa
│   ├── alignment_results.tab
│   └── uncertain_sequences_for_contamination_task.fa      || please run contamination task
├── phagcn_supplementary
│   ├── phagcn_network_edges.tsv
│   └── phagcn_network_nodes.tsv
├── cherry_supplementary
│   ├── cherry_network_edges.tsv
│   └── cherry_network_nodes.tsv
└── phatyp_supplementary

In the end-to-end mode, except for the aforementioned xxx_prediction.tsv files, a final_prediction_summary.tsv is generated by merging the outputs of all subprograms.

In addition, prediction with non-virus and low-confidence will not be used in the following taxonomy, host, and lifestyle prediction tasks.

📗 Contamination

final_prediction
├── contamination_prediction.tsv
└── contamination_supplementary
    ├── proviruses.fa                              || proteinal provirues
    ├── candidate_provirus.tsv                     || information of the provirus
    └── marker_gene_from_contamination_search.tsv  || marker gene annotation

The main output contamination_prediction.tsv is generated in tabular-separated (TSV) format composed of nine fields:

Accession      Length   Total_genes  Viral_genes  Prokaryotic_genes  Kmer_freq  Contamination  Provirus   Pure_viral
example_270    6617            6            2            0           1.0        0              No         High quality
example_271    17630           28           9            0           1.0        0              No         High quality

Accession: the accession or the name of the input contigs.
Length: the length of input contigs.
Total_genes: number of genes in the contigs (predicted by prodigal-gv)
Viral_genes: number of viral marker genes
Prokaryotic_genes: number of prokaryotic marker genes
Kmer_freq: average frequency of 20-mer.
- This is a value to estimate the copy number of the genes; usually, the Kmer_freq of 99.9% virus is less than 1.25.
Contamination:
Provirus: Whether the sequence is a provirus
Pure_viral: High quality or Medium quality or Low quality

📘 vOTU grouping

final_prediction
├── ANI_based_vOTU.tsv (ANI-based)
└── AAI_based_vOTU.tsv (AAI-based)

The main output xxx_based_vOTU.tsv is generated in tabular-separated (TSV) format composed of four fields:

Sequence        vOTU            Representative  Length
contig_33       group_19        contig_33       49448
contig_34       group_19        contig_33       4484

Accession: the accession or the name of the input contigs.
vOTU: the cluster ID.
Representative: the representative genome.
Length: the length of input contigs.

📙 Phylogenetic tree

final_prediction
├── combined_marker.msa (if msa =='Y')    || concatenate the MSA between different marker (local version only)
├── combined.tree (if tree == 'Y')        || phylogenetic tree based on FastTree (local version only)
└── tree_supplementary
    └── finded_marker_xxx_combined_db.fa  || the fined marker and database marker
    └── finded_marker_xxx_without_db.fa  || the found marker without database marker

# Make sure you have installed the PyGenomeViz
# pip install pygenomeviz

#Load the data
import pandas as pd
data = pd.read_csv('gene_annotation.tsv', sep='\t')

# Convert your data into a format suitable for pygenomeviz. You need to extract relevant information:
# Extract the relevant columns
annotations = []
for index, row in data.iterrows():
    annotations.append({
        'seq_id': row['Genome'],
        'start': row['Start'],
        'end': row['End'],
        'strand': row['Strand'],
        'annotation': row['Annotation'],
    })

from pygenomeviz import GenomeViz
# Initialize GenomeViz
gv = GenomeViz()
# Add each annotation as a track
for ann in annotations:
    gv.add_feature(
        seq_id=ann['seq_id'],
        start=ann['start'],
        end=ann['end'],
        strand=ann['strand'],
        label=ann['annotation']
    )

# Render the visualization
gv.render()
# Save to a file
gv.savefig('genome_viz.png')
# Or display it
gv.show()

An example is:

Genome Annotation

📘 Network visualization

The network files xxx_edges.csv and xxx_nodes.csv can be used to input into Cytoscape.

Step1 Import the network: 
  - Go to File > Import > Table from File
  - select xxx_edges.csv
Step2 Import the network: 
  - Go to File > Import > Table from File
  - select xxx_nodes.csv
Setp3 Adjust the visualization:
  - choose a layout
  - color the nodes

An example is:

Network Visualization

📘 Tree visualization

The tree file combined.tree can be used as input to iTOL. The metadata of the reference genomes can be found in PhaBOX2's database RefVirus.csv.

An example is:

Tree Visualization

Wiki

Table of Contents

📚 General options

📚 Special options

📕 PhaMer (Virus identification)

📕 PhaGCN (Taxonomy)

📕 CHERRY (Host)

📕 PhaTYP (Lifestyle)

📕 PhaVIP (annotation)

📗 Contamination

📘 vOTU grouping

📙 Pylogenetic tree

Table of Contents

📕 PhaMer (Virus identification)

📕 PhaGCN (Taxonomy)

📕 CHERRY (Host)

📕 PhaTYP (Lifestyle)

📕 PhaVIP (Protein annotation)

📚 End to end task

📗 Contamination

📘 vOTU grouping

📙 Phylogenetic tree

Table of Contents

📘 Portein visualization

📘 Network visualization

📘 Tree visualization

Wiki

Parameter Options

Table of Contents

📚 General options

📚 Special options

📕 PhaMer (Virus identification)

📕 PhaGCN (Taxonomy)

📕 CHERRY (Host)

📕 PhaTYP (Lifestyle)

📕 PhaVIP (annotation)

📗 Contamination

📘 vOTU grouping

📙 Pylogenetic tree

Output Formats

Table of Contents

📕 PhaMer (Virus identification)

📕 PhaGCN (Taxonomy)

📕 CHERRY (Host)

📕 PhaTYP (Lifestyle)

📕 PhaVIP (Protein annotation)

📚 End to end task

📗 Contamination

📘 vOTU grouping

📙 Phylogenetic tree

Special usage of PhaBOX outputs

Table of Contents

📘 Portein visualization

📘 Network visualization

📘 Tree visualization