agat_sp_manage_functional_annotation.pl

DESCRIPTION

The script takes a GFF3 file as input, together with BLAST and/or InterPro output, and attaches functional annotation to the corresponding features in the GFF file.

The BLAST result against a protein database (outfmt 6) is used to fill the NAME field/attribute of genes and the PRODUCT field/attribute of mRNAs.

The InterPro result file (.tsv) is used to fill the DBXREF field/attribute with terms from Pfam, TIGR, InterPro, GO, KEGG, etc.

With the -id option, the script replaces every ID with a unique ID built from the given prefix, a letter indicating the kind of feature (G, T, C, E, U), and the feature number.
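The ID scheme described above can be sketched in a few lines of Python. This is only an illustration, not the script's actual code: the function name `make_uniq_id`, the feature-type keys, and the zero-padding width (inferred from the example ID `prefixE00000000001` given under the -id option) are assumptions.

```python
# Letters as listed in this documentation: G gene, T mRNA, C CDS, E exon, U UTR.
# The dictionary keys (lower-cased GFF feature types) are an assumption.
FEATURE_LETTERS = {
    "gene": "G",
    "mrna": "T",
    "cds": "C",
    "exon": "E",
    "five_prime_utr": "U",
    "three_prime_utr": "U",
}


def make_uniq_id(prefix, feature_type, number, width=11):
    """Build prefix + feature letter + zero-padded feature number.

    The default width is an assumption taken from the documented example.
    """
    letter = FEATURE_LETTERS[feature_type.lower()]
    return f"{prefix}{letter}{number:0{width}d}"


first_exon_id = make_uniq_id("ABCDEF", "exon", 1)
print(first_exon_id)
```

With the -nb option, the `number` argument of the first feature would simply start at the requested value instead of 1.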

The result is written to the specified output file, or to STDOUT.

About the TSV format from InterProScan:

The TSV format presents the match data in columns as follows:

Protein Accession (e.g. P51587)
Sequence MD5 digest (e.g. 14086411a2cdf1c4cba63020e1622579)
Sequence Length (e.g. 3418)
Analysis (e.g. Pfam / PRINTS / Gene3D)
Signature Accession (e.g. PF09103 / G3DSA:2.40.50.140)
Signature Description (e.g. BRCA2 repeat profile)
Start location
Stop location
Score - is the e-value (or score) of the match reported by member database method (e.g. 3.1E-52)
Status - is the status of the match (T: true)
Date - is the date of the run
(InterPro annotations - accession (e.g. IPR002093) - optional column; only displayed if -iprlookup option is switched on)
(InterPro annotations - description (e.g. BRCA2 repeat) - optional column; only displayed if -iprlookup option is switched on)
(GO annotations (e.g. GO:0005515) - optional column; only displayed if --goterms option is switched on)
(Pathways annotations (e.g. REACT_71) - optional column; only displayed if --pathways option is switched on)

Note: the 9th column usually contains an e-value, but it can also contain a score (e.g. for Prosite). For an explanation of the difference, see https://myhits.isb-sib.ch/cgi-bin/help?doc=scores.html
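To make the column layout above concrete, here is a minimal Python sketch that reads such a TSV into named fields. It is not how the script itself parses the file. The field names, the helper `dbxrefs`, and the `database:accession` layout are hypothetical; the start, stop, and date values in the sample line are made up for illustration. The sketch assumes multiple GO terms in one cell are separated by `|`.

```python
import csv
import io

# Names chosen for this sketch; the last four columns are optional.
COLUMNS = [
    "protein_accession", "md5", "length", "analysis",
    "signature_accession", "signature_description",
    "start", "stop", "score", "status", "date",
    "interpro_accession", "interpro_description", "go_terms", "pathways",
]


def parse_interpro_tsv(handle):
    """Yield one dict per match line, padding absent optional columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        row = row + [""] * (len(COLUMNS) - len(row))
        yield dict(zip(COLUMNS, row))


def dbxrefs(record):
    """Collect cross-references from one record (hypothetical layout)."""
    refs = [f"{record['analysis']}:{record['signature_accession']}"]
    if record["interpro_accession"]:
        refs.append(f"InterPro:{record['interpro_accession']}")
    refs.extend(term for term in record["go_terms"].split("|") if term)
    return refs


# Illustrative line: accessions come from the examples above, the rest is made up.
sample = (
    "P51587\t14086411a2cdf1c4cba63020e1622579\t3418\tPfam\tPF09103\t"
    "BRCA2 repeat profile\t100\t200\t3.1E-52\tT\t01-01-2020\t"
    "IPR002093\tBRCA2 repeat\tGO:0005515\n"
)
records = list(parse_interpro_tsv(io.StringIO(sample)))
print(dbxrefs(records[0]))
```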

About the outfmt 6 from BLAST:

 qseqid    query (e.g., gene) sequence id
 sseqid    subject (e.g., reference genome) sequence id
 pident    percentage of identical matches
 length    alignment length
 mismatch  number of mismatches
 gapopen   number of gap openings
 qstart    start of alignment in query
 qend      end of alignment in query
 sstart    start of alignment in subject
 send      end of alignment in subject
 evalue    expect value
 bitscore  bit score
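The twelve columns listed above map naturally onto a small record type. The following Python sketch shows one way to read a single outfmt 6 line. The names `BlastHit` and `parse_outfmt6_line` are made up for this example, and the sample values are illustrative rather than real BLAST output.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_outfmt6_line(line: str) -> BlastHit:
    """Split one tab-separated line and convert each column to its type."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    qseqid, sseqid, pident, *ints, evalue, bitscore = fields
    return BlastHit(qseqid, sseqid, float(pident), *map(int, ints),
                    float(evalue), float(bitscore))


# Illustrative line only.
sample = "gene1-mRNA1\tsp|P51587|BRCA2_HUMAN\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-120\t410\n"
hit = parse_outfmt6_line(sample)
print(hit.sseqid, hit.evalue)
```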

Currently the hit with the best (lowest) e-value wins. A hit with a weaker e-value that still passes the defined threshold will therefore not be reported, even if it has a better PE value.
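The selection rule described above can be sketched as follows. This is a simplified illustration, not the script's implementation. The function name `best_hits`, the inclusive threshold comparison, and the tie-breaking (first hit seen is kept) are assumptions.

```python
def best_hits(hits, max_evalue=1e-6):
    """Keep, per query, the hit with the lowest e-value under the threshold.

    `hits` is an iterable of (qseqid, sseqid, evalue) tuples. The PE value
    plays no role in the ranking, which mirrors the behaviour described above.
    """
    best = {}
    for qseqid, sseqid, evalue in hits:
        if evalue > max_evalue:
            continue  # hit does not pass the e-value threshold
        if qseqid not in best or evalue < best[qseqid][1]:
            best[qseqid] = (sseqid, evalue)
    return best


# Illustrative hits: q1 has two passing hits, q2 has none.
example_hits = [
    ("q1", "protA", 1e-10),
    ("q1", "protB", 1e-50),
    ("q1", "protC", 1e-3),
    ("q2", "protD", 1e-2),
]
selected = best_hits(example_hits)
print(selected)
```

In this example `protA` passes the threshold but is dropped in favour of `protB`, regardless of what PE level either protein carries.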

SYNOPSIS

agat_sp_manage_functional_annotation.pl -f infile.gff [ -b blast_infile --db uniprot.fasta -i interpro_infile.tsv --id ABCDEF --output outfile ]
agat_sp_manage_functional_annotation.pl --help

OPTIONS

-f, --reffile, -ref, --gff or --gff3

String - Input GTF/GFF file.
-b or --blast

String - Input blast ( outfmt 6 = tabular ) file that will be used to complement the features read from the first file (specified with --ref).
-d or --db

String - The fasta file that has been used as DB for the blast. Gene names and products/descriptions will be fished from this file.
--be or --blast_evalue

Integer - Maximum e-value to keep the annotation from the blast file. By default 1e-6.
--pe

Integer - The PE (protein existence) value in the UniProt header indicates the type of evidence that supports the existence of the protein. This option sets the highest protein existence level from which functional information may still be lifted. Default 5.

1. Experimental evidence at protein level
2. Experimental evidence at transcript level
3. Protein inferred from homology
4. Protein predicted
5. Protein uncertain
-i or --interpro

String - Input interpro file (.tsv) that will be used to complement the features read from the first file (specified with --ref).
-id

String - This option changes the IDs. From the ID prefix given as input (usually 6 letters), it creates unique IDs such as prefixE00000000001, where E means exon. Instead of E, the letter can be C for CDS, G for gene, T for mRNA, or U for UTR. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations), the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
-idau

Boolean - This option (id all uniq) is similar to the -id option, but features that share an ID will each receive a different, unique ID.
-nb

Integer - Useful only if -id is used. This option defines the number at which the numbering starts. Default 1.
-o or --output

String - Output GFF file. If no output file is specified, the output will be written to STDOUT.
-v

Boolean - Verbose output, for debugging purposes.
-h or --help

Boolean - Display this helpful text.
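
As an illustration of the --db and --pe options described above, the Python sketch below pulls the gene name (GN=) and protein existence level (PE=) out of a UniProt-style FASTA header and applies a PE cutoff. The function names are hypothetical, and the handling of headers without a PE field (treated here as not passing) is an assumption, not documented behaviour of the script.

```python
import re

PE_PATTERN = re.compile(r"\bPE=([1-5])\b")
GN_PATTERN = re.compile(r"\bGN=(\S+)")


def protein_existence(header):
    """Return the PE level (1 to 5) found in the header, or None if absent."""
    match = PE_PATTERN.search(header)
    return int(match.group(1)) if match else None


def gene_name(header):
    """Return the GN= value found in the header, or None if absent."""
    match = GN_PATTERN.search(header)
    return match.group(1) if match else None


def passes_pe(header, max_pe=5):
    """True if the header carries a PE level no higher than max_pe."""
    pe = protein_existence(header)
    return pe is not None and pe <= max_pe


header = (">sp|P51587|BRCA2_HUMAN Breast cancer type 2 susceptibility protein "
          "OS=Homo sapiens OX=9606 GN=BRCA2 PE=1 SV=3")
print(gene_name(header), protein_existence(header), passes_pe(header, max_pe=2))
```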