agat_sp_fix_cds_phases.pl
DESCRIPTION
This script fixes the CDS phases. It handles incomplete gene models (missing start codon, CDS length a multiple of 3 or not, i.e. with an offset of 1 or 2) on both the + and - strands.
How does this script work?
AGAT uses the fasta sequence to verify the CDS frame.
If the CDS starts with a start codon, the phase of the first CDS piece is set to 0.
If there is no start codon:
  - If there is only one stop codon in the sequence and it is located at the last position, the phase of the first CDS piece is set to 0.
  - If there is no stop codon, the phase of the first CDS piece is set to 0 (because the sequence can be translated without a premature stop codon).
  - If there are one or more stop codons in the middle of the sequence, the check is re-executed with an offset of +2 nucleotides:
    - If there is only one stop codon in the sequence and it is located at the last position, the phase of the first CDS piece is set to 0.
    - If there is no stop codon, the phase of the first CDS piece is set to 0 (because the sequence can be translated without a premature stop codon).
    - If there are one or more stop codons in the middle of the sequence, the check is re-executed with an offset of +1 nucleotide:
      - If there is only one stop codon in the sequence and it is located at the last position, the phase of the first CDS piece is set to 0.
      - If there is no stop codon, the phase of the first CDS piece is set to 0 (because the sequence can be translated without a premature stop codon).
      - If there are still stop codons, the original phase is kept and a warning is thrown. In this last case, no translation without a premature stop codon succeeded in any of the 3 possible phases.
If the CDS is made of multiple CDS pieces (i.e. a discontinuous feature), the remaining CDS pieces are then checked according to the first CDS piece.
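The search described above can be sketched in a few lines of Python. This is only an illustration, not AGAT's actual Perl code: the function names are hypothetical, the standard genetic code is assumed (ATG as the only start codon; TAA, TAG and TGA as stop codons), and the input is assumed to be the spliced CDS sequence already oriented 5' to 3' (i.e. reverse-complemented for the - strand). The sketch returns the offset at which translation succeeds; how that offset is reported as a phase is left to the script.

```python
# Assumption: standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def has_premature_stop(cds_seq, offset):
    """Return True if a stop codon occurs before the last complete codon
    when reading cds_seq starting at the given offset."""
    seq = cds_seq.upper()[offset:]
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # A stop codon in the last position is a normal terminator, so skip it.
    return any(codon in STOP_CODONS for codon in codons[:-1])


def first_clean_offset(cds_seq, offsets=(0, 2, 1)):
    """Try the offsets in the order described above (0, then +2, then +1).

    Returns the first offset without a premature stop codon, or None if
    every offset yields one (the case where the original phase is kept
    and a warning is issued).
    """
    if cds_seq.upper().startswith(START_CODON):
        return 0  # a start codon settles the question immediately
    for offset in offsets:
        if not has_premature_stop(cds_seq, offset):
            return offset
    return None
```

For example, first_clean_offset("GGCTGAAACCC") skips offset 0 (the second codon is TGA) and settles on offset 2, where the sequence reads CTG AAA CCC.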
What is a phase?
For features of type "CDS", the phase indicates where the next codon begins
relative to the 5' end (where the 5' end of the CDS is relative to the strand
of the CDS feature) of the current CDS feature. For clarification the 5' end
for CDS features on the plus strand is the feature's start and the 5' end
for CDS features on the minus strand is the feature's end. The phase is one of
the integers 0, 1, or 2, indicating the number of bases forward from the start
of the current CDS feature the next codon begins. A phase of "0" indicates that
a codon begins on the first nucleotide of the CDS feature (i.e. 0 bases forward),
a phase of "1" indicates that the codon begins at the second nucleotide of this
CDS feature and a phase of "2" indicates that the codon begins at the third
nucleotide of this region. Note that ‘Phase’ in the context of a GFF3 CDS
feature should not be confused with the similar concept of frame that is also a
common concept in bioinformatics. Frame is generally calculated as a value for
a given base relative to the start of the complete open reading frame (ORF) or
the codon (e.g. modulo 3) while CDS phase describes the start of the next codon
relative to a given CDS feature.
The phase is REQUIRED for all CDS features.
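As an illustration of this definition, the phase of each CDS piece follows from the phase and length of the piece before it. The sketch below is not taken from AGAT; the function name is hypothetical, and the piece lengths are assumed to be listed in 5' to 3' order along the transcript (so in descending genomic order for the - strand).

```python
def propagate_phases(piece_lengths, first_phase=0):
    """Compute the GFF3 phase of each CDS piece.

    piece_lengths: lengths (in bases) of the CDS pieces, 5' to 3'.
    first_phase:   phase of the first piece (0, 1 or 2).
    """
    phases = []
    phase = first_phase
    for length in piece_lengths:
        phases.append(phase)
        # Bases of this piece left over after its last complete codon.
        leftover = (length - phase) % 3
        # The next piece must supply the bases that complete that codon,
        # so its phase is the number of bases still missing.
        phase = (3 - leftover) % 3
    return phases
```

With piece lengths 10, 5 and 9 and a first phase of 0, the first piece holds three codons plus one extra base, so the second piece starts with the two bases that finish that codon (phase 2); its remaining three bases form one codon, so the third piece has phase 0.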
SYNOPSIS
agat_sp_fix_cds_phases.pl --gff infile.gff -f fasta [ -o outfile ]
agat_sp_fix_cds_phases.pl --help
OPTIONS
-g, --gff or -ref
Input GTF/GFF file.
-fa or --fasta
Input fasta file.
-v or --verbose
Add verbosity.
-o or --output
Output GFF file. If no output file is specified, the output will be written to STDOUT.
-c or --config
String - Input agat config file. By default AGAT uses the agat_config.yaml file from the working directory if there is one; otherwise it uses the original agat_config.yaml shipped with AGAT. To get the agat_config.yaml locally type: "agat config --expose". The --config option lets you use your own AGAT config file (located elsewhere or named differently).
-h or --help
Display this helpful text.