GFF to BED conversion
It exists many GFF formats and many GTF formats (see here for a complete review) and many tools to perform the conversion. We will try to see in this review the main differences.
Table of Contents
Test summary
tool | Comment |
---|---|
AGAT | default RGB color to 255,0,0 |
PASA | Particular 3rd column that contains a list of names |
bedops | each gff feature give one line. Only the 6 first colums are correct |
Kent utils | extra coma at the end of 11th and 12th column |
The GFF file to convert
The test file is a GFF3 file:
##gff-version 3
# This is a test sample
scaffold625 maker gene 337818 343277 . + . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 maker mRNA 337818 343277 . + . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 maker tss 337916 337918 . + . ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625 maker start_codon 337916 337918 . + . ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625 maker CDS 337915 337971 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 340733 340841 . + 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341518 341628 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker CDS 341964 343033 . + 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 maker stop_codon 343031 343033 . + . ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625 maker exon 337818 337971 . + . ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625 maker exon 340733 340841 . + . ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625 maker exon 341518 341628 . + . ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625 maker exon 341964 343277 . + . ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717
scaffold625 maker five_prime_utr 337818 337914 . + . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 maker three_prime_UTR 343034 343277 . + . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
AGAT
AGAT v0.2.2
agat_convert_sp_gff2bed.pl --gff 1_test.gff -o 1_test_agat.bed
scaffold625 337817 343277 CLUHART00000008717 0 + 337914 343033 255,0,0 4 154,109,111,1314 0,2915,3700,4146
PASA
PASA pasa-v2.4.1
./PASApipeline/misc_utilities/gff3_file_to_bed.pl test_1.gff > 1_test_transdecoder.bed
#gffTags
scaffold625 337817 343277 ID=CLUHART00000008717;CLUHARG00000005458;TUBB3_2 0 + 337914 343033 0 4 154,109,111,1314 0,2915,3700,4146
bedops
version: 2.4.37
gff2bed < 1_test.gff > 1_test_bedops.bed
scaffold625 337817 337914 CLUHART00000008717:five_prime_utr . + maker five_prime_utr . ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625 337817 337971 CLUHART00000008717:exon1 . + maker exon . ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625 337817 343277 CLUHARG00000005458 . + maker gene . ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625 337817 343277 CLUHART00000008717 . + maker mRNA . ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625 337914 337971 CLUHART00000008717:cds . + maker CDS 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 337915 337918 CLUHART00000008717:start . + maker start_codon . ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625 337915 337918 CLUHART00000008717:tss . + maker tss . ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625 340732 340841 CLUHART00000008717:cds . + maker CDS 0 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 340732 340841 CLUHART00000008717:exon2 . + maker exon . ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625 341517 341628 CLUHART00000008717:cds . + maker CDS 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 341517 341628 CLUHART00000008717:exon3 . + maker exon . ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625 341963 343033 CLUHART00000008717:cds . + maker CDS 2 ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625 341963 343277 CLUHART00000008717:exon4 . + maker exon . ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717
scaffold625 343030 343033 CLUHART00000008717:stop . + maker stop_codon . ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625 343033 343277 CLUHART00000008717:three_prime_utr . + maker three_prime_UTR . ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
Kent utils
version from 26-Feb-2020
./gff3ToGenePred.dms 1_test.gff temp.genePred
./genePredToBed.dms temp.genePred 1_test_genePred.bed
scaffold625 337817 343277 CLUHART00000008717 0 + 337914 343033 0 4 154,109,111,1314, 0,2915,3700,4146,
The bed format
Detailed information can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html
Below a description of the different fields:
column | feature type | mandatory | comment |
---|---|---|---|
1 | chrom | X | The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). |
2 | chromStart | X | The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. |
3 | chromEnd | X | The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. |
4 | name | Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode. | |
5 | score | A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). | |
6 | strand | Defines the strand - either '+' or '-'. | |
7 | thickStart | The starting position at which the feature is drawn thickly | |
8 | thickEnd | The ending position at which the feature is drawn thickly | |
9 | itemRgb | An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RBG value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser. | |
10 | blockCount | The number of blocks (exons) in the BED line. | |
11 | blockSizes | A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount. | |
12 | blockStarts | A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. |
/!\ location BED format is 0-based, half-open [start-1, end), while GFF is 1-based, closed [start, end].