GFF to BED conversion

There are many flavors of the GFF and GTF formats (see here for a complete review) and many tools that convert them to BED. This review looks at the main differences between the outputs of those tools.

Test summary

| Tool | Comment |
|---|---|
| AGAT | Default RGB color set to 255,0,0 |
| PASA | Unusual 4th column (name) that contains a list of names |
| bedops | Each GFF feature gives one line. Only the first 6 columns are correct |
| Kent utils | Extra comma at the end of the 11th and 12th columns |

The GFF file to convert

The test file is a GFF3 file:

##gff-version 3
# This is a test sample
scaffold625	maker	gene	337818	343277	.	+	.	ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625	maker	mRNA	337818	343277	.	+	.	ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625	maker	tss	337916	337918	.	+	.	ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625	maker	start_codon	337916	337918	.	+	.	ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625	maker	CDS	337915	337971	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	340733	340841	.	+	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341518	341628	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	CDS	341964	343033	.	+	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	maker	stop_codon	343031	343033	.	+	.	ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625	maker	exon	337818	337971	.	+	.	ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625	maker	exon	340733	340841	.	+	.	ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625	maker	exon	341518	341628	.	+	.	ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625	maker	exon	341964	343277	.	+	.	ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717
scaffold625	maker	five_prime_utr	337818	337914	.	+	.	ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625	maker	three_prime_UTR	343034	343277	.	+	.	ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717

AGAT

AGAT v0.2.2

agat_convert_sp_gff2bed.pl --gff 1_test.gff -o 1_test_agat.bed

scaffold625	337817	343277	CLUHART00000008717	0	+	337914	343033	255,0,0	4	154,109,111,1314	0,2915,3700,4146

PASA

PASA pasa-v2.4.1

./PASApipeline/misc_utilities/gff3_file_to_bed.pl test_1.gff > 1_test_transdecoder.bed

#gffTags
scaffold625	337817	343277	ID=CLUHART00000008717;CLUHARG00000005458;TUBB3_2	0	+	337914	343033	0	4	154,109,111,1314	0,2915,3700,4146
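If a downstream tool expects a single identifier in the name column, the list that PASA writes there can be reduced to its first entry. The following is a minimal Python sketch, not part of PASA; the function name `simplify_pasa_name` is hypothetical, and it assumes the first entry is always the transcript ID, optionally prefixed with `ID=`, as in the line above.

```python
def simplify_pasa_name(bed_line: str) -> str:
    """Keep only the first entry of the 4th (name) column, without 'ID='."""
    # Header or comment lines such as '#gffTags' are returned unchanged.
    if bed_line.startswith("#"):
        return bed_line.rstrip("\n")
    fields = bed_line.rstrip("\n").split("\t")
    first = fields[3].split(";")[0]
    if first.startswith("ID="):
        first = first[len("ID="):]
    fields[3] = first
    return "\t".join(fields)


pasa_line = (
    "scaffold625\t337817\t343277\t"
    "ID=CLUHART00000008717;CLUHARG00000005458;TUBB3_2\t"
    "0\t+\t337914\t343033\t0\t4\t154,109,111,1314\t0,2915,3700,4146"
)
print(simplify_pasa_name(pasa_line))
```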

bedops

version: 2.4.37

gff2bed < 1_test.gff > 1_test_bedops.bed

scaffold625	337817	337914	CLUHART00000008717:five_prime_utr	.	+	maker	five_prime_utr	.	ID=CLUHART00000008717:five_prime_utr;Parent=CLUHART00000008717
scaffold625	337817	337971	CLUHART00000008717:exon1	.	+	maker	exon	.	ID=CLUHART00000008717:exon1;Parent=CLUHART00000008717
scaffold625	337817	343277	CLUHARG00000005458	.	+	maker	gene	.	ID=CLUHARG00000005458;Name=TUBB3_2
scaffold625	337817	343277	CLUHART00000008717	.	+	maker	mRNA	.	ID=CLUHART00000008717;Parent=CLUHARG00000005458
scaffold625	337914	337971	CLUHART00000008717:cds	.	+	maker	CDS	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	337915	337918	CLUHART00000008717:start	.	+	maker	start_codon	.	ID=CLUHART00000008717:start;Parent=CLUHART00000008717
scaffold625	337915	337918	CLUHART00000008717:tss	.	+	maker	tss	.	ID=CLUHART00000008717:tss;Parent=CLUHART00000008717
scaffold625	340732	340841	CLUHART00000008717:cds	.	+	maker	CDS	0	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	340732	340841	CLUHART00000008717:exon2	.	+	maker	exon	.	ID=CLUHART00000008717:exon2;Parent=CLUHART00000008717
scaffold625	341517	341628	CLUHART00000008717:cds	.	+	maker	CDS	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	341517	341628	CLUHART00000008717:exon3	.	+	maker	exon	.	ID=CLUHART00000008717:exon3;Parent=CLUHART00000008717
scaffold625	341963	343033	CLUHART00000008717:cds	.	+	maker	CDS	2	ID=CLUHART00000008717:cds;Parent=CLUHART00000008717
scaffold625	341963	343277	CLUHART00000008717:exon4	.	+	maker	exon	.	ID=CLUHART00000008717:exon4;Parent=CLUHART00000008717
scaffold625	343030	343033	CLUHART00000008717:stop	.	+	maker	stop_codon	.	ID=CLUHART00000008717:stop;Parent=CLUHART00000008717
scaffold625	343033	343277	CLUHART00000008717:three_prime_utr	.	+	maker	three_prime_UTR	.	ID=CLUHART00000008717:three_prime_utr;Parent=CLUHART00000008717
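Since only the first six columns of the bedops output follow the BED layout, one option is to keep just those columns before passing the file to a tool that expects BED6. This is a minimal Python sketch under that assumption; the function name `to_bed6` is hypothetical and is not a bedops feature.

```python
def to_bed6(line: str) -> str:
    """Keep only the first six tab-separated columns of a BED-like line."""
    return "\t".join(line.rstrip("\n").split("\t")[:6])


bedops_line = (
    "scaffold625\t337817\t343277\tCLUHARG00000005458\t.\t+\t"
    "maker\tgene\t.\tID=CLUHARG00000005458;Name=TUBB3_2"
)
print(to_bed6(bedops_line))
```

Note that the result still holds one line per GFF feature, so this does not produce a transcript-level BED12 record.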

Kent utils

version from 26-Feb-2020

./gff3ToGenePred.dms 1_test.gff temp.genePred
./genePredToBed.dms temp.genePred 1_test_genePred.bed

scaffold625	337817	343277	CLUHART00000008717	0	+	337914	343033	0	4	154,109,111,1314,	0,2915,3700,4146,
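To compare this output with the other tools, or to feed it to a parser that does not accept the trailing commas, the 11th and 12th columns can be cleaned up afterwards. This is a minimal Python sketch; the function name `strip_trailing_commas` is hypothetical and is not part of Kent utils.

```python
def strip_trailing_commas(bed12_line: str) -> str:
    """Remove trailing commas from blockSizes (col 11) and blockStarts (col 12)."""
    fields = bed12_line.rstrip("\n").split("\t")
    for index in (10, 11):  # 0-based positions of columns 11 and 12
        if index < len(fields):
            fields[index] = fields[index].rstrip(",")
    return "\t".join(fields)


kent_line = (
    "scaffold625\t337817\t343277\tCLUHART00000008717\t0\t+\t"
    "337914\t343033\t0\t4\t154,109,111,1314,\t0,2915,3700,4146,"
)
print(strip_trailing_commas(kent_line))
```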

The BED format

Detailed information can be found here: https://genome.ucsc.edu/FAQ/FAQformat.html
Below is a description of the different fields:

| Column | Feature type | Mandatory | Comment |
|---|---|---|---|
| 1 | chrom | X | The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). |
| 2 | chromStart | X | The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0. |
| 3 | chromEnd | X | The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. |
| 4 | name | | Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode. |
| 5 | score | | A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). |
| 6 | strand | | Defines the strand: either '+' or '-'. |
| 7 | thickStart | | The starting position at which the feature is drawn thickly. |
| 8 | thickEnd | | The ending position at which the feature is drawn thickly. |
| 9 | itemRgb | | An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser. |
| 10 | blockCount | | The number of blocks (exons) in the BED line. |
| 11 | blockSizes | | A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount. |
| 12 | blockStarts | | A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount. |
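As an illustration of how these twelve fields relate to a GFF transcript, the following Python sketch derives a BED12 line from the exon and CDS coordinates of the test file. It is only a sketch under stated assumptions, not the method used by any of the tools above: the function name `bed12_from_gff` is hypothetical, the score is set to 0, and itemRgb is set to 0 as in the PASA and Kent utils outputs.

```python
def bed12_from_gff(chrom, name, strand, exons, cds):
    """Build a BED12 line from GFF-style (1-based, closed) exon and CDS intervals."""
    exons = sorted(exons)
    chrom_start = exons[0][0] - 1                 # 1-based start -> 0-based start
    chrom_end = max(end for _, end in exons)      # end is unchanged (half-open)
    thick_start = min(start for start, _ in cds) - 1
    thick_end = max(end for _, end in cds)
    block_sizes = [end - start + 1 for start, end in exons]
    # Block starts are 0-based offsets relative to chromStart.
    block_starts = [start - 1 - chrom_start for start, _ in exons]
    fields = [
        chrom, chrom_start, chrom_end, name, 0, strand,
        thick_start, thick_end, 0, len(exons),
        ",".join(map(str, block_sizes)),
        ",".join(map(str, block_starts)),
    ]
    return "\t".join(map(str, fields))


# Exon and CDS coordinates copied from the GFF3 test file.
EXONS = [(337818, 337971), (340733, 340841), (341518, 341628), (341964, 343277)]
CDS = [(337915, 337971), (340733, 340841), (341518, 341628), (341964, 343033)]

print(bed12_from_gff("scaffold625", "CLUHART00000008717", "+", EXONS, CDS))
```

With the test file coordinates, the printed line matches the PASA and Kent utils records shown earlier, apart from the name column (PASA) and the trailing commas (Kent utils).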

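/!\ Location: the BED format is 0-based and half-open, [start-1, end), while GFF is 1-based and closed, [start, end]. Converting a GFF feature to BED therefore subtracts 1 from the start and leaves the end unchanged.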

Figure: coordinate systems (_images/coordinate_systems.jpg)
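The difference between the two coordinate systems can be sketched in a few lines of Python. The function names `gff_to_bed_interval` and `bed_to_gff_interval` are hypothetical and serve only as an illustration.

```python
def gff_to_bed_interval(start: int, end: int) -> tuple:
    """Convert a GFF interval (1-based, closed) to BED (0-based, half-open)."""
    return start - 1, end


def bed_to_gff_interval(start: int, end: int) -> tuple:
    """Convert a BED interval (0-based, half-open) to GFF (1-based, closed)."""
    return start + 1, end


# The mRNA of the test file spans 337818..343277 in GFF coordinates.
bed_start, bed_end = gff_to_bed_interval(337818, 343277)
print(bed_start, bed_end)

# The feature length is the same in both systems.
print(bed_end - bed_start, 343277 - 337818 + 1)
```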