How can I convert a BED file to GTF/GFF with gene_ids?

Question

Given a .bed (BED12), how can I convert it to GTF/GFF formats with gene_id attributes? What is the fastest way or available tools to do it?

For example, given an input like this:

chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

How can I convert it to:

chr27 bed2gtf gene 17266470 17285418 . + . gene_id "ENSG00000151743";
chr27 bed2gtf transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";
chr27 bed2gtf exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";
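
For reference, the coordinate arithmetic implied by this example can be sketched in a few lines of Python. BED starts are 0-based and GTF starts are 1-based, so each start gets +1 while each end stays as it is. This is only an illustration, not how any of the tools below are implemented. The function name `bed12_to_gtf` is hypothetical, the gene_id is passed in by hand, and numbering minus-strand exons from the last genomic block is an assumption. The gene line is left out because its span depends on all transcripts of the gene.

```python
def bed12_to_gtf(line, gene_id, source="bed2gtf"):
    """Turn one BED12 line into GTF transcript and exon lines (hypothetical helper)."""
    f = line.split()
    chrom, start, end, name, strand = f[0], int(f[1]), int(f[2]), f[3], f[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]
    attrs = f'gene_id "{gene_id}"; transcript_id "{name}";'
    # BED start is 0-based, so add 1; the BED end already equals the 1-based inclusive end.
    rows = [(chrom, source, "transcript", start + 1, end, ".", strand, ".", attrs)]
    blocks = list(zip(offsets, sizes))
    if strand == "-":
        # Assumption: exons on the minus strand are numbered from the last genomic block.
        blocks.reverse()
    for n, (offset, size) in enumerate(blocks, 1):
        exon_start = start + offset + 1
        exon_end = start + offset + size
        exon_attrs = f'{attrs} exon_number "{n}"; exon_id "{name}.{n}";'
        rows.append((chrom, source, "exon", exon_start, exon_end, ".", strand, ".", exon_attrs))
    return ["\t".join(str(value) for value in row) for row in rows]


bed_line = ("chr27 17266469 17281218 ENST00000541931.8 1000 + "
            "17266469 17281218 0,0,200 2 103,74, 0,14675,")
for row in bed12_to_gtf(bed_line, "ENSG00000151743"):
    print(row)
```

For the input above, the first exon comes out as 17266470 to 17266572, which matches the exon line I expect.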

score 6 · Accepted Answer · edited Dec 05 '23 at 09:57

Here I provide an ordered list of options (note that I am the author of bed2gtf and bed2gff):

bed2gtf

A high-performance BED-to-GTF converter written in Rust, available at https://github.com/alejandrogzi/bed2gtf.

Usage: bed2gtf[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT> --threads/-t <THREADS>
where:
--bed <BED>: a .bed file
--isoforms <ISOFORMS>: a tab-delimited file
--output <OUTPUT>: path to output file (*.gtf)

The isoforms file specification:

a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):

> cat isoforms.txt
ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
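
Since every transcript in the .bed file should appear in the isoforms file, it can help to check that before running the conversion. The following is a minimal sketch, not part of bed2gtf: the function name `missing_transcripts` is hypothetical, and it assumes the gene/transcript column order shown above, a tab-delimited BED file, and an exact match between the BED name column and the transcript column.

```python
import csv


def missing_transcripts(bed_path, isoforms_path):
    """Return BED transcript names that are absent from the isoforms file."""
    with open(isoforms_path, newline="") as handle:
        # Assumption: column 1 is the gene, column 2 is the transcript.
        known = {row[1] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}
    missing = []
    with open(bed_path) as handle:
        for line in handle:
            # Skip comments and UCSC track/browser header lines.
            if line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue
            if fields[3] not in known:
                missing.append(fields[3])
    return missing
```

Calling `missing_transcripts("in.bed", "isoforms.txt")` (placeholder file names) returns an empty list when the isoforms file covers the whole BED file.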

Converts

Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 3.25 seconds.
Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 1.99 seconds.
Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.20 seconds.
Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.36 seconds.

bed2gff

A Rust BED-to-GFF3 translator that runs in parallel, available at https://github.com/alejandrogzi/bed2gff.

Usage: bed2gff[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT> --threads/-t <THREADS>
where:
--bed <BED>: a .bed file
--isoforms <ISOFORMS>: a tab-delimited file
--output <OUTPUT>: path to output file (*.gff)

The isoforms file specification:

a tab-delimited .txt/.tsv/.csv/... file with genes/isoforms (all the transcripts in .bed file should appear in the isoforms file):

> cat isoforms.txt
ENSG00000198888 ENST00000361390
ENSG00000198763 ENST00000361453
ENSG00000198804 ENST00000361624
ENSG00000188868 ENST00000595977
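
If you do not have an isoforms file yet, one possible way to build it is from a reference GTF for the same annotation. This is a rough sketch under that assumption and is not part of bed2gff: the function name `isoform_pairs` and the file names in the usage note are hypothetical, and the output follows the gene/transcript column order shown above.

```python
import re

GENE_RE = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')


def isoform_pairs(gtf_lines):
    """Collect unique (gene_id, transcript_id) pairs from GTF lines, in file order."""
    pairs = []
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        gene = GENE_RE.search(fields[8])
        transcript = TRANSCRIPT_RE.search(fields[8])
        # Gene-level lines carry no transcript_id and are skipped here.
        if gene and transcript:
            pair = (gene.group(1), transcript.group(1))
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs
```

To write the file, open the reference GTF, pass the file handle to `isoform_pairs`, and write each pair as `gene + "\t" + transcript` on its own line of `isoforms.txt`.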

Converts

Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 2.15 seconds.
Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.

bedToGenePred + genePredToGtf + refTable

UCSC offers a fast way to convert BED into GTF files through KentUtils or specific binaries using:

bedToGenePred in.bed /dev/stdout | genePredToGtf file /dev/stdin out.gtf

You can install these tools with bioconda, or download them here. A gene_id is only produced when a refTable (a format specified in UCSC's web browser) is supplied. For a more detailed answer, see "Obtaining Ucsc Tables Via Ftp And Converting Them To Proper Gff3 Via Genepredtogtf?".

Other options

Other scripts/tools that DO NOT produce a complete GTF file (they lack gene_id attributes) are:

gtf2bed

gtf2bed < foo.gtf | sort-bed - > foo.bed 
awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' foo.bed > foo_from_gtf2bed.gtf

kscript from https://github.com/holgerbrandl/kscript:

kscript https://git.io/vbJ4B my.bed > my.gtf

pfurio/bed2gtf

from https://github.com/pfurio/bed2gtf:

python bed2gtf [options] <mandatory>

AGAT

Considering only the options that produce gene_id attributes, bed2gtf and bed2gff are ~3-4 seconds faster than UCSC's C binaries. More detailed instructions for these tools are given in the linked sources.

Thank you. These are great resources. – gringer, Dec 03 '23 at 07:14