
Given a .bed (BED12), how can I convert it to GTF/GFF formats with gene_id attributes? What is the fastest way or available tools to do it?

For example, given an input like this:

chr27 17266469 17281218 ENST00000541931.8 1000 + 17266469 17281218 0,0,200 2 103,74, 0,14675,

How can I convert it to:

chr27 bed2gtf gene 17266470 17285418 . + . gene_id "ENSG00000151743";
chr27 bed2gtf transcript 17266470 17281218 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8";
chr27 bed2gtf exon 17266470 17266572 . + . gene_id "ENSG00000151743"; transcript_id "ENST00000541931.8"; exon_number "1"; exon_id "ENST00000541931.8.1";
terdon

1 Answer


Here I provide an ordered list of options (note that I am the author of bed2gtf and bed2gff):

bed2gtf

A high-performance BED-to-GTF converter written in Rust, available at https://github.com/alejandrogzi/bed2gtf.

Usage: bed2gtf[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT> --threads/-t <THREADS>

where:

  • --bed <BED>: a .bed file
  • --isoforms <ISOFORMS>: a tab-delimited file
  • --output <OUTPUT>: path to output file (*.gtf)

The isoforms file specification:

A tab-delimited .txt/.tsv/.csv/... file listing genes and their isoforms (every transcript in the .bed file must appear in the isoforms file):

> cat isoforms.txt

ENSG00000198888	ENST00000361390
ENSG00000198763	ENST00000361453
ENSG00000198804	ENST00000361624
ENSG00000188868	ENST00000595977
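Because every transcript in the .bed file has to appear in the isoforms file, it can be worth checking this before running the converter. The following is a minimal sketch, not part of bed2gtf: it assumes the gene ID is in the first column and the transcript ID in the second (as the listing above suggests), and the function names `load_isoforms` and `missing_transcripts` as well as the BED coordinates in the demo are made up for illustration.

```python
import csv
import io


def load_isoforms(handle):
    """Read a two-column, tab-delimited gene/transcript table into a dict keyed by transcript."""
    transcript_to_gene = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene_id, transcript_id = row[0], row[1]  # assumed column order: gene, transcript
        transcript_to_gene[transcript_id] = gene_id
    return transcript_to_gene


def missing_transcripts(bed_handle, transcript_to_gene):
    """Return BED names (column 4) that have no gene in the mapping."""
    missing = []
    for line in bed_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a usable BED record
        if fields[3] not in transcript_to_gene:
            missing.append(fields[3])
    return missing


if __name__ == "__main__":
    # In-memory stand-ins for isoforms.txt and the .bed file (coordinates are invented).
    isoforms = io.StringIO(
        "ENSG00000198888\tENST00000361390\n"
        "ENSG00000198763\tENST00000361453\n"
    )
    bed = io.StringIO(
        "chrM\t100\t200\tENST00000361390\t0\t+\n"
        "chrM\t300\t400\tENST00000361453\t0\t+\n"
        "chrM\t500\t600\tENST00000361624\t0\t+\n"
    )
    mapping = load_isoforms(isoforms)
    print(missing_transcripts(bed, mapping))  # prints ['ENST00000361624']
```

With real files, the two `io.StringIO` objects would be replaced by `open("isoforms.txt")` and `open("in.bed")`; an empty list means the isoforms file covers the whole BED file.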

bed2gtf converts:

  • Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 3.25 seconds.
  • Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 1.99 seconds.
  • Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.20 seconds.
  • Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.36 seconds.
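To make the coordinate handling concrete, here is a rough Python sketch of what a BED12-to-GTF conversion involves for a single record. It is not bed2gtf's actual implementation: the function name `bed12_to_gtf` is hypothetical, only transcript and exon lines are produced (no gene, CDS or codon features), and numbering exons 5' to 3' on the minus strand is an assumption of this sketch.

```python
def bed12_to_gtf(line, gene_id, source="bed2gtf"):
    """Convert one BED12 line into GTF transcript and exon lines (illustrative only)."""
    f = line.split()
    chrom, start, end, name, strand = f[0], int(f[1]), int(f[2]), f[3], f[5]
    # blockSizes and blockStarts carry a trailing comma, so drop empty items.
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]

    attrs = f'gene_id "{gene_id}"; transcript_id "{name}";'
    # BED starts are 0-based and GTF starts are 1-based; end coordinates are unchanged.
    rows = [(chrom, source, "transcript", start + 1, end, ".", strand, ".", attrs)]

    exons = [(start + off + 1, start + off + size) for size, off in zip(sizes, offsets)]
    if strand == "-":
        exons.reverse()  # assumed convention: exon 1 is the 5'-most exon
    for n, (ex_start, ex_end) in enumerate(exons, start=1):
        ex_attrs = f'{attrs} exon_number "{n}"; exon_id "{name}.{n}";'
        rows.append((chrom, source, "exon", ex_start, ex_end, ".", strand, ".", ex_attrs))
    return ["\t".join(str(v) for v in row) for row in rows]


if __name__ == "__main__":
    bed_line = ("chr27 17266469 17281218 ENST00000541931.8 1000 + "
                "17266469 17281218 0,0,200 2 103,74, 0,14675,")
    for row in bed12_to_gtf(bed_line, "ENSG00000151743"):
        print(row)
```

Run on the BED line from the question, this yields a transcript line spanning 17266470-17281218 and a first exon spanning 17266470-17266572, matching the expected output shown in the question. The gene line is left out because its span depends on all isoforms of the gene, which is exactly what the isoforms file is for.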

bed2gff

A Rust BED-to-GFF3 translator that runs in parallel, available at https://github.com/alejandrogzi/bed2gff.

Usage: bed2gff[EXE] --bed/-b <BED> --isoforms/-i <ISOFORMS> --output/-o <OUTPUT> --threads/-t <THREADS>

where:

  • --bed <BED>: a .bed file
  • --isoforms <ISOFORMS>: a tab-delimited file
  • --output <OUTPUT>: path to output file (*.gff)

The isoforms file specification:

A tab-delimited .txt/.tsv/.csv/... file listing genes and their isoforms (every transcript in the .bed file must appear in the isoforms file):

> cat isoforms.txt

ENSG00000198888	ENST00000361390
ENSG00000198763	ENST00000361453
ENSG00000198804	ENST00000361624
ENSG00000188868	ENST00000595977

bed2gff converts:

  • Homo sapiens GRCh38 GENCODE 44 (252,835 transcripts) in 4.16 seconds.
  • Mus musculus GRCm39 GENCODE 44 (149,547 transcripts) in 2.15 seconds.
  • Canis lupus familiaris ROS_Cfam_1.0 Ensembl 110 (55,335 transcripts) in 1.30 seconds.
  • Gallus gallus bGalGal1 Ensembl 110 (72,689 transcripts) in 1.51 seconds.
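The main difference between the two output formats is the ninth (attributes) column. The sketch below contrasts the general GTF and GFF3 syntax for that column; the helper names are hypothetical, the exact set of attributes that bed2gff writes may differ, and the escaping of special characters that GFF3 requires is omitted.

```python
def gtf_attributes(pairs):
    """GTF column 9: space-separated key "value" pairs, each ending in a semicolon."""
    return " ".join(f'{key} "{value}";' for key, value in pairs)


def gff3_attributes(pairs):
    """GFF3 column 9: key=value pairs joined by semicolons."""
    return ";".join(f"{key}={value}" for key, value in pairs)


# The same transcript described in both styles (IDs taken from the isoforms example).
print(gtf_attributes([("gene_id", "ENSG00000198888"), ("transcript_id", "ENST00000361390")]))
print(gff3_attributes([("ID", "ENST00000361390"), ("Parent", "ENSG00000198888")]))
```

In GFF3 the gene/transcript hierarchy is expressed through the reserved `ID` and `Parent` tags, whereas GTF repeats `gene_id` and `transcript_id` on every line.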

bedToGenePred + genePredToGtf + refTable

UCSC offers a fast way to convert BED into GTF files through KentUtils or specific binaries using:

bedToGenePred in.bed /dev/stdout | genePredToGtf file /dev/stdin out.gtf

You can install these tools with bioconda, or download the binaries from UCSC. Proper gene_id attributes are only obtained when using refTables (a format specified in UCSC's web browser); a more elaborate answer is given in "Obtaining Ucsc Tables Via Ftp And Converting Them To Proper Gff3 Via Genepredtogtf?".

Other options

Other scripts/tools that DO NOT produce a complete GTF file (they lack gene_id attributes) are:

  • gtf2bed
gtf2bed < foo.gtf | sort-bed - > foo.bed 
awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' foo.bed > foo_from_gtf2bed.gtf

  • kscript

from https://github.com/holgerbrandl/kscript:

kscript https://git.io/vbJ4B my.bed > my.gtf
  • pfurio/bed2gtf

from https://github.com/pfurio/bed2gtf:

python bed2gtf [options] <mandatory>
  • AGAT


Considering only the options that produce gene_id attributes, bed2gtf and bed2gff are ~3-4 seconds faster than UCSC's C binaries. More detailed instructions for these tools are given in the linked sources.
