
For the two VCF files linked below, I cannot find any genotype in the GT column other than "./.". Is it possible to confirm whether the GT column of these VCF files has been populated (e.g. with genotypes such as "./1", "1/." or "1/1")?

VCF files: LUD.TH179.PASS.dbsnp_cosmic.vcf.gz and LUD.TH238.PASS.dbsnp_cosmic.vcf.gz, from https://zenodo.org/records/6558593.

Here are a few terminal commands I have tried. Each of them returns nothing: no variant records appear under the column headers of the output. I assume this means the GT column was never populated, but I am not sure.

  1. remove homozygous ref genotype from multi-sample vcf
  • bcftools view -i 'GT[*]="alt"' LUD.TH179.PASS.dbsnp_cosmic.vcf.gz | less -SN
  2. extracting heterozygous snp from a vcf file
  • vcftools --gzvcf LUD.TH179.PASS.dbsnp_cosmic.vcf.gz --extract-FORMAT-info GT | grep "0/1"
  3. ID heterozygous variants in VCF file using vcftools
  • vcftools --gzvcf LUD.TH179.PASS.dbsnp_cosmic.vcf.gz --het
  • output (cat out.het)
    • INDV O(HOM) E(HOM) N_SITES F (EMPTY)
CoderQ
  • Can you clarify the issue please? Yes, it looks like this vcf only has ungenotyped variants. What exactly are you asking? – terdon Feb 26 '24 at 12:28
  • @coderQ please carefully consider the answer below: I think you'll find it works, then please consider upvoting and marking as "accepted" (it helps everyone). – M__ Feb 26 '24 at 21:42

1 Answer


I checked one of your files, LUD.TH179.PASS.dbsnp_cosmic.vcf.gz, and can confirm that none of the samples listed in the VCF have a genotype for any of the variants. This isn't surprising: it is often hard, if not impossible, to genotype variants from somatic samples. If you see, for example, a 10% allelic balance, you cannot know whether the variant is homozygous in 10% of the cells, heterozygous in 20% of them, or some combination of the two.
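To make that ambiguity concrete, here is a minimal sketch of the arithmetic. It assumes diploid cells, no copy-number changes and unbiased read sampling, which real tumour samples often violate. The function name expected_vaf is hypothetical and used only for illustration.

```shell
# expected_vaf HET_FRACTION HOM_FRACTION
# Prints the expected alt-allele fraction for a diploid sample in which
# HET_FRACTION of the cells are heterozygous (1 alt allele out of 2) and
# HOM_FRACTION are homozygous alt (2 alt alleles out of 2).
expected_vaf() {
  awk -v het="$1" -v hom="$2" \
    'BEGIN { printf "%.3f\n", (het * 1 + hom * 2) / 2 }'
}

expected_vaf 0    0.10   # homozygous in 10% of the cells
expected_vaf 0.20 0      # heterozygous in 20% of the cells
expected_vaf 0.10 0.05   # a mixture of both
```

All three calls print 0.100, so the observed allele fraction alone cannot distinguish between these scenarios.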

In any case, this is how I confirmed that no variants were genotyped. First select all lines that are not headers (those that do not start with #), then keep fields 10 through the end of the line (the sample columns), convert tabs to newline characters (this leaves one sample field per line, e.g. ./.:.:.:.:.:.:.), cut at the first : to get the genotype, and finally sort and count:

$ zgrep -v '^#' LUD.TH179.PASS.dbsnp_cosmic.vcf.gz | cut -f 10- | 
   tr '\t' '\n' | cut -d: -f1 | sort | uniq -c
537123250 ./.

As you can see above, the result was 537123250 sample/variant combinations, every one of which has the genotype ./. (missing).
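Sorting half a billion lines can be slow, so a possible alternative is to tally the genotypes in a single awk pass instead. This is only a sketch: it assumes GT is the first colon-separated subfield of each sample column (which the VCF specification requires whenever GT is present), and count_gt is a hypothetical helper name, not part of any tool.

```shell
# count_gt: read an uncompressed VCF on stdin and print "COUNT GENOTYPE"
# for every distinct GT value found across all sample columns.
count_gt() {
  awk -F'\t' '
    /^#/ { next }                    # skip header lines
    {
      for (i = 10; i <= NF; i++) {   # sample columns start at field 10
        split($i, parts, ":")        # GT is assumed to be the first subfield
        counts[parts[1]]++
      }
    }
    END { for (gt in counts) print counts[gt], gt }
  '
}

# Usage with the file from the question (not run here):
# gzip -dc LUD.TH179.PASS.dbsnp_cosmic.vcf.gz | count_gt
```

The output order of the awk loop is unspecified, so pipe the result through sort if a stable order matters.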

terdon