Tackling post-Michigan Imputation Server troubles

Background

When meta-analysing many genome-wide association studies, a lack of overlap in variants between genotyping platforms lowers power. However, imputing the missing genotypes, whether untyped across all individuals or missing in a few, against a reference such as HapMap 2, 1000G, the Haplotype Reference Consortium (HRC), or Genome of the Netherlands (GoNL5) ensures >99% overlap and significantly increases power.

As cohort sizes increase, the latest development is cloud-based imputation using phased data from 1000G phase 3 or HRC as a reference, as offered by the Michigan Imputation Server. Within a matter of hours, not counting time spent in the queue, anyone can impute against a high-quality reference and get community-standardised output files.

The backbone is an updated version of Minimac (Minimac3), together with a new VCF-like file format that significantly reduces the (zipped) file sizes. While this file format is in principle VCF-compatible (v4.2+), in practice most users will want Oxford-format or PLINK-format files, which can be used with SNPTEST 2.5+ or PLINK v1.9 beta+, respectively.
Many other programs, such as PRSice or LDPred, also expect Oxford- or PLINK-formatted files.

So you, too, might want to convert. :-)

Step-by-step

  • First thing you need to download is PLINK 2 alpha (so not v1.9!), which can be found here. In my case, being on macOS High Sierra:

   mkdir -v ~/bin
   cd ~/bin
   wget http://s3.amazonaws.com/plink2-assets/plink2_mac_20180709.zip
   unzip plink2_mac_20180709.zip

  • The M3VCF is VCF-like, so slightly different from what you would expect. Here is the top of one of mine:

##fileformat=VCFv4.2
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_1
6 89961 6:89961,. G A … DS:GP:GT 0:1,0,0:0|0
6 100100 6:100100,. G T … DS:GP:GT 0:1,0,0:0|0
6 100112 6:100112,. C A … DS:GP:GT 0:1,0,0:0|0
6 145579 6:145579,. G A … DS:GP:GT 0:1,0,0:0|0
6 147651 6:147651,. A C … DS:GP:GT 0.001:0.999,0.001,0:0|0
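
To look at the top of one of your own files in the same way, a minimal sketch along these lines should do, assuming gzip, grep and head are available. The function name vcf_peek is purely illustrative, and the file name in the example call simply follows the naming used elsewhere in this post.

```shell
# Print the column header line plus the first N records of a gzipped VCF,
# skipping the '##' meta-information lines.
# vcf_peek is a hypothetical helper name, not part of PLINK or Minimac.
vcf_peek() {
    # $1 = gzipped VCF file, $2 = number of records to show (default 5)
    gzip -dc "$1" | grep -v '^##' | head -n "$(( ${2:-5} + 1 ))"
}

# Example call (file name is an assumption):
# vcf_peek cohort1.1kgp3.chr6.dose.vcf.gz 5
```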

Because of the slightly different file format, a data-management program like QCTOOL can read and even merge files like this, but it will not convert them to another format (be it .bgen, .gen, or .vcf).

The alternative is to use PLINK 2 alpha. In my case processing all 22 autosomal chromosomes would be done with this command:

   for i in $(seq 1 22); do
       plink2alpha --vcf cohort1.1kgp3.chr${i}.dose.vcf.gz dosage=GP --export oxford --out cohort1.1kgp3.chr${i}
   done

To produce .gen and accompanying .sample files with genotype probabilities (tagged by GP), it is critical to include dosage=GP after the --vcf file name.
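
Once a chromosome has been converted, a quick sanity check is to compare the number of variant records in the VCF with the number of lines in the .gen file, since the Oxford format holds one variant per line. The sketch below assumes uncompressed .gen output and uses two hypothetical helper names, count_vcf_variants and check_conversion. A mismatch is only a prompt to read the PLINK log, not proof that something went wrong.

```shell
# Count variant records (non-header lines) in a gzipped VCF.
# count_vcf_variants and check_conversion are hypothetical helper names.
count_vcf_variants() {
    # grep -vc prints the count of lines NOT starting with '#';
    # '|| true' stops a count of zero from being reported as a failure.
    gzip -dc "$1" | grep -vc '^#' || true
}

# Compare the VCF record count with the line count of the .gen file.
check_conversion() {
    # $1 = gzipped VCF, $2 = uncompressed .gen file
    vcf_n=$(count_vcf_variants "$1")
    gen_n=$(wc -l < "$2" | tr -d ' ')
    if [ "$vcf_n" -eq "$gen_n" ]; then
        echo "OK $vcf_n"
    else
        echo "MISMATCH vcf=$vcf_n gen=$gen_n"
        return 1
    fi
}

# Example call (file names are assumptions based on the command above):
# for i in $(seq 1 22); do
#     check_conversion cohort1.1kgp3.chr${i}.dose.vcf.gz cohort1.1kgp3.chr${i}.gen
# done
```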

And that’s it. After some hours, depending on your system and the size of the cohort, the VCF files are converted to your favourite file-format flavour.
