Tackling post-Michigan Imputation Server troubles
Background
When meta-analysing many genome-wide association studies, a lack of overlap in variants between genotyping platforms lowers power. Imputing missing genotypes (variants untyped in all individuals, or missing in only a few) against a reference such as HapMap 2, 1000G, the Haplotype Reference Consortium (HRC), or Genome of the Netherlands (GoNL5) ensures >99% overlap and significantly increases power.
As cohort sizes increase, the latest development is cloud-based imputation using phased data from 1000G phase 3 or HRC as a reference, as offered by the Michigan Imputation Server. Within a matter of hours (not counting time spent in the queue) anyone can impute against a high-quality reference and get community-standardised output files.
The backbone is an updated version of Minimac (Minimac3) and the introduction of a new VCF-like file-format which significantly reduces the (zipped) file sizes. While this file-format is in principle VCF-compatible (v4.2+), in reality the user would want to have Oxford-format or PLINK-format files, which can be used with SNPTEST 2.5+ or PLINK v1.9 beta+, respectively.
Many other programs, such as PRSice or LDPred, also expect Oxford- or PLINK-formatted files.
So, you too, might want to convert. :-)
Step-by-step
The first thing you need to download is PLINK 2 alpha (so not v1.9!), which can be found here. In my case, being on macOS High Sierra:
mkdir -v ~/bin
cd ~/bin
wget http://s3.amazonaws.com/plink2-assets/plink2_mac_20180709.zip
unzip plink2_mac_20180709.zip
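Before moving on, it may help to confirm that the shell can actually find the binary. The sketch below is only an illustration: `find_plink2` is a hypothetical helper, and the two candidate names are assumptions (the conversion loop later in this post calls the program as `plink2alpha`, so the unzipped executable was presumably renamed or aliased, which the post does not show).

```shell
# Make ~/bin visible to the shell for this session.
export PATH="$HOME/bin:$PATH"

# find_plink2 is a hypothetical helper: it prints the path of the first
# PLINK 2 binary found on PATH. The candidate names are assumptions.
find_plink2() {
  for name in plink2alpha plink2; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  echo "no PLINK 2 binary found on PATH" >&2
  return 1
}

# Report the binary if there is one; do not abort when there is none.
find_plink2 || true
```

If nothing is found, check that the unzip step really left an executable in `~/bin`.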
The M3VCF is VCF-like, so slightly different from what you would expect. Here is the head of part of one of mine:
##fileformat=VCFv4.2
##FORMAT=
##FORMAT=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_1
6 89961 6:89961,. G A … DS:GP:GT 0:1,0,0:0|0
6 100100 6:100100,. G T … DS:GP:GT 0:1,0,0:0|0
6 100112 6:100112,. C A … DS:GP:GT 0:1,0,0:0|0
6 145579 6:145579,. G A … DS:GP:GT 0:1,0,0:0|0
6 147651 6:147651,. A C … DS:GP:GT 0.001:0.999,0.001,0:0|0
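To read the sample column in the excerpt above: the FORMAT string DS:GP:GT means each sample field holds the dosage, then the three genotype probabilities P(0/0), P(0/1), P(1/1), then the phased best-guess genotype. The dosage is the expected number of ALT alleles, so it should equal P(0/1) + 2 × P(1/1). A minimal sketch that recomputes it for one field (`check_sample_field` is a hypothetical helper, not part of any tool):

```shell
# check_sample_field is a hypothetical helper for illustration.
# Input: one sample field in DS:GP:GT order, as in the excerpt.
check_sample_field() {
  printf '%s\n' "$1" | awk -F':' '{
    split($2, gp, ",")              # gp[1]=P(0/0), gp[2]=P(0/1), gp[3]=P(1/1)
    expected = gp[2] + 2 * gp[3]    # expected ALT-allele count
    printf "DS=%s recomputed=%g GT=%s\n", $1, expected, $3
  }'
}

# The sample field of the last record in the excerpt:
check_sample_field "0.001:0.999,0.001,0:0|0"
# prints: DS=0.001 recomputed=0.001 GT=0|0
```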
Because of the slightly different file-format, a data-management program like QCTOOL can read and even merge files like this, but it will not convert them to another format (be it .bgen, .gen, or .vcf).
The alternative is to use PLINK 2 alpha. In my case, processing all 22 autosomal chromosomes would be done with this command:
for i in $(seq 1 22); do plink2alpha --vcf cohort1.1kgp3.chr${i}.dose.vcf.gz dosage=GP --export oxford --out cohort1.1kgp3.chr${i}; done
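Since the whole run can take hours, a slightly more defensive variant of the loop above may be worth the extra lines. This is a sketch, not the original method: `convert_chr`, `PLINK` and `PREFIX` are illustrative names introduced here, while the PLINK arguments themselves are exactly those of the one-liner.

```shell
# PLINK and PREFIX are illustrative variables; adjust them to your setup.
PLINK="${PLINK:-plink2alpha}"       # binary name as used in this post
PREFIX="${PREFIX:-cohort1.1kgp3}"   # file prefix as used in this post

# convert_chr (hypothetical helper): convert one chromosome, or report
# a missing input file instead of letting PLINK fail on it.
convert_chr() {
  chr="$1"
  vcf="${PREFIX}.chr${chr}.dose.vcf.gz"
  if [ ! -f "$vcf" ]; then
    echo "chr${chr}: ${vcf} not found, skipping" >&2
    return 1
  fi
  "$PLINK" --vcf "$vcf" dosage=GP --export oxford --out "${PREFIX}.chr${chr}"
}

# Keep going when one chromosome fails, but say so.
for i in $(seq 1 22); do
  convert_chr "$i" || echo "chr${i} was not converted" >&2
done
```

Running it with `PLINK=echo` prints the commands instead of executing them, which is a cheap way to check the file names before committing to the real run.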
To produce .gen and accompanying .sample files with genotype probabilities (tagged by GP), it is critical to include the dosage=GP modifier right after the --vcf filename.
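A quick sanity check after conversion is to compare the number of samples in the input VCF with the number in the resulting .sample file. The sketch below relies on two format facts: a VCF has nine fixed columns before the first sample, and an Oxford .sample file starts with two header lines. The function names are hypothetical, and the file names follow the cohort1.1kgp3 naming used in this post.

```shell
# Samples in a VCF: columns after the 9 fixed ones on the #CHROM line.
vcf_sample_count() {
  gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# Samples in an Oxford .sample file: non-empty lines after the 2 headers.
sample_file_count() {
  awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example for chromosome 22; only runs when both files exist.
vcf="cohort1.1kgp3.chr22.dose.vcf.gz"
sample="cohort1.1kgp3.chr22.sample"
if [ -f "$vcf" ] && [ -f "$sample" ]; then
  echo "VCF: $(vcf_sample_count "$vcf") samples, .sample: $(sample_file_count "$sample") samples"
fi
```

If the two counts differ, something went wrong during conversion and the .gen file should not be trusted.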
And that’s it: after some hours, depending on your system and the size of the cohort, the VCF files are converted to your favourite file-format flavour.