The datasets consist of samples genotyped on different Illumina arrays: 610K, 650K, 660K and 1M (see below for details). Each dataset includes the maximum number of SNPs shared by the genotyping chips used for the study. We pooled all data and then extracted the respective individuals with PLINK using the filter --geno 0.05. This is the only filter used, so you can QC as you need. To minimize strand issues, all AT and GC markers were removed. As the last step in the merger we used the b37 positions to obtain the strand from the 1000G data and flipped strands to match 1000G.

To be as clear as possible about how we arrived at these data, I describe the process of lifting the data to Build37 in more detail than necessary for most of you :)

Taking the individual files as generated below, the data were merged in the following order:

730K to 1M to 730K (second set) to 610K to 660K to 650K

A few additional strand flips were necessary at each step. Importantly, PLINK merges monomorphic markers from different strands incorrectly (0-C and 0-G get merged as G-C and no strand conflict is reported). We therefore checked for such strand conflicts outside of PLINK and flipped the strand of the affected monomorphic markers in the dataset to be merged prior to merger.

730K Data (first set)

GenomeStudio report from FtTreeDNA. In-house script to PED. Flipped strand for RefStrand minuses in manifest (HumanOmniExpress-12v1_C.csv). According to the manifest, positions and rsNumbers should be b37 and dbSNP b131 respectively. Removed 1734 markers on physical position 0.

Initial merge with master data (the above lifted 6x0K+1M) found 2795 SNPs that do not match in terms of allele codes. Flipped those. Also, 8 markers had a different chromosome code. All of these are now corrected:

rs17728665 was on chr21 but should be on chr2 and be rs2228443.
rs14115 was on chr22 but should be chrX (chr23).
rs9786841 was on chr24 but should be chr7.
rs13305148 was on chr24 but should be chr5.
rs13303653 was on chr24 but should be chr4.
rs34095730 was on chr24 but should be chr7.
rs35842692 was on chr25 but should be chr24.
rs35361563 was on chr25 but should be chr24.

1M

1. Genotypes were generated in GenomeStudio using the latest manifest HumanOmni1-Quad_v1-0_H.bpm.
2. The GenomeStudio report was converted into PED by an in-house perl script using the PLUS strand (thus corresponding to HapMap).
3. All markers from the minus strand were flipped using PLINK (flip list: markers with "-" in manifest column RefStrand). Converted PED to BED while flipping.
4. Added sex with --impute-sex in PLINK (4 samples were left with undefined sex).
5. Removed CNVs.
6. Removed markers on chr0 (the ones that have been removed by Illumina since the first manifest for the chip).
7. Removed markers with zero call rate (28238).
8. Used SNAP to get dbSNP b131 versions of all rs-numbers (works only for rs-numbers, does not change other SNP_IDs), 1123 of them in total. Updated those in bim.
9. This created 10 duplicates (rs10020631, rs1150767, rs11568391, rs11568509, rs11568659, rs1649933, rs17583782, rs2228397, rs9258315, rs4987020). Removed one from each pair according to call rate, or arbitrarily if the call rate was equal. rs4987020 occurred on chr11 and chr17; chr17 was the new one, so that one was kept.
10. PLINK merge reported 165 strand conflicts when merging with 610K_660K data. Flipped strand for these markers.
11. 196 monomorphic markers were on a different strand from the 730K data. Flipped those.
12. PLINK merge reported 1593 strand conflicts when merging with 730K data. Flipped strand for these markers.

730K Data (second set, 8 samples)

GenomeStudio report from FtTreeDNA. In-house script to PED. Flipped strand for RefStrand minuses in manifest (HumanOmniExpress-12v1_C.csv). According to the manifest, positions and rsNumbers should be b37 and dbSNP b131 respectively.
Removed 1230 markers on chr0.

Initial merge with master data (the above lifted 730K+1M) found 2777 SNPs that do not match in terms of allele codes. Flipped those. 14 monomorphic markers were on a different strand from the previous merge. Flipped those.

610K

1. Genotypes were generated in GenomeStudio using the latest manifest Human610-Quadv1_H (which has Build37 physical positions but old rsNumbers).
2. The GenomeStudio report was converted into PED by an in-house perl script using the PLUS strand (thus corresponding to HapMap).
3. All markers from the minus strand were flipped using PLINK (flip list: markers with "-" (minus) in manifest column RefStrand). Converted PED to BED while flipping.
4. Added sex with --impute-sex in PLINK (31 samples were left with undefined sex).
5. Removed CNVs.
6. Removed markers on chr0 (the ones that have been removed by Illumina since the first manifest for the chip).
7. Removed markers with zero call rate (5103).
8. Used SNAP to get dbSNP b131 versions of all rs-numbers (works only for rs-numbers, does not change other SNP_IDs), 509 of them in total. Updated those in bim. This created one double marker (rs12643283), both instances of which had 100% call rate. Removed one arbitrarily.
9. Flipped strand for 6 markers (5 of them monomorphic) to merge with the previous merge (see merge order above).

660K

1. Genotypes were generated in GenomeStudio using the latest manifest Human660-Quadv1_H (which has Build37 physical positions but old rsNumbers).
2. The GenomeStudio report was converted into PED by an in-house perl script using the PLUS strand (thus corresponding to HapMap).
3. All markers from the minus strand were flipped using PLINK (flip list: markers with "-" in manifest column RefStrand). Converted PED to BED while flipping.
4. Added sex with --impute-sex in PLINK (67 samples were left with undefined sex).
5. Removed CNVs.
6. Removed markers on chr0 (the ones that have been removed by Illumina since the first manifest for the chip).
7. Removed markers with zero call rate (28708).
8. Used SNAP to get dbSNP b131 versions of all rs-numbers (works only for rs-numbers, does not change other SNP_IDs), 444 of them in total. Updated those in bim. No doubles created.
9. Flipped strand for 3 markers (2 of them monomorphic) to merge with the previous merge (see merge order above).

650K

We got instructions from Illumina as follows:

Please find enclosed below a link to download the strand translation files for HumanHap650Yv3 array:
https://fts.illumina.com/seos/1000/mpd/ui22022013ea6ec1f2a555a152fdf9e07c7e240a5b

Here is a short description of the content of the files:

-Strand Translation File: The strand translation file takes the information from our current manifest and report the following: The purpose of this file is to allow a user to correlate strand information and translate genotypes in their results files accordingly.

-Strand Translation File .forward_strand_flip_list file is a subset of the strand translation file, which reports only those SNPs that will need to be flipped in a forward strand genotype report (from Genome Studio) to obtain a plus strand genotype report.

-Strand Translation File SNPs no Strand Info file is a subset of the strand translation file, which reports all the SNPs for which we were unable to obtain Genome Assembly strand information using Mega Blast. These markers should probably be dropped from any comparison that requires translation of genotypes in plus strand orientation.

1. PED -> BED file was generated from the GenomeStudio report using the Forward strand (there is no Plus strand info there).
2. Kept only markers present in "Strand Translation File".
3. Flipped strand for markers in Strand Translation File .forward_strand_flip_list.
4. Excluded markers in the "no_strand" file, but none were actually excluded, since they are already absent from the "Strand Translation File".
5. Updated rsNumbers via SNAP to b131 (528 markers).
6. Used the UCSC liftover tool to lift Build36 physical positions to Build37. 165 positions were reported as "deleted in new".
Omitted those from the bim file. Updated positions for the rest.
7. Physical position 141932985 was associated with two rs-numbers. The one on chr7 is reported as "deleted in new" in UCSC. dbSNP has a different position for rs7578144, and that one is actually used downstream.
   chr2 rs7578144
   chr7 rs16622 - OUT
8. On the first merge attempt there were 298607 strand problems. Flipped strand for those.
9. In addition there were 9 cases of "Warning: different chromosome" and "Warning: different physical position": rs12043679, rs13002544, rs2342694, rs2569201, rs17863175, rs1578263, rs13285529, rs2228443, rs14115. Removed those from the 650K dataset.
10. In addition there were 20 cases of "Warning: different physical position": rs4433978, rs7595643, rs4664277, rs2011207, rs11743665, rs4256345, rs9460309, rs11967812, rs7755116, rs9497402, rs2249255, rs2155163, rs9507310, rs1475276, rs12435258, rs12433837, rs7824, rs1017238, rs3819263, rs9624480. Kept those. They will get updated physical positions from the 660K x 610K x 1M dataset, where positions are b37 (by Illumina). These rsNumbers were not among those affected by SNAP.
11. The second merge attempt revealed that rs1113572 is AG in 650K and AC in the newer data. Removed this marker from the 650K data.
12. While finally merging with the previous merge (see merge order above), PLINK saw 3 strand conflicts, but there were in addition 5263 monomorphic markers from the other strand. Flipped all these.
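
For readers who want to reproduce the strand checks done outside of PLINK, here is a minimal Python sketch of the idea behind them: dropping AT/GC markers, and spotting monomorphic markers that sit on the other strand (the 0-C vs 0-G case that PLINK merges silently as G-C). This is not the in-house script used for these data. The function names are hypothetical, and the sketch assumes allele codes as they appear in a PLINK .bim file, with "0" standing for a missing allele, and that AT/GC markers have already been removed from the master data.

```python
# Illustrative sketch only; names are hypothetical, not from the original pipeline.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
AMBIGUOUS_PAIRS = ({"A", "T"}, {"C", "G"})


def observed(alleles):
    """Set of real alleles, ignoring the '0' (missing allele) code."""
    return {a for a in alleles if a != "0"}


def is_strand_ambiguous(alleles):
    """True for A/T and C/G markers, whose strand cannot be read from allele codes."""
    return observed(alleles) in AMBIGUOUS_PAIRS


def flip(alleles):
    """Complement every allele; '0' stays '0'."""
    return tuple(COMPLEMENT.get(a, a) for a in alleles)


def _compatible(master, incoming):
    # Compatible means: at most two distinct alleles in total, and the result
    # is not an A/T or C/G pair (assumed absent, since those were removed).
    union = observed(master) | observed(incoming)
    return len(union) <= 2 and union not in AMBIGUOUS_PAIRS


def merge_action(master, incoming):
    """Decide what to do with one marker of the dataset to be merged.

    Returns 'keep' if the allele codes already agree, 'flip' if they only
    agree after complementing the incoming alleles, else 'conflict'.
    """
    if _compatible(master, incoming):
        return "keep"
    if _compatible(master, flip(incoming)):
        return "flip"
    return "conflict"


if __name__ == "__main__":
    # Monomorphic on opposite strands: would be merged silently as G-C.
    print(merge_action(("0", "C"), ("0", "G")))
    # Different allele codes that flipping cannot reconcile (AG vs AC).
    print(merge_action(("A", "G"), ("A", "C")))
```

A marker that is monomorphic for A in one dataset and for C in the other returns "keep" here, because both orientations give a valid two-allele SNP and the allele codes alone cannot settle the strand. Such cases need an external reference, for example the 1000G strand lookup by b37 position mentioned at the top.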