============================================================== PreGSF90 - a program to process the genomic information for the BLUPF90 family of programs ============================================================== Ignacio Aguilar and Ignacy Misztal, University of Georgia email: iaguilar at inia.org.uy; ignacy at uga.edu 01/29/09 - 09/13/11 Summary ======= Program PreGSF90 helps implement the genomic selection following the single-step methodology as presented by Aguilar et al. in Journal of Dairy Science 2010:(93)743-752. In this methodology the relationship matrix A based on the pedigree information is replaced by matrix H, which combines the pedigree and genomic information. The main difference between inv(A) and inv(H) is matrix of structure GimA22i=inv(G)-inv(A22), where G is a genomic relationship matrix and A22 is a relationship matrix for genotyped animals. Programs PreGSF90 is run after Renumf90. It is also run automatically by application programs like BLUPF90, REMLF90, GIBBSxF90 or BLUP90IOD when their parameter file contains OPTION for SNP_FILE. Input files =========== parameter file as created by RENUMF90 with genotype file specified with keyword SNP_FILE Genotype file Field 1 - animal ID with format as in pedigree file Field 2 - genotype with 0,1,2 and 5 (missing) or real values for gene content 0.12 ... Fields need to be separated by at least one space and Field 2 should be fixed format (nI1 or nF4.2). Renumbered ID for genotypes File gt.XrefId, where "gt" is genotype file, contains IDs renumberd seqentially. This file is created by RENUMF90. Optionally, the program can read allele frequencies, G or its inverse, A22 or its inverse, etc,. as specified by respective OPTION lines. Output files ============ GimA22i Contains GimA22i in binary for use by later programs, With option, this file can be stored as ASCII. freqdata.count contains gene frequencies if requested Optionally, the program can store files such as G or its inverse, A22 or its inverse, etc,. as specified by respective OPTION lines. Options for G ============== The genomic relationship matrix can be created in different ways. OPTION whichG x specifies how G is created. The variable x can be: 1: G=ZZ'/k ; as in VanRaden 2008 (default) 2: Amin et al; Leuttenger et al; using MMP subroutines with Z = M*sqrt(w) => ZWZ'/k 3: As 3 with modification UAR from Yang et al 2010 OPTION whichfreq x specifies what frequencies are used to create G. The variable x can be: 0: read from file as specified in option FreqFile 1: 0.5 2: current calcuated from genotypes(default) OPTION minfreq x ignore all SNP with MAF < x active only when whichfreq # 1 OPTION FreqFile ../freqdata (default freqdata if options 0 or 1 for center or scale are chosen) Reads frequency from a file, for example based on allele frequencies calculated by estfreq.f90 (VanRaden, 2009) format : snp, freq OPTION whichScale x specifies how G is scaled. The variable x can be: codes 1: 2sum(p(1-p) - VanRaden 2008 (default) 2: trace(ZZ')/n - Legarra 2009, Hayes 2009 3: correction - Gianola et al 2009 OPTION maxsnps x Set the maximum length of string for reading marker data from file, only necessary if greater than default (400,000) OPTION weightedG Reads weight from file to create weighted G = ZDZ' Quality Control (QC) for G ========================== By default the following QC are run: MAF Call rate (SNPs & animals) Monomorphic Parent-progeny conflicts (SNPs & animals) OPTION minfreq x ignore all SNP with MAF < x, default value 0.05 OPTION callrate x ignores SNPs with call rates < x (number of calls/number of individuals with genotypes), default 0.90 OPTION callrateAnim x ignores Genotypes with call rates < x (number of calls/number of SNPs), default 0.90 OPTION monomorphic ignores monomorphic SNPs OPTION hwe [x] check departure of heterozigous from Hardy-Weinberg Equilibrium, by default this QC is not run optional parameter x set the maximum difference between observed and expected frequency, if no value 0.15 is used (Wiggans et al 2009 JDS) OPTION high_correlation [x] check for high correlated locus, by default this QC is not run optional parameter x set the maximum difference in allele frequency to check pair of locus, if no value 0.025 is used. a pair of locus is consider high correlated if the genotypes were all the same (0-0, 1-1, 2-2) or the opposite (0-2, 1-1, 2-0) (Wiggans et al 2009 JDS) OPTION verify_parentage x Verify parent-progeny Mendelian conflicts. Write report to file Gen_conflicts. Option x could be: 0: no action 1: only detect 2: detect and search for an alternate parent; no change to any file. Not yet implemented 3: detect and elminate animals with conflicts (default) OPTION exclusion_threshold x Number of parent-progeny exclusions as precentage all SNP for determining wrong relationship (default = 2) OPTION exclusion_threshold_snp Number of parent-progeny exclusions for each locus as precentage all genotyped animals to exclude an SNP from the analysis (default = 10) OPTION excludeCHR n1 n2 n3 ... Exclude chromosomes n1 n2 n3 ... Map file need to be provided. See OPTION chrinfo OPTION sex_chr n Chromosomes greater than n are consider not autosomes. Checks for parent-progeny Mendeliand and HWE exclude sex chromosomes. Map file need to be provided. See OPTION chrinfo OPTION threshold_duplicate_samples x Set the threshold to issue a warning for possible duplicate samples if G(i,j) > x (default 0.9) OPTION threshold_diagonal_g [x] Check for big diagonals in genomic relationship matrix. If optional x is present set the threshold (default 1.6) OPTION saveCleanSNPs save clean genotype data OPTION no_quality_control This option turn off all quality control !!!! Qality Control for Off-diagonal of A22 and G ============================================ OPTION thrWarnCorAG x Set the threshold to issue a warning if cor(A22,G) < x (default 0.5) OPTION thrStopCorAG x Set the threshold to Stop the analysis if cor(A22,G) < x (default 0.3) OPTION thrCorAG x Set the threshold to calculate corr(A22,G) for only A22 >= x (default 0.02) Options for H =============== Options include different weights to to create GimA22i as tau inv(alpha G + beta A22 + gamma I + delta) - omega inv(A22) where the parameters are to scale the genomic info to be compatioble with the pedigree info, to make matrices invertible in the presence of clones, and to control bias. The defaults are: tau=1 alpha =0.95 beta = 0.05 gamma=0 delta=0 omega=1 Options to change these defaults are specified as: OPTION TauOmega tau omega OPTION AlphaBeta alpha beta OPTION GammaDelta gamma delta OPTION tunedG x Scale G based on A22. The variable x can be (default=2): 0: no scaling 1: mean(diag(G))=1, mean(offdiag(G))=0 2: mean(diag(G))=mean(diag(A22)), mean(offdiag(G))=mean(offdiag(A22)) 3: mean(G)=mean(A22) 4: rescale G as in Powell et al. (2010) or Vitezica et al. (2011); same as adjustment by Fst Misc options ============== OPTION nthreads Specify number of Threads to be used with MKL-OpenMP OPTION nthreadsiod Specify number of Threads to be used with MKL-OpenMP in blup90iod2 iterations Option used if Matrix-Vector multiplication in pcg are done with MKL subroutines OPTION graphics [s] Allows to generate Gnuplot graphics. Avoid using in batch programs. If present optional parameter s, set the time in seconds to show the plot. OPTION plotsnp Control the values of SNP effects to use in Manhatan plots 1 - plot regular SNP effects: abs(val) 2 - plot standarized SNP effecst: abs(val/sd) (default) OPTION SNP_moving_average [n] Solutions for SNP effects will be by moving averege of 30 adjacents SNPs. Optional parameter n set new windows size to calculate average. OPTION chrinfo Read SNP map information to do GWAS in postGSf90, mandatory columns (numeric) are: snp number, chr, position. Other alphanumeric field are optional, and if option saveCleanSNPs is present fields are output. First line in file corresponds to firs SNP in genotype file, etc. No specific order is neccessary. OPTION msg set level of verbose 0 none OPTION whichMatVecMult Control del method to multiply matrix-vector in blup90iod2 iterations 1 - mult from FSPAK90 2 - dsymv from LAPACK ( default, require link with MKL/ATLAS,etc, if not matmul will be used) 3 - dgemv as before 4 - dsymm as before 5 - dgemm as before 6 - matmul for timing comparison Save and Read options ======================= OPTION saveHinv Save H inverse matrix in Hinv.txt form: i,j,val with IDs renumbered by RENUMF90 OPTION saveHinvOrig As above but with original IDs OPTION saveDiagGOrig Save diagonal of G matrix in DiagGOrig.txt form: id, val with original IDs OPTION saveGOrig Save G matrix in G_Orig.txt form: id_i, id_j, val with original IDs OPTION saveA22Orig Save A22 matrix in A22_Orig.txt form: id_i, id_j, val with original IDs OPTION saveAscii Save files as ASCII (default=binary) OPTION savePLINK Save Genotyped in PLINK format; files: toPLINK.ped and toPLINK.map OPTION readG [file] OPTION readGInverse [file] OPTION readA22 [file] OPTION readA22Inverse [file] OPTION readGmA22 [file] OPTION readGimA22i [file] OPTION saveA22 OPTION saveA22Inverse OPTION saveG [all] if optional all is present all intermediate matrices for G are saved OPTION saveGInverse OPTION saveGmA22Inverse OPTION saveGmA22 These options are to create optional matrices for testing only. OPTION createG [0,1] OPTION createGInverse [0,1] OPTION createA22 [0,1] OPTION createA22Inverse [0,1] OPTION createGmA22 [0,1] OPTION createGimA22i [0,1]