PreGSF90
is an interface program to the genomic
module to process the genomic information for the BLUPF90
family of programs
This page also describes some options for PostGSF90
which is designed for genome-wide assocication study (GWAS).
Ignacio Aguilar and Ignacy Misztal, University of Georgia
email: iaguilar at inia.org.uy; ignacy at uga.edu
01/29/09 - 07/30/14
Program PreGSF90
helps to implement the genomic selection following the
single-step methodology as presented by
Aguilar et al. 2010 JDS.
In this methodology the relationship matrix A based on the pedigree information is replaced by
matrix H, which combines the pedigree and genomic information.
The main difference between A-1 and H-1 is matrix of structure
GimA22i=inv(G)-inv(A22),
where G is a genomic relationship matrix and
A22 is a relationship matrix for genotyped animals.
Efficient methods for the creation of the genomic relationship matrix, relationship based on pedigree and their inverses are described in Aguilar et al., 2011 JABG.
Program PreGSF90 could be run after RENUMf90
.
It is also run automatically by application programs like BLUPF90
, REMLF90
, GIBBSxF90
or BLUP90IOD
when their parameter file contains OPTION SNP_file filename
.
* Parameter file (ie renf90.par
) as created by RENUMF90
with genotype file specified with keyword SNP_FILE
* Genotype file
Fields need to be separated by at least one space and Field 2 should be fixed format i.e. all rows of genotypes *must* start at the same column number!!!
80 21101011002012011011010110111111211111210100 8014 21110101511101120221110111511112101112210100 516 21100101202252021120210121102111202212111101 181 21110111112201120550200020101022212211111100
using fractional genotypes is also possible, i.e. from imputation. In this case the genotypes must be “stick” together with two decimal places i.e.
80 2.001.001.000.001.00 8014 2.001.001.001.000.00 516 2.001.001.000.000.00 1032 0.501.120.251.502.00
where the last individual has “fractional” genotype 0.50 1.12 0.25 1.50 2.00
.
An utility program (illumina2pregs) is available to converts Illumina FinalReport and SNP_Map.txt into such format.
Some useful options:
OPTION missingAIPL
will read and convert genotype codes 3
and 4
as missing, i.e. internally they are read and converted to 5
.OPTION QMSim
will read and convert genotype codes 3
and 4
as 1
, i.e. heterozygote. This is useful e.g. if QMSim is used to produce the simulations.\OPTION fastread
will use C's library zlib for faster reading (useful for *very* large number of genotypes). This library is always available in all systems and in that case it defaults to standard reading.* Renumbered ID for genotypes
File gt.XrefID, where “gt” is genotype file.
This file is created by RENUMF90
.
Contains IDs renumbered sequentially (by RENUMF90
) and the original IDs.
Rows of this file should be in the same order as in the genotype file !!!
1732 80 8474 8014 406 516 9441 181
Pedigree file from RENUMF90
1732 11010 10584 1 3 12 1 0 0 80 8474 8691 9908 1 3 12 1 0 0 8014 406 8691 9825 1 3 12 1 0 2 516 9441 8691 8829 1 3 12 1 0 0 181
* Optional files:
OPTION FreqFile
)OPTION map_file
)OPTION weightedG
)As specified by respective OPTION lines.
OPTION map_file file
Read SNP map information from file.
Useful for check for Mendelian conflicts and HWE (with also OPTION sex_chr
) and for GWAS (see PostGSF90
program)
The file should have a header with the following column names:
SNP_ID - identification of the SNP (alphanumeric)
CHR - chromosome number (numeric), starting from 1
POS - position bp (numeric)
Extra columns are possible
The first SNP in the Map file corresponds to the first SNP in the genotype file, and so on.
Other alphanumeric fields are optional.
If OPTION saveCleanSNPs
is present fields are output.
GimA22i
By default PreGSf90
always create this file in binary format for use by later programs specifying OPTION readGimA22i
.
With OPTION saveAscii
this file can be stored as ASCII.
Format: i,j, inv(G)-inv(A22)
freqdata.count
contains allele frequencies in the original genotype file, previous any QC.
format: snp number (related to the genotype file) and allele frequency
freqdata.count.after.clean
contains allele frequencies as used in calculations.
format: snp number (related to the genotype file), allele frequency and code of exclusion.
Codes:
Gen_call_rate
contains list of animals excluded with call rate below the threshold.
Gen_conflicts
contains report of animals with Mendelian conflicts with their parents.
The first part of the file includes:
The second part of the file includes:
Optional
The program can store files such as G or its inverse,
A22 or its inverse, or other reports from QC as specified by their respective OPTION lines.
The genomic relationship matrix can be created in different ways.
OPTION whichG x
specifies how G is created. The variable x can be:
OPTION whichfreq x
specifies what frequencies are used to create G in G=ZDZ'/k. The same frequencies are used to center Z and to scale in D. The variable x can be:
freqdata
or from other file as specified using OPTION FreqFile
OPTION whichfreqScale x
Specifies what frequencies are used to scale G in G=ZDZ'/k. Use this option if, for instance, want to use 0.5 for centering (using option above) but observed 2pq for scaling. The variable x can be:
OPTION FreqFile
If option whichfreq is used, whichfreqScale will be considered only if it is read later in the parameter file
OPTION FreqFile file
Reads allele frequencies from a file, for example based on allele frequencies calculated by estfreq.f90 (VanRaden, 2009)
format : snp, freq
where snp corresponds to the index of SNP based on the same order that are in the genotype file.
If whichfreq
is set to 0 the default file name is freqdata
OPTION whichScale x
specifies how G is scaled. The variable x can be:
OPTION weightedG file
Reads weights from file to create weighted genomic relationship.
Weighting Z*= Z sqrt(D) ⇒ G = Z*Z*' = ZDZ'.
format: one column of weights in the same order as in the genotyped file.
Weights can be extracted from the output of PostGSF90
.
OPTION maxsnp x
Set the maximum length of string for reading marker data from file, only necessary if greater than default (400,000)
By default the following QC are run:
Parameters can be modified with the following options:
OPTION minfreq x
ignores all SNP with MAF < x,
default value 0.05
OPTION callrate x
ignores SNP with call rates < x (number of calls/number of individuals with genotypes)
default value 0.90
OPTION callrateAnim x
ignores Genotypes with call rates < x (number of calls/number of SNPs),
default value 0.90
OPTION monomorphic x
ignores monomorphic SNPs Optional parameter x can be used to enable (1) or disable (0) the check, default value 1
OPTION hwe x
check departure of heterozygous from Hardy-Weinberg Equilibrium.
By default, this QC is not run.
Optional parameter x
set the maximum difference between observed and expected frequency
the default value is 0.15 as used in Wiggans et al., 2009 JDS
OPTION high_correlation x y
check for high correlated SNP
By default this QC is not run.
Optional parameter x set the maximum difference in allele frequency to check pair of locus.
If no value 0.025 is used. Decrease this value to speed up calculation
A pair of locus is considered highly correlated if all the genotypes are the same (0-0, 1-1, 2-2) or the opposite (0-2, 1-1, 2-0) (Wiggans et al 2009 JDS)
Optional parameter y can be used to set a threshold to check the number of identical samples out of the number of genotypes.
default values x=0.025 y=0.995
OPTION verify_parentage x
Verify parent-progeny Mendelian conflicts.
Write report to file Gen_conflicts
Option x could be:
OPTION exclusion_threshold x
Number of parent-progeny exclusions as percentage all SNP to determine the wrong relationship.
default value 1
OPTION exclusion_threshold_snp x
Number of parent-progeny exclusions for each locus as a percentage, of pair of genotyped animals evaluated, to exclude an SNP from the analysis
default value 10
OPTION number_parent_progeny_evaluations x
Set the number of minimum pair of parent-progeny evaluations to exclude SNPs due to parent-progeny exclusions
default value 100
OPTION outparent_progeny
Create a full log file (Gen_conflicts_all
) with all pairs of parent-progeny tested for Mendelian conflicts.
OPTION out_snp_exclusion_error_rate
Create a log file (SNP_Mendelian_error_rate
) with statistics for SNP Mendelian error rate.
OPTION excludeCHR n1 n2 n3 ...
Exclude all SNP from chromosomes n1 n2 n3 …
Map file need to be provided. See OPTION map_file
OPTION includeCHR n1 n2 n3 ...
Include all SNP from chromosomes n1 n2 n3 …
Map file need to be provided. See OPTION map_file
OPTION excludeSample n1 n2 n3 ...
Exclude genotype samples n1 n2 n3 …
n1, n2… are position number of individuals in the genotype file.
OPTION sex_chr n
Chromosomes with a number greater or equal to n are not considered as autosomes.
If selected this option, sex chromosomes will not be used for checking parent-progeny Mendelian conflicts, HWE and heritability of gene content
but they will be included for all remaining processes. If you want to remove sex chromosomes, which we do recommend, use OPTION excludeCHR
.
Map file need to be provided. See OPTION map_file
. However, note that sex chromosomes are not excluded
OPTION threshold_duplicate_samples x
Set the threshold to issue a warning for possible duplicate samples if G(i,j)/sqrt(G(i,i)*G(j,j)) > x
default value 0.9
OPTION high_threshold_diagonal_g x
Check and provide a list of individuals with extreme high diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 1.6
OPTION low_threshold_diagonal_g x
Check and provide a list of individuals with extreme low diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 0.7
OPTION plotpca <print/noprint>
Plot the first two principal components to look for stratification in the population. With noprint the program will save the first two principal components of G without showing the PCA plot on the screen; otherwise (print) it will print on the screen.
OPTION extra_info_pca file col
Reads from file the column col to plot with different colors for different classes.
The file should contain at least one variable with different classes for each genotyped individual, and the order should match the order of the genotypes file.
Variables could be alphanumeric and separated by one o more spaces.
OPTION calculate_LD
Calculate LD as the squared correlation of allele counts for two SNP
Results are stored in “ld_results”, columns: snp_i, chr_i, pos_i, freq_i, snp_j, chr_j, pos_j,freq_j, dist_ij, Rsq_ij
OPTION LD_by_chr
Calculate LD within chromosome
OPTION LD_by_pos x
Calculate LD within chromosome and windows of SNP based on position. Optional parameter x define with windows size in Bp, default value 200000
OPTION filter_by_LD x
Filter SNP with Rsq > threshold. Optional parameter x define the threshold. default value 0.8
OPTION thr_output_LD x
Threshold to print out Rsq between pair of SNP Optional parameter x define the threshold. default value 0.1
OPTION saveCleanSNPs
Save clean genotype data with excluded SNP and animals; based on the OPTIONS specified.
*_clean
files are created:
gt_clean
gt_clean_XrefID
*_removed
files are created:
gt_SNPs_removed
gt_Animals_removed
with gt being the genotype file
OPTION no_quality_control
This option turns off all quality control !!!!
Useful to speed up run when previous QC data was performed.
OPTION outcallrate
This option will print all call rate information for SNP and individuals.
The files callrate
(SNP) and callrate_a
(Individuals) are created.
OPTION h2_gene_content
It checks that the heritability of gene content is equal or close to 1 as described in
Forneris et al. Genetics 199.3 (2015): 675-681. Markers with estimated h2<0.98 and significant p-values of the LRT (p<0.01) are discarded. In addition, heritability and status of each marker are written in file h2_gc_test
.
The test is useful for homogenous populations (breeds) but theory does not hold for crossbred animals. This test uses explicitly inv(A22) so it is not suitable for very large populations.
OPTION thrWarnCorAG x
Set the threshold to issue a warning if cor(A22,G) < x
default value = 0.5
OPTION thrStopCorAG x
Set the threshold to Stop the analysis if cor(A22,G) < x
default value = 0.3
OPTION thrCorAG x
Set the threshold to calculate corr(A22,G) for only A22 >= x
default value = 0.02
Options include different weights to create GimA22i as:
tau inv(alpha G + beta A22 + gamma I + delta) - omega inv(A22)
where the parameters are to scale the genomic info to be compatible with
the pedigree info, to make matrices invertible in the presence of clones,
and to control bias.
The default values are:
tau=1 alpha =0.95 beta = 0.05 gamma=0 delta=0 omega=1
Options to change these defaults are specified with:
OPTION TauOmega tau omega
OPTION AlphaBeta alpha beta
OPTION GammaDelta gamma delta
OPTION tunedG x
Scale G based on A22.
The variable x
can be:
OPTION tunedG 9 a b
.The diagonal of H contains an improved estimator of inbreeding: $\mathbf{F}_H = diag(\mathbf{H})-1$ . For genotyped animals, the diagonal of H is identical to the diagonal of G , and thus $\mathbf{F}_H=\mathbf{F}_G$ , sometimes called Genomic Inbreeding. The elements of $\mathbf{F}_H$ for non-genotyped animals include pedigree-based estimates of Genomic Inbreeding. See Colleau et al 2017 and Legarra et al. 2019. The last contains a description of the underlying methods.
To extract the diagonal of H one of these two OPTION
s needs to be used:
OPTION saveDiagH
outputs $diag(\mathbf{H})$ with renumbered id'sOPTION saveDiagHOrig
outputs $diag(\mathbf{H})$ with original and renumbered id'sUser can use one of two equivalent methods :
OPTION methodDiagH 1
, does a sparse inversion of $\mathbf{H}^{-1}$ (default) . This option is very fast for small to medium pedigrees.OPTION methodDiagH 2
, this is in fact Method 3 in Legarra et al. 2019, an outer product method that uses $\mathbf{M}=\mathbf{A}_{22}^{-1}(\mathbf{G}-\mathbf{A}_{22})\mathbf{A}_{22}^{-1}$. This method is recommended for large pedigrees as it is (for large pedigrees) less time and memory consuming.
The output depends on the method used. OPTION methodDiagH 1
shows only individual id and $diag(\mathbf{H})$. OPTION methodDiagH 2
shows individual id and the values of $diag(\mathbf{H})$, $diag(\mathbf{A})$ (pedigree-based relationship) and the difference $c$ such that $diag(\mathbf{H})=diag(\mathbf{A})+c$.
An example of the output obtained with OPTION methodDiagH 2
and OPTION saveDiagHOrig
has original_id, $diag(\mathbf{H})$, $diag(\mathbf{A})$, and the difference $c=diag(\mathbf{H})-diag(\mathbf{A})$, and the renumbered_id:
testDiagH2_mf andres$ head diagHdirect.txt.2 45036060023 1.179829759235491 1.250322947770358 -0.070493188534867 1 64000169880047 1.126222691080622 1.220500000000000 -0.094277308919378 2 64000246030053 1.168573237141459 1.237528320312500 -0.068955083171041 3 45038980011 1.189937645185136 1.251295679897070 -0.061358034711934 4
OPTION Manhattan_plot
Uses GNUPLOT to plot the Manhattan plot (SNP effects) for each trait and correlated effect.
OPTION Manhattan_plot_R
Uses R to plot the Manhattan plot (SNP effects) for each trait and correlated effect.
pdf
images are created: manplot_St1e2.pdf, but other formats can be specified.
Note: t1e2 corresponds to trait 1, effect 2.
OPTION Manhattan_plot_R_format <format>
Control the format type to create images in R
format
values accepted:
OPTION plotsnp <n>
Control the values of SNP effects to use in Manhattan plots
OPTION SNP_moving_average n
Solutions for SNP effects will be by moving average of n adjacents SNPs.
OPTION windows_variance n
Calculates the variance explained by n adjacents SNPs.
When this option is used, the sum of variance explained by n adjacent SNPs (column 8 of snp_sol or column 3 of chrsnpvar) is not 100%.
This is because moving variance is used. If windows size is 20, the proportion of variance assigned to SNP 1 is calculated from SNP 1 to 20, for SNP 2 it goes from 2 to 21, for SNP 3 it goes from 3 to 22, and so forth.
A file called windows_variance has variance that sums to 100% in column 9.
OPTION windows_variance_mbp n
Calculates the variance explained by n Mb window of adjacents SNPs.
OPTION windows_variance_type n
Sets windows type for variances calculations:
OPTION which_weight x
Generates a weight variable w to be used in the creation of a weighted genomic relationship matrix G=ZDZ'
where y is the SNP solution, with scaled weight = w * nSnp/sum(w); and C is 1.125 by default (enable to change it using the second argument of the option line e.g. OPTION which_weight nonlinearA 1.2
).
OPTION solutions_postGS x
Sets the file name for the solutions file, default solutions
OPTION postgs_trt_eff x1 x2
Computes postGS calculations (SNP solutions, variance explained, etc.) for only trait: x1 and effect: x2
OPTION snp_p_value
Computes p-values for GWAS from elements of the inverse of the Mixed Model Equations previously obtained from blupf90. This requires quite a lot of memory and time. For details see Aguilar et al. (2019).
OPTION snp_var
Creates a file with prediction error covariance (PEC) for SNP to be used in PREDF90 to compute reliability for indirect predictions. This option works when OPTION snp_p_value
is used in BLUPF90+.
snp_sol
contains solutions of SNP and weights
if OPTION windows_variance
is used
if OPTION snp_p_value
is used
chrsnp
contains data to create plot by GNUPLOT
chrsnp_pval
contains data to create plot by GNUPLOT
chrsnpvar
contains data to create plot by GNUPLOT
windows_segment
contains information of windows segments used to get variance explained
windows_variance
contains variance explained for the biggest non-overlapping windows segments
snp_pred
contains allele frequencies + SNP effects
Graphic control files
Several files are created to generate graphics using either GNUPLOT or R
File names rules:
e.g.: 'Sft1e2.R' first letter indicate 'S' for solutions of SNP 'V' for variance explained 'P' for p-values t1e2 indicates that the file is for the trait 1 and the effect 2 filename extension .gnuplot .R .pdf .png .tif
OPTION num_threads_pregs n
Specify the number of threads to be used with MKL-OpenMP for creating and inverting matrices
OPTION num_threads_iod n
Specify the number of threads to be used with MKL-OpenMP in BLUP90IOD
program
for matrix-vector multiplications in the PCG algorithm
OPTION graphics s
Allows to generate plots with GNUPLOT
program.
If present optional parameter s, set the time in seconds to show the plot.
Avoid using in batch programs !!!.
OPTION msg x
x set level of verbose; 0 minimal; 1 gives more diagnostic information
OPTION saveAscii
Saves files intermediate matrices (GimA22i,G,Gi,etc) files as ASCII (default=binary)
OPTION saveHinv
Saves H inverse matrix in Hinv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect
OPTION saveAinv
Saves A inverse matrix in Ainv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect
The following options use the information of the Original ID (alphanumeric) stored in the 10th column of the renaddxx.ped file, create by the RENUMF90
program.
OPTION saveHinvOrig
Saves the H inverse matrix with original IDs
OPTION saveAinvOrig
Saves the A inverse matrix with original IDs
OPTION saveDiagGOrig
Saves diagonal of G matrix in DiagGOrig.txt
Format: id, val
with id the original IDs
OPTION saveGOrig
Saves G matrix in G_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs
OPTION saveA22Orig
Saves A22 matrix in A22_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs
OPTION saveGimA22iOrig
Saves GimA22i matrix in GimA22i_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs as stored in the original pedigree file and in column 10 of renaddXX.ped
OPTION readOrigId
Reads information from renaddxx.ped file, Original ID, and possibly year of birth for its use in parent-progeny conflict output.
Only need if none of the previous save*Orig
is present.
OPTION saveGimA22iRen
Saves GimA22i matrix in GimA22i_Ren.txt
Format: id_i, id_j, val
with id_i and id_j the renumbered IDs as stored in the first column of renaddXX.ped
OPTION savePLINK
Saves Genotype in PLINK
format
files: toPLINK.ped and toPLINK.map
OPTION no_full_binary
Saves the elements of half-matrix instead of the full matrix. It is useful to keep the compatibility with the older version of preGSf90. The newer versions save the matrix in a more efficient way, where reading the information from the binary file is not trivial (i.e., not as i, j, val anymore).
OPTION readGimA22i file
This option is used in the application programs (BLUPF90,REMLF90, etc.) to use the information already stored in GimA22i
file (default filename).
In general, methods used to create and invert matrices in such programs don't use an optimized version.
For a large number of genotyped animals run first PreGSf90
and the read stored matrices in analyses programs.
Optional file can be used to specify a different filename (other than GimA22i
) or a path, for example OPTION readGimA22i ../../pregsrun/GimA22i
.
Other intermediate matrices files can be stored for inspection or for it use in BLUPF90 programs as user_file
type of random effect. See tricks and REMLF90 for details.
OPTION saveA22 OPTION saveA22Inverse OPTION saveA22InverseOrig OPTION saveG [all] if optional all is present all intermediate matrices for G are saved OPTION saveGInverse OPTION saveGInverseOrig OPTION saveGmA22 OPTION readG [file] OPTION readGInverse [file] OPTION readA22 [file] OPTION readA22Inverse [file] OPTION readGmA22 [file]
OPTION chrinfo file
This is deprecated. Use instead OPTION map_file
.
Read SNP map information from file.
Useful for check for Mendelian conflicts and HWE (with also OPTION sex_chr
) and for GWAS (see PostGSF90
program)
Format, all numeric variables: snp order , chromosome, position (bp).
snp order corresponds to the index number of the snp, in the sorted map by chromosome and position
chromosome should be only numbers and starting from 1
First line in file corresponds to first SNP in genotype file, and so on.
Other alphanumeric field are optional.
If OPTION saveCleanSNPs
is present fields are output.