Table of Contents

PreGSF90 / PostGSF90

PreGSF90 is an interface program to the genomic module to process the genomic information for the BLUPF90 family of programs

This page also describes some options for PostGSF90 which is designed for genome-wide assocication study (GWAS).

Ignacio Aguilar and Ignacy Misztal, University of Georgia
email: iaguilar at inia.org.uy; ignacy at uga.edu
01/29/09 - 07/30/14

Summary

Program PreGSF90 helps to implement the genomic selection following the single-step methodology as presented by Aguilar et al. 2010 JDS.
In this methodology the relationship matrix A based on the pedigree information is replaced by matrix H, which combines the pedigree and genomic information.

The main difference between A-1 and H-1 is matrix of structure
GimA22i=inv(G)-inv(A22),
where G is a genomic relationship matrix and
A22 is a relationship matrix for genotyped animals.

Efficient methods for the creation of the genomic relationship matrix, relationship based on pedigree and their inverses are described in Aguilar et al., 2011 JABG.
Program PreGSF90 could be run after RENUMf90.
It is also run automatically by application programs like BLUPF90, REMLF90, GIBBSxF90 or BLUP90IOD when their parameter file contains OPTION SNP_file filename.

Input files

* Parameter file (ie renf90.par) as created by RENUMF90 with genotype file specified with keyword SNP_FILE

* Genotype file

Fields need to be separated by at least one space and Field 2 should be fixed format i.e. all rows of genotypes *must* start at the same column number!!!

 80   21101011002012011011010110111111211111210100
 8014 21110101511101120221110111511112101112210100
 516  21100101202252021120210121102111202212111101
 181  21110111112201120550200020101022212211111100

using fractional genotypes is also possible, i.e. from imputation. In this case the genotypes must be “stick” together with two decimal places i.e.

 80   2.001.001.000.001.00
 8014 2.001.001.001.000.00
 516  2.001.001.000.000.00 
1032  0.501.120.251.502.00

where the last individual has “fractional” genotype 0.50 1.12 0.25 1.50 2.00.

An utility program (illumina2pregs) is available to converts Illumina FinalReport and SNP_Map.txt into such format.

Some useful options:

* Renumbered ID for genotypes

File gt.XrefID, where “gt” is genotype file.
This file is created by RENUMF90. Contains IDs renumbered sequentially (by RENUMF90) and the original IDs.
Rows of this file should be in the same order as in the genotype file !!!

1732 80                                              
8474 8014                                              
406 516                                              
9441 181

Pedigree file from RENUMF90

1732 11010 10584 1 3 12  1 0 0 80
8474  8691  9908 1 3 12  1 0 0 8014
 406  8691  9825 1 3 12  1 0 2 516
9441  8691  8829 1 3 12  1 0 0 181

* Optional files:

As specified by respective OPTION lines.

OPTION map_file file 

Read SNP map information from file.

Useful for check for Mendelian conflicts and HWE (with also OPTION sex_chr) and for GWAS (see PostGSF90 program)

The file should have a header with the following column names:
SNP_ID - identification of the SNP (alphanumeric)
CHR - chromosome number (numeric), starting from 1
POS - position bp (numeric)

Extra columns are possible

The first SNP in the Map file corresponds to the first SNP in the genotype file, and so on.

Other alphanumeric fields are optional.

If OPTION saveCleanSNPs is present fields are output.

Output files

GimA22i
By default PreGSf90 always create this file in binary format for use by later programs specifying OPTION readGimA22i.
With OPTION saveAscii this file can be stored as ASCII. Format: i,j, inv(G)-inv(A22)

freqdata.count
contains allele frequencies in the original genotype file, previous any QC.
format: snp number (related to the genotype file) and allele frequency

freqdata.count.after.clean
contains allele frequencies as used in calculations.
format: snp number (related to the genotype file), allele frequency and code of exclusion.

Codes:

Gen_call_rate
contains list of animals excluded with call rate below the threshold.

Gen_conflicts
contains report of animals with Mendelian conflicts with their parents. The first part of the file includes:

The second part of the file includes:

Optional
The program can store files such as G or its inverse,
A22 or its inverse, or other reports from QC as specified by their respective OPTION lines.

Options for creation of genomic relationship Matrix (G)

The genomic relationship matrix can be created in different ways.

OPTION whichG x

specifies how G is created. The variable x can be:

 OPTION whichfreq x

specifies what frequencies are used to create G in G=ZDZ'/k. The same frequencies are used to center Z and to scale in D. The variable x can be:

OPTION whichfreqScale x

Specifies what frequencies are used to scale G in G=ZDZ'/k. Use this option if, for instance, want to use 0.5 for centering (using option above) but observed 2pq for scaling. The variable x can be:

If option whichfreq is used, whichfreqScale will be considered only if it is read later in the parameter file

OPTION FreqFile  file 

Reads allele frequencies from a file, for example based on allele frequencies calculated by estfreq.f90 (VanRaden, 2009)
format : snp, freq where snp corresponds to the index of SNP based on the same order that are in the genotype file.

If whichfreq is set to 0 the default file name is freqdata

OPTION whichScale x

specifies how G is scaled. The variable x can be:

OPTION weightedG file 

Reads weights from file to create weighted genomic relationship.
Weighting Z*= Z sqrt(D) ⇒ G = Z*Z*' = ZDZ'.
format: one column of weights in the same order as in the genotyped file.
Weights can be extracted from the output of PostGSF90.

OPTION maxsnp x

Set the maximum length of string for reading marker data from file, only necessary if greater than default (400,000)

Quality Control (QC) for G

By default the following QC are run:

Parameters can be modified with the following options:

OPTION minfreq x

ignores all SNP with MAF < x,
default value 0.05

OPTION callrate x

ignores SNP with call rates < x (number of calls/number of individuals with genotypes)
default value 0.90

OPTION callrateAnim x

ignores Genotypes with call rates < x (number of calls/number of SNPs),
default value 0.90

OPTION monomorphic x

ignores monomorphic SNPs Optional parameter x can be used to enable (1) or disable (0) the check, default value 1

OPTION hwe x

check departure of heterozygous from Hardy-Weinberg Equilibrium.
By default, this QC is not run.
Optional parameter x set the maximum difference between observed and expected frequency
the default value is 0.15 as used in Wiggans et al., 2009 JDS

OPTION high_correlation x y

check for high correlated SNP
By default this QC is not run.
Optional parameter x set the maximum difference in allele frequency to check pair of locus.
If no value 0.025 is used. Decrease this value to speed up calculation
A pair of locus is considered highly correlated if all the genotypes are the same (0-0, 1-1, 2-2) or the opposite (0-2, 1-1, 2-0) (Wiggans et al 2009 JDS)
Optional parameter y can be used to set a threshold to check the number of identical samples out of the number of genotypes.
default values x=0.025 y=0.995

OPTION verify_parentage x

Verify parent-progeny Mendelian conflicts.
Write report to file Gen_conflicts
Option x could be:

OPTION exclusion_threshold x

Number of parent-progeny exclusions as percentage all SNP to determine the wrong relationship.
default value 1

OPTION exclusion_threshold_snp x

Number of parent-progeny exclusions for each locus as a percentage, of pair of genotyped animals evaluated, to exclude an SNP from the analysis
default value 10

OPTION number_parent_progeny_evaluations x

Set the number of minimum pair of parent-progeny evaluations to exclude SNPs due to parent-progeny exclusions
default value 100

OPTION outparent_progeny

Create a full log file (Gen_conflicts_all) with all pairs of parent-progeny tested for Mendelian conflicts.

OPTION out_snp_exclusion_error_rate

Create a log file (SNP_Mendelian_error_rate) with statistics for SNP Mendelian error rate.

OPTION excludeCHR n1 n2 n3 ...

Exclude all SNP from chromosomes n1 n2 n3
Map file need to be provided. See OPTION map_file

OPTION includeCHR n1 n2 n3 ...

Include all SNP from chromosomes n1 n2 n3
Map file need to be provided. See OPTION map_file

OPTION excludeSample n1 n2 n3 ...

Exclude genotype samples n1 n2 n3
n1, n2… are position number of individuals in the genotype file.

OPTION sex_chr n

Chromosomes with a number greater or equal to n are not considered as autosomes.
If selected this option, sex chromosomes will not be used for checking parent-progeny Mendelian conflicts, HWE and heritability of gene content
but they will be included for all remaining processes. If you want to remove sex chromosomes, which we do recommend, use OPTION excludeCHR.

Map file need to be provided. See OPTION map_file. However, note that sex chromosomes are not excluded

OPTION threshold_duplicate_samples x

Set the threshold to issue a warning for possible duplicate samples if G(i,j)/sqrt(G(i,i)*G(j,j)) > x
default value 0.9

OPTION high_threshold_diagonal_g x

Check and provide a list of individuals with extreme high diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 1.6

OPTION low_threshold_diagonal_g x

Check and provide a list of individuals with extreme low diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 0.7

OPTION plotpca <print/noprint>

Plot the first two principal components to look for stratification in the population. With noprint the program will save the first two principal components of G without showing the PCA plot on the screen; otherwise (print) it will print on the screen.

OPTION extra_info_pca file col

Reads from file the column col to plot with different colors for different classes.
The file should contain at least one variable with different classes for each genotyped individual, and the order should match the order of the genotypes file.
Variables could be alphanumeric and separated by one o more spaces.

OPTION calculate_LD

Calculate LD as the squared correlation of allele counts for two SNP

Results are stored in “ld_results”, columns: snp_i, chr_i, pos_i, freq_i, snp_j, chr_j, pos_j,freq_j, dist_ij, Rsq_ij

OPTION LD_by_chr

Calculate LD within chromosome

OPTION LD_by_pos x

Calculate LD within chromosome and windows of SNP based on position. Optional parameter x define with windows size in Bp, default value 200000

OPTION filter_by_LD x

Filter SNP with Rsq > threshold. Optional parameter x define the threshold. default value 0.8

OPTION thr_output_LD x

Threshold to print out Rsq between pair of SNP Optional parameter x define the threshold. default value 0.1

OPTION saveCleanSNPs

Save clean genotype data with excluded SNP and animals; based on the OPTIONS specified.
*_clean files are created:

*_removed files are created:

with gt being the genotype file

OPTION no_quality_control

This option turns off all quality control !!!!
Useful to speed up run when previous QC data was performed.

OPTION outcallrate

This option will print all call rate information for SNP and individuals.
The files callrate (SNP) and callrate_a (Individuals) are created.

OPTION h2_gene_content

It checks that the heritability of gene content is equal or close to 1 as described in Forneris et al. Genetics 199.3 (2015): 675-681. Markers with estimated h2<0.98 and significant p-values of the LRT (p<0.01) are discarded. In addition, heritability and status of each marker are written in file h2_gc_test.

The test is useful for homogenous populations (breeds) but theory does not hold for crossbred animals. This test uses explicitly inv(A22) so it is not suitable for very large populations.

Quality Control for Off-diagonal of A22 and G

OPTION thrWarnCorAG x

Set the threshold to issue a warning if cor(A22,G) < x
default value = 0.5

OPTION thrStopCorAG x

Set the threshold to Stop the analysis if cor(A22,G) < x
default value = 0.3

OPTION  thrCorAG x

Set the threshold to calculate corr(A22,G) for only A22 >= x
default value = 0.02

Options for H

Options include different weights to create GimA22i as:

tau inv(alpha G + beta A22 + gamma I + delta) - omega inv(A22)

where the parameters are to scale the genomic info to be compatible with
the pedigree info, to make matrices invertible in the presence of clones,
and to control bias.

The default values are:

tau=1 alpha =0.95 beta = 0.05 gamma=0 delta=0 omega=1

Options to change these defaults are specified with:

OPTION TauOmega tau omega
OPTION AlphaBeta alpha beta
OPTION GammaDelta gamma delta
OPTION tunedG x

Scale G based on A22.
The variable x can be:

Options to extract the diagonal of H (aka genomic improved inbreeding)

The diagonal of H contains an improved estimator of inbreeding: $\mathbf{F}_H = diag(\mathbf{H})-1$ . For genotyped animals, the diagonal of H is identical to the diagonal of G , and thus $\mathbf{F}_H=\mathbf{F}_G$ , sometimes called Genomic Inbreeding. The elements of $\mathbf{F}_H$ for non-genotyped animals include pedigree-based estimates of Genomic Inbreeding. See Colleau et al 2017 and Legarra et al. 2019. The last contains a description of the underlying methods.

To extract the diagonal of H one of these two OPTIONs needs to be used:

User can use one of two equivalent methods :

The output depends on the method used. OPTION methodDiagH 1 shows only individual id and $diag(\mathbf{H})$. OPTION methodDiagH 2 shows individual id and the values of $diag(\mathbf{H})$, $diag(\mathbf{A})$ (pedigree-based relationship) and the difference $c$ such that $diag(\mathbf{H})=diag(\mathbf{A})+c$.

An example of the output obtained with OPTION methodDiagH 2 and OPTION saveDiagHOrig has original_id, $diag(\mathbf{H})$, $diag(\mathbf{A})$, and the difference $c=diag(\mathbf{H})-diag(\mathbf{A})$, and the renumbered_id:

testDiagH2_mf andres$ head diagHdirect.txt.2
45036060023                1.179829759235491  1.250322947770358 -0.070493188534867          1
64000169880047             1.126222691080622  1.220500000000000 -0.094277308919378          2
64000246030053             1.168573237141459  1.237528320312500 -0.068955083171041          3
45038980011                1.189937645185136  1.251295679897070 -0.061358034711934          4

GWAS options (PostGSF90)

OPTION Manhattan_plot

Uses GNUPLOT to plot the Manhattan plot (SNP effects) for each trait and correlated effect.

OPTION Manhattan_plot_R

Uses R to plot the Manhattan plot (SNP effects) for each trait and correlated effect.
pdf images are created: manplot_St1e2.pdf, but other formats can be specified.
Note: t1e2 corresponds to trait 1, effect 2.

OPTION Manhattan_plot_R_format <format>

Control the format type to create images in R
format values accepted:

OPTION plotsnp <n> 

Control the values of SNP effects to use in Manhattan plots

OPTION SNP_moving_average n

Solutions for SNP effects will be by moving average of n adjacents SNPs.

OPTION windows_variance n

Calculates the variance explained by n adjacents SNPs.

When this option is used, the sum of variance explained by n adjacent SNPs (column 8 of snp_sol or column 3 of chrsnpvar) is not 100%.
This is because moving variance is used. If windows size is 20, the proportion of variance assigned to SNP 1 is calculated from SNP 1 to 20, for SNP 2 it goes from 2 to 21, for SNP 3 it goes from 3 to 22, and so forth.
A file called windows_variance has variance that sums to 100% in column 9.

OPTION windows_variance_mbp n

Calculates the variance explained by n Mb window of adjacents SNPs.

OPTION windows_variance_type n

Sets windows type for variances calculations:

OPTION which_weight x

Generates a weight variable w to be used in the creation of a weighted genomic relationship matrix G=ZDZ'

where y is the SNP solution, with scaled weight = w * nSnp/sum(w); and C is 1.125 by default (enable to change it using the second argument of the option line e.g. OPTION which_weight nonlinearA 1.2).

OPTION solutions_postGS x 

Sets the file name for the solutions file, default solutions

OPTION postgs_trt_eff x1 x2

Computes postGS calculations (SNP solutions, variance explained, etc.) for only trait: x1 and effect: x2

OPTION snp_p_value

Computes p-values for GWAS from elements of the inverse of the Mixed Model Equations previously obtained from blupf90. This requires quite a lot of memory and time. For details see Aguilar et al. (2019).

OPTION snp_var

Creates a file with prediction error covariance (PEC) for SNP to be used in PREDF90 to compute reliability for indirect predictions. This option works when OPTION snp_p_value is used in BLUPF90+.

Output files for GWAS (postGSf90)

snp_sol

contains solutions of SNP and weights

if OPTION windows_variance is used

if OPTION snp_p_value is used

chrsnp

contains data to create plot by GNUPLOT

chrsnp_pval

contains data to create plot by GNUPLOT

chrsnpvar

contains data to create plot by GNUPLOT

windows_segment

contains information of windows segments used to get variance explained

windows_variance

contains variance explained for the biggest non-overlapping windows segments

snp_pred

contains allele frequencies + SNP effects

Graphic control files

Several files are created to generate graphics using either GNUPLOT or R
File names rules:

e.g.: 'Sft1e2.R'

first letter indicate
  'S' for solutions of SNP 
  'V' for variance explained
  'P' for p-values
  
t1e2
  indicates that the file is for the trait 1 and the effect 2
  
filename extension
   .gnuplot
   .R
   .pdf 
   .png
   .tif

Misc options

OPTION num_threads_pregs n

Specify the number of threads to be used with MKL-OpenMP for creating and inverting matrices

OPTION num_threads_iod n

Specify the number of threads to be used with MKL-OpenMP in BLUP90IOD program
for matrix-vector multiplications in the PCG algorithm

OPTION graphics s

Allows to generate plots with GNUPLOT program.
If present optional parameter s, set the time in seconds to show the plot.
Avoid using in batch programs !!!.

OPTION msg x

x set level of verbose; 0 minimal; 1 gives more diagnostic information

Save and Read options

OPTION saveAscii

Saves files intermediate matrices (GimA22i,G,Gi,etc) files as ASCII (default=binary)

OPTION saveHinv

Saves H inverse matrix in Hinv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect

OPTION saveAinv

Saves A inverse matrix in Ainv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect

The following options use the information of the Original ID (alphanumeric) stored in the 10th column of the renaddxx.ped file, create by the RENUMF90 program.

OPTION saveHinvOrig

Saves the H inverse matrix with original IDs

OPTION saveAinvOrig

Saves the A inverse matrix with original IDs

OPTION saveDiagGOrig

Saves diagonal of G matrix in DiagGOrig.txt
Format: id, val
with id the original IDs

OPTION saveGOrig

Saves G matrix in G_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs

OPTION saveA22Orig

Saves A22 matrix in A22_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs

OPTION saveGimA22iOrig 

Saves GimA22i matrix in GimA22i_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs as stored in the original pedigree file and in column 10 of renaddXX.ped

OPTION readOrigId

Reads information from renaddxx.ped file, Original ID, and possibly year of birth for its use in parent-progeny conflict output.
Only need if none of the previous save*Orig is present.

OPTION saveGimA22iRen 

Saves GimA22i matrix in GimA22i_Ren.txt
Format: id_i, id_j, val
with id_i and id_j the renumbered IDs as stored in the first column of renaddXX.ped

OPTION savePLINK

Saves Genotype in PLINK format
files: toPLINK.ped and toPLINK.map

OPTION no_full_binary

Saves the elements of half-matrix instead of the full matrix. It is useful to keep the compatibility with the older version of preGSf90. The newer versions save the matrix in a more efficient way, where reading the information from the binary file is not trivial (i.e., not as i, j, val anymore).

Save and Read intermediate files

OPTION readGimA22i file 

This option is used in the application programs (BLUPF90,REMLF90, etc.) to use the information already stored in GimA22i file (default filename).
In general, methods used to create and invert matrices in such programs don't use an optimized version.
For a large number of genotyped animals run first PreGSf90 and the read stored matrices in analyses programs.

Optional file can be used to specify a different filename (other than GimA22i) or a path, for example OPTION readGimA22i ../../pregsrun/GimA22i.

Other intermediate matrices files can be stored for inspection or for it use in BLUPF90 programs as user_file type of random effect. See tricks and REMLF90 for details.

OPTION saveA22 
OPTION saveA22Inverse 
OPTION saveA22InverseOrig
OPTION saveG [all]
       if optional all is present all intermediate matrices for G are saved
OPTION saveGInverse 
OPTION saveGInverseOrig
OPTION saveGmA22 

OPTION readG [file]
OPTION readGInverse [file]
OPTION readA22 [file]
OPTION readA22Inverse [file]
OPTION readGmA22 [file]          

DEPRECATED OPTIONS

OPTION chrinfo file 


This is deprecated. Use instead OPTION map_file.

Read SNP map information from file.

Useful for check for Mendelian conflicts and HWE (with also OPTION sex_chr) and for GWAS (see PostGSF90 program)

Format, all numeric variables: snp order , chromosome, position (bp).

snp order corresponds to the index number of the snp, in the sorted map by chromosome and position

chromosome should be only numbers and starting from 1

First line in file corresponds to first SNP in genotype file, and so on.

Other alphanumeric field are optional.

If OPTION saveCleanSNPs is present fields are output.