This is an old revision of the document!

PreGSF90 / PostGSF90

PreGSF90 is an interface program to the genomic module to process the genomic information for the BLUPF90 family of programs

This page also describes some options for PostGSF90 which is designed for genome-wide assocication study (GWAS).

Ignacio Aguilar and Ignacy Misztal, University of Georgia
email: iaguilar at inia.org.uy; ignacy at uga.edu
01/29/09 - 07/30/14

Summary

Program PreGSF90 helps to implement the genomic selection following the single-step methodology as presented by Aguilar et al. 2010 JDS.
In this methodology the relationship matrix A based on the pedigree information is replaced by matrix H, which combines the pedigree and genomic information.

The main difference between A^-1 and H^-1 is matrix of structure
GimA22i=inv(G)-inv(A₂₂),
where G is a genomic relationship matrix and
A₂₂ is a relationship matrix for genotyped animals.

Efficient methods for the creation of the genomic relationship matrix, relationship based on pedigree and their inverses are described in Aguilar et al., 2011 JABG.
Program PreGSF90 could be run after RENUMf90.
It is also run automatically by application programs like BLUPF90, REMLF90, GIBBSxF90 or BLUP90IOD when their parameter file contains OPTION SNP_file filename.

Input files

* Parameter file (ie renf90.par) as created by RENUMF90 with genotype file specified with keyword SNP_FILE

* Genotype file

Field 1 - animal ID with format as in pedigree file
Field 2 - genotype with 0,1,2 and 5 (missing) or real values for gene content 0.12 …

Fields need to be separated by at least one space and Field 2 should be fixed format i.e. all rows of genotypes should start at the same column number!!!

 80   21101011002012011011010110111111211111210100
 8014 21110101511101120221110111511112101112210100
 516  21100101202252021120210121102111202212111101
 181  21110111112201120550200020101022212211111100

An utility program (illumina2pregs) is available to converts Illumina FinalReport and SNP_Map.txt in such format.

* Renumbered ID for genotypes

File gt.XrefID, where “gt” is genotype file.
This file is created by RENUMF90. Contains IDs renumbered sequentially (by RENUMF90) and the original IDs.
Rows of this file should be in the same order as in the genotype file !!!

Pedigree file from RENUMF90

1732 11010 10584 1 3 12  1 0 0 80
8474  8691  9908 1 3 12  1 0 0 8014
 406  8691  9825 1 3 12  1 0 2 516
9441  8691  8829 1 3 12  1 0 0 181

* Optional files:

Allele Frequencies (OPTION FreqFile)
Map file (OPTION map_file)
Weight file (OPTION weightedG)
G or its inverse, A22 or its inverse, etc,

As specified by respective OPTION lines.

OPTION map_file file

Read SNP map information from file.

Useful for check for Mendelian conflicts and HWE (with also OPTION sex_chr) and for GWAS (see PostGSF90 program)

The file should have a header with the following column names:
SNP_ID - identification of the SNP (alphanumeric)
CHR - chromosome number (numeric), starting from 1
POS - position bp (numeric)

Extra columns are possible

The first SNP in the Map file corresponds to the first SNP in the genotype file, and so on.

Other alphanumeric fields are optional.

If OPTION saveCleanSNPs is present fields are output.

Output files

GimA22i
By default PreGSf90 always create this file in binary format for use by later programs specifying OPTION readGimA22i.
With OPTION saveAscii this file can be stored as ASCII. Format: i,j, inv(G)-inv(A22)

freqdata.count
contains allele frequencies in the original genotype file, previous any QC.
format: snp number (related to the genotype file) and allele frequency

freqdata.count.after.clean
contains allele frequencies as used in calculations.
format: snp number (related to the genotype file), allele frequency and code of exclusion.

Codes:

1: Call Rate
2: MAF
3: Monomorphic
4: Excluded by request
5: Mendelian error
6: HWE
7: High Correlation with other(s) SNP

Gen_call_rate
contains list of animals excluded with call rate below the threshold.

Gen_conflicts
contains report of animals with Mendelian conflicts with their parents. The first part of the file includes:

1: a pair of animals compared
2: the progeny’s pedigree ID
3: the parent’s pedigree ID
4: the number of conflicts
5: proportion of markers conflicted

The second part of the file includes:

1: the genotyped animal’s position (row number) in the SNP file
2: the corresponding pedigree ID
3: the number of conflicts with this animal as a sire
4: the total number of comparisons with this animal as a sire
5: the number of conflicts with this animal as a dam
6: the total number of comparisons with this animal as a dam
7: the number of conflicts with this animal an individual (“the animal”)
8: the total number of comparisons with this animal as an individual.

Optional
The program can store files such as G or its inverse,
A22 or its inverse, or other reports from QC as specified by their respective OPTION lines.

Options for creation of genomic relationship Matrix (G)

The genomic relationship matrix can be created in different ways.

OPTION whichG x

specifies how G is created. The variable x can be:

1: G=ZZ'/k ; as in VanRaden, 2008 (default)
2: G=ZDZ'/n ; as in Amin et al., 2007; Leuttenger et al., 2003; where D=1/2p(1-p)
3: As 2 with modification UAR from Yang et al 2010

 OPTION whichfreq x

specifies what frequencies are used to create G in G=ZDZ'. The same frequencies are used to center Z and to scale in D. The variable x can be:

0: read from file freqdata or from other file as specified using OPTION FreqFile
1: 0.5
2: current calculated from genotypes(default)

OPTION whichfreqScale x

Specifies what frequencies are used to scale G in G=ZDZ'. Use this option if, for instance, want to use 0.5 for centering (using option above) but observed 2pq for scaling. The variable x can be:

0: read from file as specified in OPTION FreqFile
1: 0.5
2: current calculated from genotypes(default)

If option whichfreq is used, whichfreqScale will be considered only if it is read later in the parameter file

OPTION FreqFile  file

Reads allele frequencies from a file, for example based on allele frequencies calculated by estfreq.f90 (VanRaden, 2009)
format : snp, freq where snp corresponds to the index of SNP based on the same order that are in the genotype file.

If whichfreq is set to 0 the default file name is freqdata

OPTION whichScale x

specifies how G is scaled. The variable x can be:

1: 2Σ(p(1-p)) - VanRaden 2008 (default)
2: trace(ZZ')/n - Legarra 2009, Hayes 2009
3: correction - Gianola et al 2009

OPTION weightedG file

Reads weights from file to create weighted genomic relationship.
Weighting Z*= Z sqrt(D) ⇒ G = Z*Z*' = ZDZ'.
format: one column of weights in the same order as in the genotyped file.
Weights can be extracted from output of PostGSF90 program.

OPTION maxsnp x

Set the maximum length of string for reading marker data from file, only necessary if greater than default (400,000)

Quality Control (QC) for G

By default the following QC are run:

MAF
Call rate (SNPs & animals)
Monomorphic
Parent-progeny conflicts (SNPs & animals)

Parameters can be modified with the following options:

OPTION minfreq x

ignores all SNP with MAF < x,
default value 0.05

OPTION callrate x

ignores SNP with call rates < x (number of calls/number of individuals with genotypes)
default value 0.90

OPTION callrateAnim x

ignores Genotypes with call rates < x (number of calls/number of SNPs),
default value 0.90

OPTION monomorphic x

ignores monomorphic SNPs Optional parameter x can be used to enable (1) or disable (0) the check, default value 1

OPTION hwe x

check departure of heterozygous from Hardy-Weinberg Equilibrium.
By default, this QC is not run.
Optional parameter x set the maximum difference between observed and expected frequency
the default value is 0.15 as used in Wiggans et al., 2009 JDS

OPTION high_correlation x y

check for high correlated SNP
By default this QC is not run.
Optional parameter x set the maximum difference in allele frequency to check pair of locus.
If no value 0.025 is used. Decrease this value to speed up calculation
A pair of locus is considered highly correlated if all the genotypes are the same (0-0, 1-1, 2-2) or the opposite (0-2, 1-1, 2-0) (Wiggans et al 2009 JDS)
Optional parameter y can be used to set a threshold to check the number of identical samples out of the number of genotypes.
default values x=0.025 y=0.995

OPTION verify_parentage x

Verify parent-progeny Mendelian conflicts.
Write report to file Gen_conflicts
Option x could be:

0: no action
1: only detect
2: detect and search for an alternate parent (Not implemented). This option is now implemented in the SeekParentF90 program
3: detect and eliminate progenies with conflicts (default)

OPTION exclusion_threshold x

Number of parent-progeny exclusions as percentage all SNP to determine the wrong relationship.
default value 1

OPTION exclusion_threshold_snp x

Number of parent-progeny exclusions for each locus as a percentage, of pair of genotyped animals evaluated, to exclude an SNP from the analysis
default value 10

OPTION number_parent_progeny_evaluations x

Set the number of minimum pair of parent-progeny evaluations to exclude SNPs due to parent-progeny exclusions
default value 100

OPTION outparent_progeny

Create a full log file (Gen_conflicts_all) with all pairs of parent-progeny tested for Mendelian conflicts.

OPTION excludeCHR n1 n2 n3 ...

Exclude all SNP from chromosomes n1 n2 n3 …
Map file need to be provided. See OPTION map_file

OPTION includeCHR n1 n2 n3 ...

Include all SNP from chromosomes n1 n2 n3 …
Map file need to be provided. See OPTION map_file

OPTION excludeSample n1 n2 n3 ...

Exclude genotype samples n1 n2 n3 …
n1, n2… are position number of individuals in the genotype file.

OPTION sex_chr n

Chromosomes number equal or greater than n are not consider autosomes.
If selected this option, sex chromosomes will not be used for checking parent-progeny Mendelian conflicts, HWE and heritability of gene content
but they will be included for all remaining processes. If you want to remove sex chromosomes, which we do recommend, use OPTION excludeCHR.

Map file need to be provided. See OPTION map_file. However, note that sex chromosomes are not excluded

OPTION threshold_duplicate_samples x

Set the threshold to issue a warning for possible duplicate samples if G(i,j)/sqrt(G(i,i)*G(j,j)) > x
default value 0.9

OPTION high_threshold_diagonal_g x

Check and remove individuals with extreme high diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 1.6

OPTION low_threshold_diagonal_g x

Check and remove individuals with extreme low diagonals in the genomic relationship matrix.
If optional x is present set the threshold
default value 0.7

OPTION plotpca

Plot the first two principal components to look for stratification in the population.

OPTION extra_info_pca file col

Reads from file the column col to plot with different colors for different classes.
The file should contain at least one variable with different classes for each genotyped individual, and the order should match the order of the genotypes file.
Variables could be alphanumeric and separated by one o more spaces.

OPTION calculate_LD

Calculate LD as the squared correlation of allele counts for two SNP

Results are stored in “ld_results”, columns: snp_i, chr_i, pos_i, freq_i, snp_j, chr_j, pos_j,freq_j, dist_ij, Rsq_ij

OPTION LD_by_chr

Calculate LD within chromosome

OPTION LD_by_pos x

Calculate LD within chromosome and windows of SNP based on position optional parameter x define with windows size in Bp, default value 200000

OPTION filter_by_LD x

Filter SNP with Rsq > threshold. Optional parameter x define the threshold. default value 0.8

OPTION thr_output_LD x

Threshold to print out Rsq between pair of SNP Optional parameter x define the threshold. default value 0.1

OPTION saveCleanSNPs

Save clean genotype data with excluded SNP and animals; based on the OPTIONS specified.
*_clean files are created:

gt_clean
gt_clean_XrefID

*_removed files are created:

gt_SNPs_removed
gt_Animals_removed

with gt being the genotype file

OPTION no_quality_control

This option turns off all quality control !!!!
Useful to speed up run when previous QC data was performed.

OPTION outcallrate

This option will print all call rate information for SNP and individuals.
The files callrate (SNP) and callrate_a (Individuals) are created.

OPTION h2_gene_content

It checks that the heritability of gene content is equal or close to 1 as described in Forneris et al. Genetics 199.3 (2015): 675-681. Markers with estimated h2<0.98 and significant p-values of the LRT (p<0.01) are discarded. In addition, heritability and status of each marker are written in file h2_gc_test.

The test is useful for homogenous populations (breeds) but theory does not hold for crossbred animals. This test uses explicitly inv(A22) so it is not suitable for very large populations.

Quality Control for Off-diagonal of A22 and G

OPTION thrWarnCorAG x

Set the threshold to issue a warning if cor(A22,G) < x
default value 0.5

OPTION thrStopCorAG x

Set the threshold to Stop the analysis if cor(A22,G) < x
default values 0.3

OPTION  thrCorAG x

Set the threshold to calculate corr(A22,G) for only A22 >= x
default values 0.02

Options for H

Options include different weights to create GimA22i as:

tau inv(alpha G + beta A22 + gamma I + delta) - omega inv(A22)

where the parameters are to scale the genomic info to be compatible with
the pedigree info, to make matrices invertible in the presence of clones,
and to control bias.

The defaults values are:

tau=1 alpha =0.95 beta = 0.05 gamma=0 delta=0 omega=1

Options to change these defaults are specified with:

OPTION TauOmega tau omega

OPTION AlphaBeta alpha beta

OPTION GammaDelta gamma delta

OPTION tunedG x

Scale G based on A22.
The variable x can be:

0: no scaling
1: mean(diag(G))=1, mean(offdiag(G))=0
2: mean(diag(G))=mean(diag(A22)), mean(offdiag(G))=mean(offdiag(A22)) (default)
3: mean(G)=mean(A22)
4: rescale G using Fst adjustment. As in Powell et al. (2010) or Vitezica et al. (2011)
9: arbitrary parameters: specify two additional numbers $a$ and $b$ in $a+b\mathbf{G}$ as OPTION tunedG 9 a b.

GWAS options (PostGSF90)

OPTION Manhattan_plot

Plot using GNUPLOT the Manhattan plot (SNP effects) for each trait and correlated effect.

OPTION Manhattan_plot_R

Plot using R the Manhattan plot (SNP effects) for each trait and correlated effect.
pdf images are created: manplot_St1e2.pdf, but other formats can be specified.
Note: t1e2 corresponds to trait 1, effect 2.

OPTION Manhattan_plot_R_format <format>

Control the format type to create images in R
format values accepted:

pdf (default)
png
tif

OPTION plotsnp <n>

Control the values of SNP effects to use in Manhattan plots

1: plot regular SNP effects: abs(val)
2: plot standardized SNP effects: abs(val/sd) (default)

OPTION SNP_moving_average n

Solutions for SNP effects will be by moving average of n adjacents SNPs.

OPTION windows_variance n

Calculates the variance explained by n adjacents SNPs.

When this option is used, the sum of variance explained by n adjacent SNPs (column 8 of snp_sol or column 3 of chrsnpvar) is not 100%.
This is because moving variance is used. If windows size is 20, the proportion of variance assigned to SNP 1 is calculated from SNP 1 to 20, for SNP 2 it goes from 2 to 21, for SNP 3 it goes from 3 to 22, and so forth.
A file called windows_variance has variance that sums to 100% in column 9.

OPTION windows_variance_mbp n

Calculates the variance explained by n Mb window of adjacents SNPs.

OPTION which_weight x

Generates a weight variable w to be used in the creation of a weighted genomic relationship matrix G=ZDZ'

1: w = y^2 * (2(p(1-p)))
2: w = y^2
3: experimental with the degree of brief
4: w = C**(abs(y)/sqrt(var(y's))-2) from VanRaden et al. (2009)
nonlinearA: same as 4

where y is the SNP solution, with scaled weight = w * nSnp/sum(w); and C is 1.125 by default (enable to change it using the second argument of the option line e.g. OPTION which_weight nonlinearA 1.2).

OPTION solutions_postGS x

Sets the file name for the solutions file, default solutions

OPTION postgs_trt_eff x1 x2

Computes postGS calculations (SNP solutions, variance explained, etc.) for only trait: x1 and effect: x2

OPTION snp_p_value

Computes p-values for GWAS from elements of the inverse of the Mixed Model Equations previously obtained from blupf90. This requires quite a lot of memory and time. For details see https://doi.org/10.1186/s12711-019-0469-3.

Output files for GWAS (postGSf90)

snp_sol

contains solutions of SNP and weights

1: trait
2: effect
3: SNP
4: Chromosome
5: Position
6: SNP solution
7: weight

if OPTION windows_variance is used

8: variance explained by n adjacents SNP.

if OPTION snp_p_value is used

9: variance of the SNP solution (used to compute the p-value)

chrsnp

contains data to create plot by GNUPLOT

1: trait
2: effect
3: values of SNP effects to use in Manhattan plots
4: SNP
5: Chromosome
6: Position

chrsnp_pval

contains solutions of SNP and weights

1: trait
2: effect
3: -log10(p-value)
4: SNP
5: Chromosome
6: Position in bp

chrsnpvar

contains data to create plot by GNUPLOT

1: trait
2: effect
3: variance explained by n adjacents SNP
4: SNP
5: Chromosome
6: Position

windows_segment

contains information of windows segments used to get variance explainded

1: label
2: window size (number of SNP)
3: Start SNP number for the window
4: End SNP number for the window
5: identification of window: (ChrNumber)'_'(startPositionMBP)
6: Start (ChrNumber)'_'(Position) for the window
7: End (ChrNumber)'_'(Position) for the window

windows_variance

contains variance explained for the biggest non-overlapping windows segments

1: trait
2: effect
3: Start SNP number or SNP name for the window
4: End SNP number or SNP name for the window
5: window size (number of SNP)
6: Start (ChrNumber)'_'(Position) for the window
7: End (ChrNumber)'_'(Position) for the window
8: identification of window: (ChrNumber)'_'(startPositionMBP)
9: variance explained by n adjacents SNP

snp_pred

contains allele frequencies + SNP effects

Graphic control files

Several files are created to generate graphics using either GNUPLOT or R
File names rules:

e.g.: 'Sft1e2.R'

first letter indicate
  'S' for solutions of SNP 
  'V' for variance explained
  
t1e2
  indicates that the file is for the trait 1 and the effect 2
  
filename extension
   .gnuplot
   .R
   .pdf 
   .png
   .tif

Misc options

OPTION num_threads_pregs n

Specify number of threads to be used with MKL-OpenMP for creation and inversion of matrices

OPTION num_threads_iod n

Specify number of threads to be used with MKL-OpenMP in BLUP90IOD program
for matrix-vector multiplications in the PCG algorithm

OPTION graphics s

Allows to generate plots with GNUPLOT program.
If present optional parameter s, set the time in seconds to show the plot.
Avoid using in batch programs !!!.

OPTION msg x

x set level of verbose; 0 minimal; 1 gives more diagnostic information

Save and Read options

OPTION saveAscii

Save files intermediate matrices (GimA22i,G,Gi,etc) files as ASCII (default=binary)

OPTION saveHinv

Save H inverse matrix in Hinv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect

OPTION saveAinv

Save A inverse matrix in Ainv.txt
Format: i,j,val
with i,j, the index level for the additive genetic effect

The following options use the information of the Original ID (alphanumeric) stored in the 10th column of the renaddxx.ped file, create by the RENUMF90 program.

OPTION saveHinvOrig

Save the H inverse matrix with original IDs

OPTION saveAinvOrig

Save the A inverse matrix with original IDs

OPTION saveDiagGOrig

Save diagonal of G matrix in DiagGOrig.txt
Format: id, val
with id the original IDs

OPTION saveGOrig

Save G matrix in G_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs

OPTION saveA22Orig

Save A22 matrix in A22_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs

OPTION saveGimA22iOrig

Save GimA22i matrix in GimA22i_Orig.txt
Format: id_i, id_j, val
with id_i and id_j the original IDs as stored in the renum file

OPTION readOrigId

Read information from renaddxx.ped file, Original ID and possibly year of birth for its use in parent-progeny conflict output.
Only need if not of the previous save*Orig are present.

OPTION saveGimA22iRen

Save GimA22i matrix in GimA22i_Ren.txt
Format: id_i, id_j, val
with id_i and id_j the IDs as read from the data/pedigree file

OPTION savePLINK

Save Genotype in PLINK format
files: toPLINK.ped and toPLINK.map

OPTION no_full_binary

Save the elements of half-matrix instead of the full matrix. It is useful to keep the compatibility with the older version of preGSf90.

Save and Read intermediate files

OPTION readGimA22i file

This option is used in analyses programs (BLUPF90,REMLF90, etc.) in order to use matrices stored in GimA22i file (default filename).
In general methods used to create and invert matrices in such programs dont use optimized version.
For large number of genotyped animals run first PreGSf90 and the read stored matrices in analyses programs.

Optional file can be used to specify other filename or path, for example OPTION readGimA22i ../../pregsrun/GimA22i.

Other intermediate matrices files can be stored for inspection or for it use in BLUPF90 programs as user_file type of random effect. See tricks and REMLF90 for details.

OPTION saveA22 
OPTION saveA22Inverse 
OPTION saveA22InverseOrig
OPTION saveG [all]
       if optional all is present all intermediate matrices for G are saved
OPTION saveGInverse 
OPTION saveGInverseOrig
OPTION saveGmA22 

OPTION readG [file]
OPTION readGInverse [file]
OPTION readA22 [file]
OPTION readA22Inverse [file]
OPTION readGmA22 [file]

DEPRECATED OPTIONS

OPTION chrinfo file