Simulation and analysis of genomic SNP data
I. Misztal, University of Georgia, 9/2009-8/2009
Program SNP_SIM is a simple simulator of SNP genotypes and phenotypes. The following can be varied:
- Number of animals
- Type of animals: with own records or progeny tested, with a given number of progeny with records
- Number of SNP
- Distribution of gene (SNP) frequency
- Number of QTL (SNP with nonzero effect)
- Distribution of SNP effects
- Heritability
All SNPs are assumed independent.
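As an illustration of how the options above fit together, the following is a minimal Python sketch of a simulator of this kind. It is not the SNP_SIM source. The assumptions are: minor allele frequencies drawn uniformly from the given range, genotypes coded 0/1/2 as the number of copies of the minor allele (the coding in the actual data file may differ), normally distributed QTL effects, and breeding values rescaled so that their sample variance equals the requested additive variance. The function name and arguments are hypothetical.

```python
import random


def simulate_snp_data(n_rec, n_snp, n_qtl, freq_lo, freq_hi,
                      var_a, h2, mean, seed):
    """Simulate independent SNP genotypes, breeding values and phenotypes."""
    rng = random.Random(seed)

    # One minor allele frequency per SNP, uniform in [freq_lo, freq_hi].
    freqs = [rng.uniform(freq_lo, freq_hi) for _ in range(n_snp)]

    # Assumption: the first n_qtl SNP carry a normally distributed effect,
    # all remaining SNP have zero effect.
    effects = [rng.gauss(0.0, 1.0) if j < n_qtl else 0.0
               for j in range(n_snp)]

    # Each genotype is the sum of two independent allele draws (0, 1 or 2).
    genos = [[(rng.random() < p) + (rng.random() < p) for p in freqs]
             for _ in range(n_rec)]

    # Unscaled breeding values, then centred and rescaled to variance var_a.
    raw = [sum(g * a for g, a in zip(row, effects)) for row in genos]
    m = sum(raw) / n_rec
    v = sum((x - m) ** 2 for x in raw) / (n_rec - 1)
    scale = (var_a / v) ** 0.5
    bv = [(x - m) * scale for x in raw]

    # Residual variance implied by h2 = var_a / (var_a + var_e).
    var_e = var_a * (1.0 - h2) / h2
    y = [mean + b + rng.gauss(0.0, var_e ** 0.5) for b in bv]
    return freqs, genos, bv, y
```

Calling the function twice with the same arguments and seed returns identical data, because all random draws come from one seeded generator.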
Program GEN_SEL analyses the genomic/phenotypic data as simulated by SNP_SIM. All SNP effects are treated as random, with the variance ratio provided as a parameter. The output includes the correlation between simulated and predicted breeding values for the original sample and for an independent sample. Also available are the simulated QTL effects and the estimated marker effects.
GEN_SEL uses two efficient solvers as in Legarra and Misztal (2008).
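To make the idea of random SNP effects with a variance ratio concrete, here is a minimal Python sketch of one common iterative scheme for such models, Gauss-Seidel with residual update. It assumes the model y = mean + Z a + e, where Z holds the genotype codes and each SNP effect is shrunk by the variance ratio lam. Whether this matches either solver in GEN_SEL is an assumption; the function name and arguments are hypothetical.

```python
def gsru(Z, y, lam, n_rounds=1000, tol=1e-20):
    """Solve for the mean and SNP effects by Gauss-Seidel with residual update.

    Z   : list of records, each a list of genotype codes
    y   : list of phenotypes
    lam : variance ratio added to the diagonal of every SNP equation
    """
    n, m = len(y), len(Z[0])
    mu = 0.0
    a = [0.0] * m
    e = list(y)  # residuals for the starting values mu = 0, a = 0
    zz = [sum(Z[i][j] ** 2 for i in range(n)) for j in range(m)]

    for _ in range(n_rounds):
        change = 0.0

        # Update the mean and keep the residuals in step with it.
        d = sum(e) / n
        mu += d
        e = [ei - d for ei in e]
        change += d * d

        # Update one SNP effect at a time using the current residuals.
        for j in range(m):
            rhs = sum(Z[i][j] * e[i] for i in range(n)) + zz[j] * a[j]
            new_a = rhs / (zz[j] + lam)
            d = new_a - a[j]
            if d != 0.0:
                for i in range(n):
                    e[i] -= Z[i][j] * d
            a[j] = new_a
            change += d * d

        if change < tol:  # sum of squared changes in this round
            break
    return mu, a
```

The residual update avoids forming the full coefficient matrix: only the diagonal elements and the current residual vector are kept in memory.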
The programs are written in Fortran and can be used as a starting point for more complicated analyses or for a lesson in Fortran 95 programming.
Please note that the pseudo-random number generator is quite simple.
Example
Parameter file par_SNPSIM
Number of SNP
4000
Number of QTL (number of SNP with nonzero effect)
100
Number of progeny; 0 if only own records
500
Number of records
2000
Lower and upper range of SNP frequencies; upper must be <= 0.5
0.1 0.5
Range of SNP values (from 1 to 100000)
20
Additive variance
200
Heritability (between 0.001 and 0.999)
0.3
Mean
100
Random seed
2578
Data simulation
D:\snp_sim
Program SNP_SIM 1.2(I. Misztal, UGA)
using parameters in file:par_snpsim
4000 SNP
100 of SNP have nozero effects
500 progenies per sire
2000 records
Minor SNP frequencies range: 0.100000 0.500000
SNP range of values: 20.0000
Additive variance= 200.000
h2= 0.300000
Mean= 100.000
Random seed= 2578
Simulated 2000 records with 4000 SNP effects
Simulated parameters are in file SNP-setup
Simulated records are in file SNP-data
File SNP-setup
n_snp n_qtl nrec nprog nrec
4000 100 2000 500 2000
var_a,var_e, sim_var
200.00 466.67 2481.66
snp_freq
0.24 0.25 0.32 0.12 0.14 0.16 0.36 0.44 0.15 0.25
0.27 0.33 0.30 0.40 0.32 0.18 0.42 0.17 0.48 0.16
………………………………………………
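The var_e value reported in SNP-setup is consistent with the usual definition of heritability, h2 = var_a / (var_a + var_e). The short check below reproduces it from the additive variance and heritability in the parameter file. That the program computes var_e this way is an assumption, and the function name is hypothetical.

```python
def residual_variance(var_a, h2):
    """Residual variance implied by h2 = var_a / (var_a + var_e)."""
    return var_a * (1.0 - h2) / h2


# Additive variance 200 and heritability 0.3, as in par_SNPSIM.
print(f"{residual_variance(200.0, 0.3):.2f}")  # prints 466.67
```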
File SNP-data
1 71.26 -24.68 2111211111112221111112211…….
2 99.40 -0.23 1221111211112211211121121……
2 83.91 -21.84 111121111111121211112111…….
Data analysis
D:\gen_sel
Program GEN_SEL 1.1 (I. Misztal, UGA)
variance ratio?
100
Read 2000 records with 4000 SNP effects
5 elements of XPX: 2000.0 1308.0 1253.0 1124.0 1553.0
Iteration started at 2.56 seconds
convergence of 0.60E-14 reached in 78 rounds and 22.45 seconds
Results for training data set
y simulated_bv genomic_bv
mean 94.38 -5.61 -3.43
var 206.46 201.84 183.59
corr(s_bv,g_bv):0.988
Results for independent data set
y simulated_bv genomic_bv
mean 94.74 -5.21 -3.06
var 211.14 205.72 83.25
corr(s_bv,g_bv):0.692
Finished in 27.23 seconds
File SNP_results
Estimated mean: 97.80
Simulated and estimated genomic values
1 -5.68 -2.27
2 5.51 2.46
3 -5.34 -2.27
……..
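For examining the results outside the program, the listing above can be read with a few lines of Python. This is a sketch that assumes the layout shown: a line starting with "Estimated mean:", a header line, then rows containing an index, a simulated value and an estimated value. Any line that does not fit this pattern is skipped. The function names are hypothetical.

```python
def read_snp_results(text):
    """Return (estimated mean, simulated values, estimated values)."""
    est_mean, simulated, estimated = None, [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Estimated mean:"):
            est_mean = float(line.split(":")[1])
            continue
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            int(parts[0])
            s, g = float(parts[1]), float(parts[2])
        except ValueError:
            continue  # header or other non-numeric line
        simulated.append(s)
        estimated.append(g)
    return est_mean, simulated, estimated


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

Plotting the simulated against the estimated values, or computing their correlation with pearson, shows how strongly the estimates are shrunk relative to the simulated values.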
Sample exercise
Note that SNP_SIM, as used in this exercise, makes a number of assumptions. The most important one is that the SNP markers are located on the genes themselves. Therefore, any conclusions from the simulation are approximate and in certain cases may be wrong.
Simulate SNP data using program snp_sim. The initial parameters can be:
500 records
1000 SNP
200 QTL
genetic variance: 100
heritability: 0.2
Range of gene frequencies for the minor allele: 0.1 to 0.5
Run genomic predictions using program gen_sel. Find which variance ratio provides the highest correlation between true and estimated breeding values for the test data set. Examine the simulated and estimated SNP effects.
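When several variance ratios are tried, the correlations can be collected from saved screen output rather than copied by hand. The Python sketch below assumes the output format shown in the example run, where the first "corr(s_bv,g_bv):" line refers to the training data set and the second to the independent data set. The function names and the idea of storing each run's output as text keyed by variance ratio are hypothetical.

```python
import re

# Matches lines such as "corr(s_bv,g_bv):0.692" in the gen_sel output.
CORR = re.compile(r"corr\(s_bv,g_bv\):\s*([-+]?\d*\.?\d+)")


def correlations(output_text):
    """Return all correlations found in one gen_sel run, in order."""
    return [float(x) for x in CORR.findall(output_text)]


def best_ratio(outputs):
    """Pick the variance ratio with the highest test-set correlation.

    outputs maps each variance ratio to the saved output of that run;
    the second correlation is taken as the independent data set.
    """
    return max(outputs, key=lambda ratio: correlations(outputs[ratio])[1])
```

For example, correlations applied to the output of the run shown earlier returns [0.988, 0.692].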
Change one or two simulation parameters and create graphs of the correlations. For example, find the correlations for several values of the SNP or QTL count. Also compare the correlations obtained with different numbers of daughters per sire, or with own records only. Possible variations could include:
h2 from 0.02 to 0.5
nprog from 0 to 10 to 500
nrec from 100 to 5000
n_snp from 500 to 3000
n_qtl from to 1000
range of SNP values from 2 to 1000
lower range of SNP frequencies from 0.1 to 0.49
The optimal variance ratio maximizes the correlation between predicted and simulated BV. It will differ depending on the simulation parameters.
If results vary greatly, run replicates in which each run uses a different seed for the pseudo-random number generator. With the same parameters and the same seed, every run produces identical data.
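Replicates can be prepared by generating one parameter file per seed and then averaging the resulting correlations. The Python sketch below writes the parameter text in the layout of the par_SNPSIM example, with a label line followed by a value line. It assumes that SNP_SIM reads the values in this order and ignores the wording of the label lines. The function names are hypothetical.

```python
import statistics


def par_text(n_snp, n_qtl, n_prog, n_rec, freq_lo, freq_hi,
             snp_range, var_a, h2, mean, seed):
    """Build parameter-file text: each label line is followed by its value."""
    items = [
        ("Number of SNP", n_snp),
        ("Number of QTL (number of SNP with nonzero effect)", n_qtl),
        ("Number of progeny; 0 if only own records", n_prog),
        ("Number of records", n_rec),
        ("Lower and upper range of SNP frequencies; upper must be <= 0.5",
         f"{freq_lo} {freq_hi}"),
        ("Range of SNP values (from 1 to 100000)", snp_range),
        ("Additive variance", var_a),
        ("Heritability (between 0.001 and 0.999)", h2),
        ("Mean", mean),
        ("Random seed", seed),
    ]
    return "\n".join(f"{label}\n{value}" for label, value in items) + "\n"


def summarize(correlations):
    """Mean and standard deviation of correlations across replicates."""
    return statistics.mean(correlations), statistics.stdev(correlations)


# One parameter text per replicate; only the seed changes.
replicates = {seed: par_text(1000, 200, 0, 500, 0.1, 0.5, 20, 100, 0.2, 100, seed)
              for seed in (2578, 2579, 2580)}
```

Reporting the mean together with the standard deviation across seeds helps to separate real effects of a parameter change from sampling noise.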
Peculiarities of the program
The program does not simulate meiosis or mating, only the base population. It generates twice the number of records requested; one set is used for estimation and the other only for prediction. With nprog>0, the generated phenotypic records are similar to DYD (daughter yield deviations) plus the mean.