Simulation and analysis of genomic SNP data

I.        Misztal, University of Georgia, 9/2009-8/2009

 

Program SNP_SIM is a simple simulator of SNP genotypes and phenotypes. The following can be varied:

-          Number of animals

-          Type of animals: with own records or progeny tested, with a given number of progeny with records

-          Number of SNP

-          Distribution of gene (SNP) frequency

-          Number of QTL (SNP with nonzero effect)

-          Distribution of SNP effects

-          Heritability

All SNPs are assumed independent.
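The options above can be illustrated with a minimal sketch in Python. It is not the SNP_SIM source: the function name simulate, the 1/2 genotype coding, the uniform draw of QTL effects, and the rescaling of breeding values to the requested additive variance are all assumptions made for illustration. Only own records are simulated here.

```python
import random

def simulate(n_rec, n_snp, n_qtl, freq_lo, freq_hi, var_a, h2, mean, seed):
    """Toy simulator: independent SNPs, additive QTL effects, own records only."""
    rng = random.Random(seed)
    # Frequency of genotype code 2 at each SNP, uniform within the requested range.
    freq = [rng.uniform(freq_lo, freq_hi) for _ in range(n_snp)]
    # Assumption: the first n_qtl SNPs are the QTL, with effects uniform on [-1, 1].
    effect = [rng.uniform(-1.0, 1.0) if j < n_qtl else 0.0 for j in range(n_snp)]
    # Genotype codes 1 or 2; every SNP is drawn independently of all others.
    geno = [[2 if rng.random() < freq[j] else 1 for j in range(n_snp)]
            for _ in range(n_rec)]
    raw_bv = [sum(g * a for g, a in zip(row, effect)) for row in geno]
    # Centre and rescale so the sample variance of breeding values equals var_a.
    m = sum(raw_bv) / n_rec
    v = sum((b - m) ** 2 for b in raw_bv) / (n_rec - 1)
    scale = (var_a / v) ** 0.5
    bv = [(b - m) * scale for b in raw_bv]
    # Residual variance implied by h2 = var_a / (var_a + var_e).
    var_e = var_a * (1.0 - h2) / h2
    y = [mean + b + rng.gauss(0.0, var_e ** 0.5) for b in bv]
    return geno, bv, y, var_e

geno, bv, y, var_e = simulate(50, 20, 5, 0.1, 0.5, 200.0, 0.3, 100.0, 2578)
print(round(var_e, 2))  # 466.67 for var_a = 200 and h2 = 0.3
```

The residual variance follows directly from the definition of heritability, so it depends only on the additive variance and h2, not on the number of SNP or QTL.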

 

Program GEN_SEL analyzes the genomic/phenotypic data simulated by SNP_SIM. All SNP effects are treated as random, with the variance ratio provided as a parameter. The output includes the correlation between simulated and predicted breeding values in the original sample and in an independent sample. The simulated QTL effects and the estimated marker effects are also available.

 

GEN_SEL uses two efficient solvers as in Legarra and Misztal (2008).  
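To show what fitting all SNP effects as random with a given variance ratio involves, here is a small sketch of Gauss-Seidel iteration with residual update for the model y = mean + sum of (genotype x SNP effect) + residual. It is an illustration in Python, not the GEN_SEL code, and whether it corresponds to either of the GEN_SEL solvers is an assumption. The names gsru, ratio, max_rounds and tol are hypothetical.

```python
def gsru(geno, y, ratio, max_rounds=1000, tol=1e-12):
    """Solve for the mean and SNP effects; 'ratio' is added to each SNP diagonal."""
    n, m = len(y), len(geno[0])
    # Store genotypes by column so each SNP can be updated in turn.
    cols = [[float(row[j]) for row in geno] for j in range(m)]
    zz = [sum(z * z for z in col) for col in cols]
    mu, a = 0.0, [0.0] * m
    e = list(y)  # residuals y - mu - Z a, with all solutions starting at zero
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Update the mean (no shrinkage) and adjust the residuals.
        delta = sum(e) / n
        mu += delta
        e = [ei - delta for ei in e]
        change = delta * delta
        for j, col in enumerate(cols):
            # Right-hand side for SNP j, adding back its current contribution.
            rhs = sum(z * ei for z, ei in zip(col, e)) + zz[j] * a[j]
            new = rhs / (zz[j] + ratio)
            d = new - a[j]
            if d != 0.0:
                e = [ei - z * d for ei, z in zip(e, col)]
            a[j] = new
            change += d * d
        if change < tol:  # stop when the solutions no longer move
            break
    return mu, a, rounds

geno = [[1, 2], [2, 1], [1, 1], [2, 2], [1, 2]]
y = [10.0, 12.0, 9.0, 14.0, 11.0]
mu, a, rounds = gsru(geno, y, ratio=1.0)
print(mu, a, rounds)
```

Keeping the residual vector up to date means each SNP update costs one pass over the records, and the full coefficient matrix is never formed. A larger ratio shrinks the SNP solutions more strongly toward zero.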

 

The programs are written in Fortran and can be used as a starting point for more complicated analyses or for a lesson in Fortran 95 programming.

 Please note that the pseudo-random number generator is quite simple.

 

 

Example

 

Parameter file par_SNPSIM

Number of SNP

4000

Number of QTL (number of SNP with nonzero effect)

100

Number of progeny; 0 if only own records

500

Number of records

2000

Lower and upper range of SNP frequencies; upper must be <= 0.5

0.1 0.5

Range of SNP values (from 1 to 100000)

20

Additive variance

200

Heritability (between 0.001 and 0.999)

0.3

Mean

100

Random seed

2578

 

Data simulation

D:\snp_sim

Program SNP_SIM 1.2(I. Misztal, UGA)

 using parameters in file:par_snpsim

   4000  SNP

   100  of SNP have nozero effects

   500  progenies per sire

   2000  records

 Minor SNP frequencies range:   0.100000  0.500000

 SNP range of values:  20.0000

 Additive variance=  200.000

 h2=  0.300000

 Mean=  100.000

 Random seed=  2578

Simulated 2000 records with 4000 SNP effects

Simulated parameters are in file SNP-setup

Simulated records are in file SNP-data

 

File SNP-setup

n_snp  n_qtl nrec nprog nrec

    4000     100    2000     500    2000

 var_a,var_e, sim_var

    200.00    466.67   2481.66

 snp_freq

 0.24 0.25 0.32 0.12 0.14 0.16 0.36 0.44 0.15 0.25

 0.27 0.33 0.30 0.40 0.32 0.18 0.42 0.17 0.48 0.16

………………………………………………

 

File SNP_data

        1    71.26  -24.68 2111211111112221111112211…….

        2    99.40   -0.23 1221111211112211211121121……

        2    83.91  -21.84 111121111111121211112111…….

 

 

Data analysis

 

D:\gen_sel

Program GEN_SEL 1.1 (I. Misztal, UGA)

 variance ratio?

100

 Read 2000 records with 4000 SNP effects

  5 elements of XPX:   2000.0  1308.0  1253.0  1124.0  1553.0

Iteration started at     2.56 seconds

convergence of 0.60E-14 reached in    78 rounds and 22.45 seconds

 

 

Results for training data set

                    y           simulated_bv     genomic_bv

 mean              94.38          -5.61          -3.43

 var              206.46         201.84         183.59

 corr(s_bv,g_bv):0.988

 

 

Results for independent data set

                    y           simulated_bv     genomic_bv

 mean              94.74          -5.21          -3.06

 var              211.14         205.72          83.25

 corr(s_bv,g_bv):0.692

 

Finished in    27.23 seconds

 

 

File SNP_results

Estimated mean:      97.80

Simulated and estimated genomic values

     1     -5.68     -2.27

     2      5.51      2.46

     3     -5.34     -2.27

……..
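A file laid out like the listing above can be summarized with a few lines of Python. The sketch below assumes that every data line holds an index followed by the simulated and the estimated value, as in the lines shown. The function names read_pairs and pearson are hypothetical, and the three sample lines are only those visible in the listing.

```python
import math

def read_pairs(lines):
    """Collect (simulated, estimated) pairs from lines of the form 'index sim est'."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip headers, blank lines and elision marks
        try:
            pairs.append((float(parts[1]), float(parts[2])))
        except ValueError:
            continue  # three tokens, but not numeric (e.g. the 'Estimated mean' line)
    return pairs

def pearson(pairs):
    """Pearson correlation between the two columns."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

sample = [
    "Estimated mean:      97.80",
    "Simulated and estimated genomic values",
    "     1     -5.68     -2.27",
    "     2      5.51      2.46",
    "     3     -5.34     -2.27",
]
pairs = read_pairs(sample)
print(len(pairs), pearson(pairs))
```

To process a real file, the list of sample lines would be replaced by the lines read from that file. The estimated values in the listing are smaller in magnitude than the simulated ones, which is the shrinkage expected when effects are fitted as random.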

 

 

 

 

Sample exercise

Note that SNP_SIM makes a number of assumptions. The most important one is that the SNP markers are located on the genes themselves. Therefore any conclusions from the simulation are approximate, and in certain cases may be wrong.

Simulate SNP data using program snp_sim. The initial parameters can be:

500 records

1000 SNP

200 QTL

genetic variance: 100

heritability: 0.2

Range of gene frequencies for the minor allele: 0.1 to 0.5

Range of SNP effects for the minor allele: 5

 

Run genomic predictions using program gen_sel. Find which variance ratio provides the highest correlation between true and estimated breeding values for the test (independent) data set. Examine the simulated and estimated SNP effects.

 

Change one or two simulation parameters and create graphs of the correlations. For example, find the correlations for several values of SNP or QTL count. Also compare the correlations obtained with different numbers of daughters per sire, or with own records only. Possible variations include:

 

h2 from 0.02 to 0.5

nprog from 0 to 10 to 500

nrec from 100 to 5000

n_snp from 500 to 3000

n_qtl from to 1000

range of SNP values from 2 to 1000.

lower range of SNP frequencies from 0.1 to 0.49.

 

The optimal variance ratio is the one that maximizes the correlation between predicted and simulated BV. It differs with the simulation parameters.

 

 If results vary greatly, run replicates, each with a different seed for the pseudo-random number generator. With the same parameters and the same seed, every run produces identical data.
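Replicates can be summarized by their mean and standard deviation. The sketch below shows one way to do this in Python. The helper summarize_replicates and the stand-in toy_run are hypothetical: in practice, the value for each seed would be the correlation reported by one simulation and analysis run with that seed.

```python
import random
import statistics

def summarize_replicates(run_once, seeds):
    """Call run_once(seed) for every seed; return the mean and SD of the results."""
    results = [run_once(seed) for seed in seeds]
    return statistics.mean(results), statistics.stdev(results)

def toy_run(seed):
    # Stand-in for one simulate-and-predict cycle (assumption, for illustration
    # only): returns a pseudo-random "correlation" between 0.6 and 0.7.
    rng = random.Random(seed)
    return 0.6 + 0.1 * rng.random()

# The same seed always reproduces the same result.
print(toy_run(2578) == toy_run(2578))  # True

mean_corr, sd_corr = summarize_replicates(toy_run, [2578, 2579, 2580, 2581, 2582])
print(mean_corr, sd_corr)
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between two parameter settings exceeds the replicate-to-replicate noise.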

 

Peculiarities of the program

The program does not simulate meiosis or mating, only the base population. It generates twice the number of records requested; the first set is used for estimation and the second only for prediction. With nprog>0, the phenotypic records generated are similar to DYD (daughter yield deviation) + the mean.