Variance component estimation with longitudinal data: a simulation study with alternative methods

A pedigree structure distributed over three different places was generated. For each offspring, phenotypic information was generated at five different ages (12, 30, 48, 66 and 84 months). The data file was simulated allowing some information to be lost (10, 20, 30 and 40%), either by a random process or by removing the records of the individuals with the lowest phenotypic values, the latter representing the effect of selection. Three alternative analyses were used: the repeatability model, the random regression model and the multiple-trait model. Random regression described the covariance structure of growth over time continuously and more adequately than the single-trait and repeatability models when the correlations between successive measurements on the same individual differed from one another. Without selection, the random regression and multiple-trait models gave very similar results.


INTRODUCTION
Data obtained by successive measurements on one experimental unit or individual, the so-called longitudinal data, can be analyzed using different strategies. In genetic breeding an interesting approach is to use random regression. Henderson Junior (1982) proposed the theory of random regression coefficients, based on the principle that if a regression coefficient is defined for each individual in an experiment, and if the individuals are a random sample of the population, then the regression coefficients must be considered random. This methodology has been used to model traits that are measured over time, such as growth traits. To adopt this model, the measurements over time are considered to be successive points on a continuous trajectory, which also permits prediction at points (ages) where no measurements were taken. To describe the fixed curve for all individuals, as well as the individual curves, covariance functions (Kirkpatrick et al. 1990) that describe the covariance structure between ages can be used. In this context, covariance functions based on Legendre polynomials have been used because they make calculations and interpretations easier.
Another alternative for the analysis of longitudinal data is the repeatability model, but it requires the assumption that successive measurements on the same individual have a correlation equal to one. Such an assumption is not always valid because, for growth traits for example, measurements taken close together are more strongly correlated than those more distant in time.
In a multi-trait approach no assumptions need to be made regarding the covariance structure, which means that an unstructured covariance matrix is used. Consequently, this alternative requires the estimation of a large number of parameters and may cause many computational difficulties.
The objective of this paper was to study the estimation of variance components and genetic parameters with random regression, repeatability and multi-trait models applied to longitudinal data with different levels of lost information.

MATERIAL AND METHODS
The data were simulated to represent a crossing structure distributed over three different places. A progeny test was simulated by crossing 30 males, each with three different females, with each cross generating ten offspring. The fixed effect of place was created so that places did not differ significantly (similar means and variances). For each offspring there were phenotypic data at five different ages (12, 30, 48, 66 and 84 months). This resulted in 120 parents (30 males and 90 females) and a total of 1,020 individuals, 900 of which had information at five different ages, giving 4,500 production records.
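The counts in this design can be checked with a short sketch. It is only an illustration of the mating structure described above, not the authors' SAS program; in particular, the way sires are allocated to the three places (`place = sire index mod 3`) is an assumption, since the text does not say how families were distributed.

```python
N_SIRES = 30              # males in the progeny test (from the text)
DAMS_PER_SIRE = 3         # females crossed with each male (from the text)
OFFSPRING_PER_CROSS = 10  # offspring per cross (from the text)
AGES = (12, 30, 48, 66, 84)  # months
N_PLACES = 3


def build_pedigree():
    """Return rows (animal, sire, dam, place); parents have unknown ancestry (None)."""
    rows = []
    next_id = 1
    sires = []
    for _ in range(N_SIRES):
        rows.append((next_id, None, None, None))
        sires.append(next_id)
        next_id += 1
    for s_idx, sire in enumerate(sires):
        # ASSUMPTION: sire families are spread evenly over the three places.
        place = s_idx % N_PLACES + 1
        for _ in range(DAMS_PER_SIRE):
            dam = next_id
            rows.append((dam, None, None, None))
            next_id += 1
            for _ in range(OFFSPRING_PER_CROSS):
                rows.append((next_id, sire, dam, place))
                next_id += 1
    return rows


pedigree = build_pedigree()
parents = [row for row in pedigree if row[1] is None]
offspring = [row for row in pedigree if row[1] is not None]
n_records = len(offspring) * len(AGES)
print(len(parents), len(offspring), len(pedigree), n_records)
# prints: 120 900 1020 4500
```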
The simulated longitudinal data can be described by the mixed linear model $y = X\beta + Za + Wp + \varepsilon$, where y is the vector of observations from each individual; β is the vector of fixed effects and of the general curve parameters for all individuals; X is the incidence matrix of the fixed-effect levels and of the regression variables, corresponding to the standardized ages associated with the Legendre polynomials; a and p are the vectors of random regression solutions for the additive genetic effect and for the permanent environmental effect, respectively; Z and W are the matrices that associate the standardized ages, through the Legendre polynomials, with the a and p vectors, respectively; and ε is the vector of random temporary environmental effects.
Assuming that the vectors y, a, p and ε have a normal distribution, then E(a) = 0 and its variance is $V(a) = A \otimes K_a$, where $K_a$ is the covariance matrix between the random regression coefficients of the additive genetic effect and A is the matrix of relationships between individuals, of dimension equal to the total number of individuals (N). The vector p has E(p) = 0 and variance $V(p) = I \otimes K_p = P$, where $K_p$ is the covariance matrix between the random regression coefficients of the permanent environmental effect and I is an identity matrix of dimension equal to the number of individuals with information (n). The fixed effects and the general regression curve for all individuals are associated with the β vector.

The data were simulated with a second-degree polynomial model, using Legendre orthogonal polynomials to describe both the fixed trajectory and the random effects of the model. The temporary residual effect was assumed to have a normal distribution with mean zero and variance $\sigma^2_\varepsilon = 2.2$ units². The covariance matrices $K_a$ and $K_p$ of the random regression coefficients of the additive genetic and permanent environmental effects, respectively, were set to fixed values [matrices $K_a$ and $K_p$ not recoverable].

The β vector containing the fixed curve was obtained from the vector [15.14 26.82 35.72 41.84 45.1], which represents the average production at months 12, 30, 48, 66 and 84, respectively, and from the Φ matrix, which is the product of the standardized-ages matrix (M) and the matrix that describes the first three Legendre polynomials. The result, [47.8832 12.2381 -2.3636], is the vector of solutions for the fixed curve for all individuals in the study, representing the intercept and the linear and quadratic coefficients of the equation, respectively.

If $U_a^{T1/2}$ and $U_p^{T1/2}$ are the Cholesky decompositions of the covariance matrices of the random regression coefficients for the additive genetic and permanent environmental effects, respectively, and A is the numerator relationship matrix between individuals, with $A_a^{T1/2}$ the Cholesky decomposition of this relationship matrix, then the vector y (phenotypes) containing the i-th traits (ages) is defined as $y = \beta + A_a^{T1/2} Z_a U_a^{T1/2} + Z_p U_p^{T1/2} + e$.

After the complete data file was simulated, 450 individuals with information were chosen at random and their production records at 84 months were deleted, giving a second file with a 10% loss of information. Subsequently, the same individuals had their production records at 66, 48 and 30 months deleted, giving new data files with 20%, 30% and 40% losses of information, respectively. Information was eliminated by a random process in order to study the efficiency of the methods commonly used in longitudinal data analysis when applied to incomplete data, that is, to study the effect of lost observations.
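The fixed-curve coefficients quoted in the simulation description can be checked numerically. The sketch below assumes that the ages are standardized linearly onto [-1, 1], that the Legendre polynomials are normalized by sqrt((2k + 1)/2), and that the coefficients come from an ordinary least-squares fit to the five age means. None of these three choices is stated explicitly in the text, so treat them as assumptions; the function names are illustrative.

```python
import numpy as np

ages = np.array([12.0, 30.0, 48.0, 66.0, 84.0])                # months
mean_production = np.array([15.14, 26.82, 35.72, 41.84, 45.1])  # age means from the text


def standardize(t):
    """Map ages linearly onto the interval [-1, 1]."""
    return -1.0 + 2.0 * (t - t.min()) / (t.max() - t.min())


def legendre_basis(x):
    """Normalized Legendre polynomials of degree 0, 1 and 2 evaluated at x."""
    p0 = np.ones_like(x)
    p1 = x
    p2 = 0.5 * (3.0 * x**2 - 1.0)
    norm = np.sqrt((2.0 * np.arange(3) + 1.0) / 2.0)  # ASSUMED normalization
    return np.column_stack([p0, p1, p2]) * norm


phi = legendre_basis(standardize(ages))  # 5 ages x 3 coefficients
# ASSUMPTION: the fixed curve is the least-squares fit to the age means.
beta, *_ = np.linalg.lstsq(phi, mean_production, rcond=None)
print(np.round(beta, 3))
```

Under these assumptions the fitted coefficients should land close to [47.8832 12.2381 -2.3636]; small differences in the third or fourth decimal are expected because the age means are given rounded to two decimals.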
The elimination process was then repeated, again starting from the complete data file. In this case, however, the individuals whose information was deleted were those with the lowest phenotypic values, so that the resulting files represent samples under selection. The objective was to study the efficiency of the methods commonly used in longitudinal data analysis when observations are lost through selection.
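The two deletion schemes can be sketched as follows. This is a minimal illustration, not the authors' SAS code: the phenotypes are placeholder random numbers rather than output of the simulation model, and ranking animals by their mean phenotype over all ages is an assumption, because the text does not say which phenotypic value was used to pick the lowest individuals.

```python
import random

AGES = (12, 30, 48, 66, 84)  # months, from the text


def simulate_placeholder_records(n_animals, seed=1):
    """Placeholder phenotypes only; NOT the simulation model of the paper."""
    rng = random.Random(seed)
    return {i: {age: rng.gauss(30.0, 5.0) for age in AGES}
            for i in range(n_animals)}


def pick_animals(records, n_drop, by_selection, seed=2):
    """Choose the animals that will lose records, at random or by selection."""
    ids = sorted(records)
    if by_selection:
        # ASSUMPTION: animals are ranked by their mean phenotype over all ages.
        ids.sort(key=lambda i: sum(records[i].values()) / len(AGES))
        return ids[:n_drop]
    return random.Random(seed).sample(ids, n_drop)


def drop_ages(records, animals, ages_to_drop):
    """Return a copy of the records without the given ages for the chosen animals."""
    chosen = set(animals)
    return {i: {age: y for age, y in rec.items()
                if not (i in chosen and age in ages_to_drop)}
            for i, rec in records.items()}


def loss_fraction(full, reduced):
    """Proportion of records removed relative to the complete file."""
    n_full = sum(len(rec) for rec in full.values())
    n_left = sum(len(rec) for rec in reduced.values())
    return 1.0 - n_left / n_full


records = simulate_placeholder_records(900)
for by_selection in (False, True):
    animals = pick_animals(records, 450, by_selection)
    for k in range(1, 5):
        dropped = set(AGES[-k:])  # 84; then 66 and 84; and so on
        reduced = drop_ages(records, animals, dropped)
        print(by_selection, sorted(dropped),
              round(loss_fraction(records, reduced), 2))
```

With 900 animals and 450 of them losing one more age at each step, the loss fractions are 10, 20, 30 and 40% of the 4,500 records in both schemes; only the choice of animals differs.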
Production data in each situation were analyzed to estimate the (co)variance components and the genetic parameters by different methods. Both the simulation and the manipulation of the data files were carried out with the Statistical Analysis System version 8.1 (SAS 1990, SAS Institute Inc.).
Considering an analysis in which each age is a repeated measurement on the same individual, the repeatability model is described as $y = Xb + Za + Wp + e$, where y is an n x 1 vector of n observations (production); X is the incidence matrix of the local fixed effects and of the age covariable for each individual's production; Z and W are the incidence matrices of the individual's random additive genetic and permanent environmental effects, associated with the vectors a and p of additive genetic and permanent environmental values; and e is the residual vector, with the same dimension as y.
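The parameters estimated under the repeatability model can be written out as follows. These are the usual textbook definitions, stated here as an assumption because the text does not print them; $\sigma^2_a$, $\sigma^2_p$ and $\sigma^2_e$ denote the additive genetic, permanent environmental and temporary environmental variances, and $y_{ij}$ is the record of individual $i$ at age $j$.

```latex
\begin{aligned}
\operatorname{Var}(y_{ij}) &= \sigma^2_a + \sigma^2_p + \sigma^2_e, \\
h^2 &= \frac{\sigma^2_a}{\sigma^2_a + \sigma^2_p + \sigma^2_e}, \\
r &= \frac{\sigma^2_a + \sigma^2_p}{\sigma^2_a + \sigma^2_p + \sigma^2_e}, \\
\operatorname{Cov}(y_{ij}, y_{ij'}) &= \sigma^2_a + \sigma^2_p \quad (j \neq j').
\end{aligned}
```

Because a single genetic value is fitted per individual, the covariance between two records of the same individual is the same for every pair of ages, which is the restriction that the random regression and multi-trait models relax.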
The random regression model considered each age as a point on a continuous trajectory, fitting covariance functions both for the additive genetic effect and for the permanent environmental effect. Both functions used the first three Legendre polynomials, characterizing a second-degree polynomial function. This model can be described as $y = X\beta + Za + Wp + e$, where y is the vector of n observations of production at each age; X is the incidence matrix of the local fixed effects and of the standardized ages, between -1 and +1, that describe the average trajectory of all individuals through the Legendre polynomials; β is the vector of solutions for the local fixed effects and for the fixed regression of all individuals; Z and W are block-diagonal matrices with the standardized ages associated with the random regression coefficients of the additive genetic and permanent environmental random effects for each individual, respectively; a and p are the vectors of random regression coefficients of the additive genetic and permanent environmental effects, respectively, for each individual; and e is the vector of random temporary environmental effects.
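How a covariance function turns a small coefficient matrix into variances at every age can be sketched as below. The matrices `k_a` and `k_p` are made-up illustrative values, not the ones used in the study, and the normalization of the Legendre polynomials is an assumption; only the residual variance of 2.2 is taken from the text. Heritability at each age uses the usual ratio of additive genetic to total variance.

```python
import numpy as np


def legendre_basis(x):
    """Normalized Legendre polynomials of degree 0, 1 and 2 evaluated at x."""
    p0 = np.ones_like(x)
    p1 = x
    p2 = 0.5 * (3.0 * x**2 - 1.0)
    norm = np.sqrt((2.0 * np.arange(3) + 1.0) / 2.0)  # ASSUMED normalization
    return np.column_stack([p0, p1, p2]) * norm


# Standardized ages for 12, 30, 48, 66 and 84 months.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
phi = legendre_basis(x)

# HYPOTHETICAL coefficient covariance matrices, for illustration only.
k_a = np.array([[3.0, 0.5, -0.2],
                [0.5, 1.0, 0.1],
                [-0.2, 0.1, 0.5]])
k_p = np.array([[2.0, 0.3, 0.0],
                [0.3, 0.8, 0.0],
                [0.0, 0.0, 0.4]])
sigma2_e = 2.2  # temporary environmental variance, from the text

g_ages = phi @ k_a @ phi.T  # additive genetic covariances between ages
p_ages = phi @ k_p @ phi.T  # permanent environmental covariances between ages

var_a = np.diag(g_ages)
h2 = var_a / (var_a + np.diag(p_ages) + sigma2_e)  # heritability per age
sd_a = np.sqrt(var_a)
genetic_corr = g_ages / np.outer(sd_a, sd_a)       # genetic correlations between ages

print(np.round(h2, 3))
print(np.round(genetic_corr, 3))
```

Only six parameters per random effect are estimated (the distinct elements of a 3 x 3 matrix), yet the resulting covariance matrix between ages is 5 x 5 and could be evaluated at any intermediate age by adding rows to `phi`.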
In the multiple-trait analysis each age was considered a distinct trait. The model is described as $y = X\beta + Za + e$, where, for the i-th age, $y_i$ is the response variable vector; $X_i$ is the incidence matrix of the local fixed effects; β is the vector of solutions for the levels of the local fixed effects; Z is the incidence matrix of the random effects; and a and e are the additive genetic and residual random effect vectors, respectively. To compare the results from the different data files and strategies of analysis, the estimates of genetic parameters and variances were evaluated. In some cases the likelihood ratio test was used (Rao 1973).
The analyses were all processed with the software DFREML Version 3.0 (Meyer 1998).

RESULTS AND DISCUSSION
With the repeatability model and random loss of information, the heritability estimates were similar between the different ages (Table 1). However, when the loss of information occurred by selection, the additive genetic variance decreased as the level of information loss increased.
The additive genetic variance estimates at each age obtained with random regression models, for the complete data file and for the files with loss of information, with and without selection, are presented in Table 2.
The loss of information produced by the random process did not modify the heritability estimates at any level of information loss (Table 3). According to Dal Zotto (2000), in random regression models the minimum number of observations for each level of the permanent environmental effect must be equal to the number of parameters used to describe the data trajectory plus one. According to Schaeffer and Dekkers (1994), random regression models can also be used when individuals have only one record. The results obtained in this study are in agreement with these authors.
When information was lost by eliminating the individuals with the lowest phenotypic values, the genetic and environmental variance estimates decreased, leading to smaller heritability estimates. With selection, only the random regression heritability estimates at a 10% loss of information were similar to those from the complete data file. Resende et al. (2001), using random regression models to describe the diameter at breast height of one- to seven-year-old Eucalyptus urophylla trees, found that the heritability estimates were very similar to those obtained by single-trait models before three years of age. For more advanced ages the estimates were smaller. The authors argued that single-trait analysis overestimated the parameters because the phenotypic variance was reduced by the death of the least vigorous individuals, a natural selection for population adaptation. Matheson and Raymond (1984), cited by Resende et al. (2001), found heritability estimates of 0.12 and 0.24 in two Pinus radiata populations. After the worst plants were eliminated, the estimates rose to 0.21 and 0.33, respectively. These results were opposite to those obtained in this study, probably because of the sample size, since our data were simulated considering a progeny test with only two generations, the parents and the offspring. Therefore, eliminating the information from the lowest phenotypic values, and consequently from the lowest genotypic values, changed the data structure.

The complete data files were analyzed by two random regression models that differed in the variance of the temporary environmental effect. In the first model the variance was constant across ages, and in the second model each age had a different variance. Comparing the estimates of variance components and heritability, as well as the likelihood function values of the two models (Table 4), the hypothesis of homogeneity was accepted at the 5% significance level, indicating that treating the temporary environmental variances as homogeneous or heterogeneous did not influence the model fit.
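A comparison of this kind can be sketched with a likelihood ratio test. The log-likelihood values below are hypothetical placeholders, not the values of Table 4, and the four degrees of freedom are an assumption that follows from replacing one residual variance with five age-specific variances. To stay within the standard library, the chi-square tail probability uses a closed form that holds only for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented for positive even df only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(logl_reduced, logl_full, df):
    """Return the LRT statistic and its p-value for nested models."""
    stat = 2.0 * (logl_full - logl_reduced)
    return stat, chi2_sf_even_df(stat, df)


# HYPOTHETICAL log-likelihoods: homogeneous (reduced) vs heterogeneous (full).
stat, p_value = likelihood_ratio_test(-5230.4, -5228.1, df=4)
print(f"LRT = {stat:.2f}, p = {p_value:.3f}")
```

With these placeholder values the p-value is above 0.05, so the simpler model with a homogeneous temporary environmental variance would be retained, which mirrors the conclusion drawn in the text.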
When each age was treated as a separate trait and the traits were analyzed in pairs with multi-trait models (two-trait models), the same results were observed. The heritability estimates for the complete and incomplete data files, with and without selection, are presented in Table 5. These values represent the minimum and maximum estimates obtained from the pairwise analyses. Selection reduced variability as the loss of information increased; in other words, this model was very sensitive to the effect of selection. However, all the heritability estimates were similar when the loss of information did not result from a selective process.
In general, regardless of the analysis strategy used (single-trait, multi-trait or random regression models), selection reduced the discrimination among individuals and the variability between them. The genetic variance decreased as a result of the lower variability between individuals, and consequently the heritability estimates were smaller. With a 10% loss of information the heritability values did not change; with a 20% loss they decreased at the last two ages; with a 30% loss they decreased at the last three ages; and with a 40% loss they decreased at the last four ages. However, with a 10% loss of information under selection, the random regression model was the best alternative.
Because the loss of information could change the temporary environmental effects, the data file with a 40% loss of information under selection was also analyzed assuming a heterogeneous variance for the temporary environmental effect at each age. Moreover, the complete data file was also analyzed to check the effect of variance homogeneity for the permanent environmental effect. Comparing the variance component and heritability estimates, and comparing the models through their likelihood function values (Table 6), variance heterogeneity did not change the description of the data variation for either the complete or the selected data file. Therefore, the loss of information did not change the temporary environmental effect.

The multi-trait model at the same level of information loss presented underestimated values at the last age studied. Because the random regression models use covariance functions, which give a continuous covariance structure for the random effects associated with the trait analyzed, it can be stated that they better express the variances of the random effects of the mixed linear model that describes longitudinal data. By considering that the trait under study can change over time, this strategy is more realistic than the repeatability model, and because it uses fewer parameters it is more attractive than multi-trait models, which can be prohibitive in practical applications with a large number of parameters. Schaeffer and Wilton (1998) argued that in some cases, when all the traits are observed on each individual, the heritabilities of the traits are similar and all of them are positively correlated, a multiple-trait analysis would not offer a significant increase in the accuracy of genetic evaluation.

CONCLUSIONS
When the assumption that the correlation between successive measurements on the same individual is equal to one is invalid, random regression models can provide better results in the genetic analysis than repeatability models.
Random regression models yield covariance and genetic parameter estimates similar to those obtained by multiple-trait models while using fewer parameters, which in practice is an advantage.
In small population samples subjected to selection, multi-trait models were more susceptible to selection bias than random regression models, but additional studies are necessary for more conclusive findings.
Considering the continuous nature of the dependent variable, random regression models should be preferred to single-trait models for data files with a high level of information loss.
For the multiple-trait model, $E[y] = X\beta$ and $V[y] = ZGZ' + R$, where $G = A \otimes G_o$ is the additive genetic variance and covariance matrix and $R = I \otimes R_o$, where $R_o$ is the residual variance and covariance matrix.

Table 1. Additive genetic variance, permanent environmental variance, temporary environmental variance, environmental variance, and heritability estimates obtained with the repeatability model, with and without selection

Table 2. Estimates of additive genetic variance at each age for the complete and incomplete data files, with and without selection, obtained with random regression models

Table 3. Estimates of heritability at each age for the complete and incomplete data files, with and without selection, obtained with the random regression model