Estimate ancestry coefficients and run genome scans for selection

tess3 is the main function of the tess3r package. It runs a graph-based nonnegative matrix factorization algorithm that includes geographic data in the estimation of spatial population structure. The function requires individual genotypes, geographic coordinates, and it can be run for multiple values of the number of ancestral populations. In addition, the function uses the estimates of ancestry coefficients to compute an Fst statistic for each locus, and to return test significance values for a null hypothesis of selective neutrality. See the references for more details.

tess3(X, XProba = NULL, coord, K, ploidy, lambda = 1, rep = 1, W = NULL,
  method = "projected.ls", max.iteration = 200, tolerance = 1e-05,
  openMP.core.num = 1, Q.init = NULL, mask = 0, algo.copy = TRUE,
  keep = "best", verbose = FALSE)

Arguments

X	a matrix of individual genotypes. This matrix must have \(n\) rows and \(L\) columns where \(n\) is the number of individuals and \(L\) is the number of loci. The entries of this matrix are integers between 0 and ploidy, that correspond to the number of derived/reference alleles observed at each locus. If `NULL`, the matrix of genotype likelihood, `XProba`, is used.
XProba	A matrix which contains individual genotype likelihoods (probabilities) for each locus. This matrix must contain \(n\) rows and \((ploidy + 1)L\) columns where \(n\) is the number of individuals, and \(L\) is the number of loci. The entries of this matrix are numeric values ranging between 0 and 1, and corresponding to genotype probabilities for each locus. If `NULL`, deterministic values are computed from the genotypic matrix `X`. See the references for more details.
coord	a matrix of size \(n \times 2\) where \(n\) is the number of individuals encoding the longitude and latitude of each individual (numeric system).
K	an integer or a vector of integers corresponding to the number of ancestral populations.
ploidy	an integer corresponding to ploidy of the studied organism. Haploids have ploidy = 1, diploids have ploidy = 2, etc.
lambda	a numeric value for the spatial regularization parameter. The default value lambda = 1 attributes equal weights to the loss function and to the penalty function.
rep	integer. The number of time the algorithm will be repeated for each value of `K`.
W	a matrix which corresponds to the graph weightings. If NULL, W is computed as `W[i,j] = exp(-(coord[i] - coord[j])^2 / sigma^2)`, where `coord[i]` represents the geographic coordinates for individual i, and where `sigma` is equal to 5 percent of the average geographic distance between individuals.
method	a character string `"projected.ls"` or `"qp"`. If `"projected.ls"`, an alternating projected least squares algorithm is used. If `"qp"`, an alternating quadratic programing algorithm is used. See references for more details
max.iteration	the maximum number of iterations of the optimization algorithm.
tolerance	a numeric value corresponding to the stopping criteria of the optimization algorithm.
openMP.core.num	integer representing the number of cores used by the optimization algorithm. It requires that the openMP library is installed in your OS (default for macOS is no).
Q.init	a matrix for initial values of ancestry coefficients for the algorithm. The default value is a random matrix.
mask	If not `NULL`, this numeric value is the proportion of masked data when computing the cross-validation criterion.
algo.copy	boolean. If TRUE data is copied to speed up the algorithm.
keep	If `"best"`, only the result with the lowest `rmse` score will be kept for each value of `K`. If `"all"`, all results will be kept and returned for each value of `K`. The second option uses more space in memory.
verbose	If `TRUE` more information is printed.

Value

An object of class tess3 which corresponds to a list of length length(K). Each element of this list has the following attributes

K: the number of ancestral populations
tess3.run: if keep = "best", the tess3Main result with the lowest value of the rmse (loss) function. If keep = "all", a list of tess3Main results for each repetition
rmse: root mean squared error between the genotypic matrix XProba and the fitted matrix for each program repetition
crossentropy: cross-entropy between the genotypic matrix XProba and the fitted matrix for each program repetition
crossvalid.rmse: root square mean error between the masked values of genotypic matrix XProba[masked] and their fitted values for each repetition. If mask is FALSE, then NULL.
crossvalid.crossentropy: cross-entropy between the masked values of genotypic matrix XProba[masked] and their fitted values for each repetition. If mask is FALSE, then NULL.

Methods available for this class:

References

http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12471/full Caye, Kevin et al. (2016) Fast Inference of Individual Admixture Coefficients Using Geographic Data. bioRxiv doi:10.1101/080291. http://biorxiv.org/content/early/2016/10/12/080291

Examples

library(tess3r)

# Arabidopsis thaliana data set
data(data.at)
genotype <- data.at$X
coordinates <- data.at$coord

# Running the tess3 function
tess3.obj <- tess3(X = genotype, coord = coordinates, K = 1:4,
                   method = "projected.ls",
                   ploidy = 1)
#> == Computing spectral decomposition of graph laplacian matrix: done
#> ==Main loop with 1 threads: done
#> == Computing spectral decomposition of graph laplacian matrix: done
#> ==Main loop with 1 threads: done
#> == Computing spectral decomposition of graph laplacian matrix: done
#> ==Main loop with 1 threads: done
#> == Computing spectral decomposition of graph laplacian matrix: done
#> ==Main loop with 1 threads: done

# Plot error
plot(tess3.obj, pch = 19, col = "blue",
     xlab = "Number of ancestral populations",
     ylab = "Cross-validation score")

# Retrieve the Q-matrix for K = 3 clusters
q.matrix <- qmatrix(tess3.obj, K = 3)

## STRUCTURE-like barplot for the Q-matrix
barplot(q.matrix, border = NA, space = 0,
       xlab = "Individuals", ylab = "Ancestry proportions",
       main = "Ancestry matrix") -> bp
#> Use CreatePalette() to define color palettes.
axis(1, at = 1:nrow(q.matrix), labels = bp$order, las = 3, cex.axis = .4)

## Spatial interpolation of ancestry coefficient
my.colors <- c("tomato", "orange", "lightblue")
my.palette <- CreatePalette(my.colors, 9)
plot(q.matrix, coordinates, method = "map.max", interpol = FieldsKrigModel(10),
     main = "Ancestry coefficients",
     xlab = "Longitude", ylab = "Latitude",
     resolution = c(500,500), cex = .4,
     col.palette = my.palette)
#> This function required to attach maps namespace.

## Genome scan p-values for K = 3
p.values <- pvalue(tess3.obj, K = 3)
hist(p.values, col = "lightblue")

## Manhatan plot
plot(p.values, main = "Manhattan plot",
    xlab = "Locus id",
    ylab = "-log10(P-values)",
    cex = .3, col = "grey")

Estimate ancestry coefficients and run genome scans for selection

Arguments

Value

References

See also

Examples

Contents