Principal Component Analysis for outlier detection

pcadapt performs a principal component analysis and computes p-values to test for outliers. The test is based on the correlations between genetic variation and the first K principal components. pcadapt also handles Pool-seq data, for which the statistical analysis is performed on the genetic marker frequencies. Returns an object of class pcadapt.

pcadapt(
  input,
  K = 2,
  method = "mahalanobis",
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

# S3 method for pcadapt_matrix
pcadapt(
  input,
  K = 2,
  method = c("mahalanobis", "componentwise"),
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

# S3 method for pcadapt_bed
pcadapt(
  input,
  K = 2,
  method = c("mahalanobis", "componentwise"),
  min.maf = 0.05,
  ploidy = 2,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol = 1e-04
)

# S3 method for pcadapt_pool
pcadapt(
  input,
  K = (nrow(input) - 1),
  method = "mahalanobis",
  min.maf = 0.05,
  ploidy = NULL,
  LD.clumping = NULL,
  pca.only = FALSE,
  tol
)

Arguments

input: The output of function read.pcadapt.
K: an integer specifying the number of principal components to retain.
method: a character string specifying the method used to compute the p-values. Two statistics are currently available: "mahalanobis" and "componentwise".
min.maf: Threshold of minor allele frequencies above which p-values are computed. Default is 0.05.
ploidy: Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.
LD.clumping: Default is NULL, which disables SNP thinning. To enable SNP thinning, provide a named list with parameters $size and $thr, which correspond respectively to the window radius and the squared correlation threshold. A good default value would be list(size = 500, thr = 0.1).
pca.only: a logical value indicating whether PCA results should be returned (before computing any statistic).
tol: Convergence criterion of RSpectra::svds(). Default is 1e-4.
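
A minimal usage sketch of the arguments above. It assumes that the pcadapt package is installed and that it ships the example file geno3pops.bed under extdata; both are assumptions, and the block does nothing when either is missing. The variable names are illustrative only.

```r
# Illustrative call; skipped quietly when pcadapt or the example file is missing.
have_pcadapt <- requireNamespace("pcadapt", quietly = TRUE)
path_to_file <- if (have_pcadapt) {
  system.file("extdata", "geno3pops.bed", package = "pcadapt")
} else {
  ""
}

if (nzchar(path_to_file)) {
  # The input must be the output of read.pcadapt.
  input <- pcadapt::read.pcadapt(path_to_file, type = "bed")

  # Default Mahalanobis test on the first 2 PCs, without SNP thinning.
  res <- pcadapt::pcadapt(input, K = 2, method = "mahalanobis", min.maf = 0.05)

  # Same test with SNP thinning: window radius 500, squared correlation
  # threshold 0.1.
  res_thin <- pcadapt::pcadapt(input, K = 2,
                               LD.clumping = list(size = 500, thr = 0.1))

  print(class(res))
}
```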

Value

The returned value is an object of class pcadapt.

Details

First, a principal component analysis is performed on the scaled and centered genotype data. Depending on the specified method, different test statistics can be used.

mahalanobis (default): the robust Mahalanobis distance is computed for each genetic marker, using robust estimates of both the mean and the covariance matrix of the K vectors of z-scores.

communality: the communality statistic measures the proportion of variance explained by the first K PCs. Deprecated in version 4.0.0.

componentwise: returns a matrix of z-scores.
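
The steps described above can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the package's implementation: the genotypes are simulated, each SNP is scaled by its binomial standard deviation, base svd() stands in for RSpectra::svds(), the z-scores are obtained by regressing each SNP on the K PCs, and classical estimates of the mean and covariance replace the robust ones.

```r
set.seed(1)
n <- 60; p <- 500; ploidy <- 2; K <- 2

# Toy genotypes: rows are individuals, columns are SNPs.
freq <- runif(p, 0.1, 0.9)
G <- sapply(freq, function(f) rbinom(n, size = ploidy, prob = f))

# Keep SNPs whose minor allele frequency reaches the threshold
# (the ">=" comparison is an assumption of this sketch).
af <- colMeans(G) / ploidy
keep <- pmin(af, 1 - af) >= 0.05
G <- G[, keep, drop = FALSE]
af <- af[keep]

# Center and scale each SNP (binomial standard deviation; an assumption).
X <- sweep(G, 2, ploidy * af, "-")
X <- sweep(X, 2, sqrt(ploidy * af * (1 - af)), "/")

# PCA through an SVD; U holds the first K PC scores (n x K).
U <- svd(X, nu = K, nv = 0)$u

# z-scores: regress every SNP on the K PCs (U has orthonormal columns).
beta <- crossprod(U, X)                    # K x p regression coefficients
resid <- X - U %*% beta
sigma <- sqrt(colSums(resid^2) / (n - K))  # residual standard deviations
z <- t(beta) / sigma                       # one row of K z-scores per SNP

# Squared Mahalanobis distance per SNP, with classical (non-robust) estimates.
stat <- mahalanobis(z, center = colMeans(z), cov = cov(z))
```

With method = "componentwise", the matrix z would be the quantity of interest; with method = "mahalanobis", the vector stat summarizes the K z-scores of each SNP in a single statistic.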

When method="mahalanobis", the test statistics (stat) are divided by a genomic inflation factor (gif) before p-values are computed; the scaled statistics (chi2_stat) should then follow a chi-squared distribution with K degrees of freedom. When method="componentwise", the squared z-scores should follow a chi-squared distribution with 1 degree of freedom. For Pool-seq data, pcadapt provides p-values based on the Mahalanobis distance for each SNP.
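
The p-value computation can be illustrated in base R. The sketch below assumes a median-based genomic inflation factor (the median of the statistics divided by the median of the reference chi-squared distribution), which is a common definition but is not spelled out here; the statistics and z-scores are simulated.

```r
set.seed(7)
K <- 2; p <- 1001

# Toy Mahalanobis statistics, inflated by a factor of 1.5.
stat <- 1.5 * rchisq(p, df = K)

# Genomic inflation factor (assumed median-based definition).
gif <- median(stat) / qchisq(0.5, df = K)

# Scaled statistics and upper-tail chi-squared p-values with K degrees of freedom.
chi2_stat <- stat / gif
pvalues <- pchisq(chi2_stat, df = K, lower.tail = FALSE)

# Componentwise case: squared z-scores against a chi-squared with 1 degree
# of freedom, which matches a two-sided normal test.
z <- rnorm(p)
pvalues_z <- pchisq(z^2, df = 1, lower.tail = FALSE)
```

After scaling, the median of chi2_stat matches the median of the reference distribution, so roughly half of the p-values fall below 0.5 when most markers are neutral.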