Skip to contents

THIS VIGNETTE IS CURRENTLY UNDER DEVELOPMENT, SO ITS CONTENT IS PROVISIONAL.

Introduction

When a statistical association is observed between an external exposure (XX) and an individual outcome (YY), one or more intermediate variables (M) (such as gene expression or epigenetic changes) may mediate this effect. Identifying and assessing the mediating role of these variables in the effect of XX on YY is crucial for deciphering underlying causal mechanisms in epidemiological and clinical research. This process, known as mediation analysis, involves studying mediator variables to define the causal structure between XX and YY. The mediated effect, termed the indirect effect, is equal to the effect of XX on YY mediated through (M), to distinguish from the direct effect of XX on YY unexplained by (M).

The R package hdmax2 is dedicated to high-dimensional mediation analyses. For mediation analyses, the necessary data includes one (or more) exposure variable(s), a matrix of potential mediators, and an outcome variable.

The term “high dimensional” refers to the large quantity of potential mediators among which we seek to identify the actual mediators.

The fundamental concept behind HDMAX2 methods is to use a latent factor mixed regression model for estimating unobserved latent factors while conducting high-dimensional association analysis. HDMAX2 also implements a novel procedure known as the max-squared test to assess the statistical significance of potential mediators. Finally, HDMAX2 enables the calculation of an overall indirect effect from a single model that includes all mediators simultaneously. This approach therefore takes into account correlations between mediators.

The hdmax2 program has been enriched with numerous features, including the ability to manage various types of variables in the exposure (continuous, binary, categorical, and multivariate), as well as the capability to incorporate binary outcomes. This package enables users to:

  • investigate associations between the variables XX, (M), and YY,

  • compute the mediated effect for each potential mediator,

  • assess the overall indirect effect for the total model,

  • and visualize these results.

The package hdmax2 is designed to accept :

  • One or several Exposure variables XX consisting of univariate data, which can be continuous, binary or categorical as well as multivariate exposure. We define nn as the number of samples within the tested cohort. Therefore, each exposure and outcome variable consists of nn measurements. In the R language, categorical variables are encoded as factor objects. The function as.factor() can be used to encode categorical variables. The functions levels() and ordered() can be used to define the order of the modalities of categorical variables. By convention, hdmax2 uses the first modality as a reference to calculate the effects associated with the other modalities of the variable, as encoded by default in statistical regression models in R.

  • Continuous intermediary variables, denoted as M, are represented as a matrix encompassing potential mediators, such as methylome or transcriptome molecular features. The matrix M should be entered as a separated input, without missing values. The intermediary variable matrix M is of dimension n×pn \times p, with pp the total number of intermediate variables.

  • The outcome variable, denoted as YY, corresponds to a vector of dimension nn, which supports both continuous and binary formats. Continuous and binary variables must be encoded in numeric format.

  • Optional covariates, ZZ, can be included as observed adjustment factors in the model. Please refer to the helper function section for insight into how to interpret the additional set of covariates.

  • KK, the number of latent factors to be estimated.

To install the latest version of hdmax2, use the github repository

#devtools::install_github("bcm-uga/hdmax2")

METHOD Example: multivariate exposure and continuous outcome

For this example we use a two variables exposure (continuous and binary), a continuous outcome and two adjustment factors (age and gender).

Simulated dataset

We simulated data for 100 samples and 200 potential mediators.

The matrix of potential mediators is an extract of real methylation data (TCGA PDAC samples).

We define the KK number of estimated latent factors by performing a PCA on potential mediators matrix. The scree plot criterion looks for the “elbow” in the curve and selects all components just before the line flattens out, KK is chosen with this procedure.

simu_data = hdmax2::simu_data
## Number of Latent factore estimation
pc <- prcomp(simu_data$M1)
plot((pc$sdev^2/sum(pc$sdev^2))[1:10],
     type = "b",
     xlab = 'Principal Component',
     ylab = "Explained variance")


K=4 #pca conclusion : it is better to select too many factors that too few

STEP 1: Run association studies

The run_AS function is applied:

  • First to estimate latent factors with lfmmlfmm algorithms.

  • Then to identify significant effects of exposure XX on potential mediators in M matrix, and significant effect of potential mediators on outcome YY.

  • And eventually to compute mediation test: max-squared test.

The run_AS function takes as inputs:

  • XX for exposure: can be a vector, a factor with nn (samples) elements or a data frame with nn rows and 1 column if univariate or xx columns or different exposure variables if multivariate.
  • YY for outcome: can be a vector or a matrix with nn rows and 1 column.
  • M for the potential mediators: must be a matrix with nn rows and pp columns (potential mediators).
  • KK for the number of latent factors defined earlier: must be integer.
  • Adjustment factors covar can be included: must be numeric (data frame or matrix) with nn rows.
## multivariate Exposures (continuous + binary)
X = data.frame(X1 = simu_data$X_continuous, X2 = simu_data$X_binary)
covar = cbind(simu_data$age, simu_data$gender)
covar = as.data.frame(covar)

hdmax2_step1 = hdmax2::run_AS(exposure = X ,
                              outcome = simu_data$Y_continuous,
                              M = simu_data$M2,
                              K = K,
                              covar = covar)

The run_AS function provides an object containing:

  • Results from the two association studies (PP-values, fscores or zscores),
head(hdmax2_step1$AS_1$pval)
#>    cg00005740    cg00006787    cg00007032    cg00008612    cg00009871 
#>  3.794621e-88  6.729132e-65 2.716521e-107  9.031578e-75  2.799630e-91 
#>    cg00009922 
#>  2.167705e-03

head(hdmax2_step1$AS_1$fscores)
#> Response cg00005740.value Response cg00006787.value Response cg00007032.value 
#>               11377.11336                3651.15470               28504.22839 
#> Response cg00008612.value Response cg00009871.value Response cg00009922.value 
#>                5943.10164               13230.32412                  24.55842
  • Latent factors estimation matrix UU (pp rows and KK columns) from first regression,
head(hdmax2_step1$AS_1$U)
#>            [,1]       [,2]       [,3]        [,4]
#> [1,] -6.4407431 -1.0637980 -0.6642124 -0.10179492
#> [2,]  1.1180773  0.7722278  0.3412332  0.03089914
#> [3,]  3.1673367  1.4633086  0.8123724 -0.07885703
#> [4,]  0.7052038  0.8200024  0.4004379 -0.07368424
#> [5,] -4.6385980 -0.7803669 -0.5024476 -0.04347855
#> [6,]  7.3524954  1.4323450  0.7671163 -0.32313944
  • run_AS function’s inputs,

  • And max-squared test PP-values results.

head(hdmax2_step1$max2_pvalues)
#>   cg00005740   cg00006787   cg00007032   cg00008612   cg00009871   cg00009922 
#> 3.930603e-07 1.061875e-02 2.933874e-16 2.099247e-04 1.514348e-12 3.184637e-01

Selection of a subset of mediators

max-squared PP-values are used in the selection of mediators for the user’s chosen method.

Numerous selection methods are available, such as FDR control. Also in the context of methylation data, it’s feasible to to identify aggregated mediator regions (AMR) based on the paired PP-values frome mediation test. Refer to the helper_functions vignette for assistance FDR control and AMR researching procedure.

For the example, we opted for the top ten most significant PP-values from the max-squared test.

## Selecting top 10 mediators
mediators_top10 = simu_data$M2[,names(sort(hdmax2_step1$max2_pvalues)[1:10])]
head(mediators_top10)
#>            cg00025981 cg00019093 cg00031759 cg00022633 cg00007032 cg00009871
#> GSM1051525   2.810690   3.333051   2.681391   2.058144   1.431224   2.355835
#> GSM1051526   7.263145   7.430116   7.058559   5.781962   5.076067   6.516013
#> GSM1051527   8.663460   8.633662   8.428823   6.993801   6.293432   7.910915
#> GSM1051528   7.343681   7.353801   7.121072   5.887332   5.290542   6.760180
#> GSM1051529   3.567254   4.151531   3.400333   2.631943   1.900969   2.915196
#> GSM1051530   9.566706  10.000372   9.228616   7.516322   6.406738   8.090836
#>            cg00028749 cg00014667 cg00005740 cg00012692
#> GSM1051525   2.491659   2.052389   2.648494   1.032899
#> GSM1051526   5.744348   5.498459   7.376805   3.708204
#> GSM1051527   6.707678   6.642283   8.956235   4.604570
#> GSM1051528   5.714551   5.694672   7.643873   3.889778
#> GSM1051529   3.106085   2.499390   3.252984   1.383063
#> GSM1051530   7.654547   6.773123   9.050847   4.678163

STEP 2

The function estimate_effect estimate the individual indirect effect of mediators, but also overall effects of selected mediators.

The function estimate_effect takes as inputs, step 1 object and selected mediators matrix MSM^S from chosen selection method apply on max-squared test PP-values.

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = mediators_top10)

The function estimate_effect use mediation::mediate function to obtain several effects estimation with uncertainty:

  • ACME Average Causal Mediation Effect: corresponding to the indirect effect
# For first exposure variable
head(hdmax2_step2$effects$X1$ACME)
#>           est       CI_2.5     CI_97.5  pval       feat
#> 1  0.28744276  0.187336837  0.38652463 0.000 cg00025981
#> 2  0.16583563  0.127190132  0.20654127 0.000 cg00019093
#> 3  0.15916386  0.062213637  0.25590771 0.002 cg00031759
#> 4  0.10340946  0.008348204  0.20054958 0.030 cg00022633
#> 5 -0.18351938 -0.327133190 -0.04509537 0.006 cg00007032
#> 6  0.02519542 -0.074828002  0.12490948 0.636 cg00009871

# For first exposure variable
head(hdmax2_step2$effects$X2$ACME)
#>           est      CI_2.5     CI_97.5  pval       feat
#> 1  0.28270063  0.18067236  0.38560807 0.000 cg00025981
#> 2  0.16803145  0.12435169  0.21049290 0.000 cg00019093
#> 3  0.16095023  0.06077006  0.25429286 0.000 cg00031759
#> 4  0.10135026  0.01052350  0.19546690 0.032 cg00022633
#> 5 -0.18398511 -0.33110670 -0.02914637 0.020 cg00007032
#> 6  0.02388381 -0.07319260  0.12097263 0.652 cg00009871
  • PM Proportion Mediate: corresponding to the proportion of the total effect that is mediated by the mediator
# For first exposure variable
head(hdmax2_step2$effects$X1$PM)
#>          est     CI_2.5     CI_97.5  pval       feat
#> 1 -0.5989691 -0.8054299 -0.39066639 0.000 cg00025981
#> 2 -0.3457651 -0.4316470 -0.26397633 0.000 cg00019093
#> 3 -0.3305574 -0.5308632 -0.12927000 0.002 cg00031759
#> 4 -0.2182789 -0.4155508 -0.01734332 0.030 cg00022633
#> 5  0.3742992  0.0941027  0.68034763 0.006 cg00007032
#> 6 -0.0501110 -0.2605593  0.15621632 0.636 cg00009871
head(hdmax2_step2$effects$X2$PM)
#>          est      CI_2.5     CI_97.5  pval       feat
#> 1 -0.5929589 -0.80198906 -0.37466460 0.000 cg00025981
#> 2 -0.3518127 -0.43693413 -0.26037291 0.000 cg00019093
#> 3 -0.3427913 -0.53202931 -0.12643011 0.000 cg00031759
#> 4 -0.2126285 -0.40759815 -0.02174242 0.032 cg00022633
#> 5  0.3847555  0.06066514  0.69409402 0.020 cg00007032
#> 6 -0.0517334 -0.25054803  0.15291725 0.652 cg00009871
  • TE total effect: which is equal to the sum of direct and indirect effect
# For first exposure variable
head(hdmax2_step2$effects$X1$TE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.4809432 -0.4871167 -0.4746203    0 cg00025981
#> 2 -0.4805523 -0.4868689 -0.4741050    0 cg00019093
#> 3 -0.4801520 -0.4864263 -0.4735749    0 cg00031759
#> 4 -0.4797061 -0.4860125 -0.4731113    0 cg00022633
#> 5 -0.4793463 -0.4855869 -0.4730238    0 cg00007032
#> 6 -0.4796200 -0.4859869 -0.4736794    0 cg00009871
head(hdmax2_step2$effects$X2$TE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.4807088 -0.4867609 -0.4747383    0 cg00025981
#> 2 -0.4803955 -0.4866038 -0.4742374    0 cg00019093
#> 3 -0.4798673 -0.4861947 -0.4738398    0 cg00031759
#> 4 -0.4793114 -0.4851246 -0.4732602    0 cg00022633
#> 5 -0.4790902 -0.4847672 -0.4730059    0 cg00007032
#> 6 -0.4792182 -0.4848416 -0.4736264    0 cg00009871
  • ADE Average Direct Effect: which represents the unmediated effect.
# For first exposure variable
head(hdmax2_step2$effects$X1$ADE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.7683860 -0.8681962 -0.6668683    0 cg00025981
#> 2 -0.6463879 -0.6860880 -0.6058558    0 cg00019093
#> 3 -0.6393159 -0.7373529 -0.5430835    0 cg00031759
#> 4 -0.5831156 -0.6806418 -0.4871739    0 cg00022633
#> 5 -0.2958269 -0.4341160 -0.1524207    0 cg00007032
#> 6 -0.5048154 -0.6047025 -0.4038993    0 cg00009871
head(hdmax2_step2$effects$X2$ADE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.7634094 -0.8643089 -0.6616243    0 cg00025981
#> 2 -0.6484270 -0.6908482 -0.6031792    0 cg00019093
#> 3 -0.6408175 -0.7359112 -0.5410028    0 cg00031759
#> 4 -0.5806617 -0.6733152 -0.4890897    0 cg00022633
#> 5 -0.2951051 -0.4491536 -0.1459359    0 cg00007032
#> 6 -0.5031020 -0.6006787 -0.4047876    0 cg00009871

This step also compute Overall effects :

  • OIE (Indirect effect): corresponding to the sum of the indirect effect associated with all mediators.
# Estimate OIE for first exposure variable
head(hdmax2_step2$effects$X1$oie)
#> [1] -0.027217534  0.096574902  0.104392378  0.005929419 -0.009328689
#> [6] -0.035804358
# OIE median for first exposure variable
hdmax2_step2$effects$X1$oie_med
#> [1] -0.03725154
# OIE standard deviation for first exposure variable
hdmax2_step2$effects$X1$oie_sd
#> [1] 0.09416536
  • ODE (Direct Effect): corresponding to the effect of exposure variables on the outcome variable.
hdmax2_step2$effects$X1$ode
#>      Estimate    Std. Error       t value      Pr(>|t|) 
#> -4.432670e-01  9.192462e-02 -4.822070e+00  6.354078e-06
  • OTE (Total Effect): corresponding to the effect of exposure variables on the outcome variable when the mediators MSM^S are included in the model.
hdmax2_step2$effects$X1$ote
#>       Estimate     Std. Error        t value       Pr(>|t|) 
#>  -4.795847e-01   3.166450e-03  -1.514582e+02  4.007925e-113

In addition, function estimate_effect estimates the intermediary effect sizes aja_j and bjb_j and their standard deviations.

head(hdmax2_step2$effects$X1$xm)
#>   Estimate  Std.Error   t.Value        pValue       feat
#> 1 2.711291 0.01497632 181.03849 3.517681e-118 cg00025981
#> 2 1.479219 0.01817362  81.39377  8.455368e-87 cg00019093
#> 3 2.796769 0.01766869 158.28953 6.864229e-113 cg00031759
#> 4 2.716945 0.01906228 142.52990 9.221372e-109 cg00022633
#> 5 3.463424 0.01598202 216.70758 2.852492e-125 cg00007032
#> 6 3.976139 0.02683199 148.18650 2.711328e-110 cg00009871
head(hdmax2_step2$effects$X1$my)
#>       Estimate  Std.Error    t.Value       pValue       feat
#> 1  0.104501476 0.01954063  5.3479068 6.453806e-07 cg00025981
#> 2  0.112511112 0.01484701  7.5780296 2.691362e-11 cg00019093
#> 3  0.056823617 0.01772318  3.2061759 1.850305e-03 cg00031759
#> 4  0.037778407 0.01789985  2.1105431 3.752480e-02 cg00022633
#> 5 -0.052947915 0.02102400 -2.5184506 1.351468e-02 cg00007032
#> 6  0.006157449 0.01290205  0.4772458 6.343189e-01 cg00009871

Vizualisation of results

We propose a set of plots including:

  • Mediators ACME Forest plot

  • Mediators PM Forest plot

  • Comparison of ODE, OIE and OTE

  • Mediators effect size representation

library(ggplot2)
hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")
#> [1] "hdmax2 plot for multivariate exposome"

#> $X1
#> TableGrob (2 x 2) "arrange": 4 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]
#> 3 3 (2-2,1-1) arrange gtable[layout]
#> 4 4 (2-2,2-2) arrange gtable[layout]
#> 
#> $X2
#> TableGrob (2 x 2) "arrange": 4 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]
#> 3 3 (2-2,1-1) arrange gtable[layout]
#> 4 4 (2-2,2-2) arrange gtable[layout]
  • A Estimates of indirect effect (ACME) and B proportions of mediated effect (PM) for the top 10 mediators. The effect estimate is represented by a dot and its 95% CI by the bar. Symbols correspond to the significance cut off of 5% (square for p-value 0.05\geq 0.05, circle p-value <0.05< 0.05). Colors correspond to the sign of the effect (green for estimated effect 0\leq 0 , red for estimated effect >0> 0).

  • C Effect sizes of Overall Direct Effect (ODE), Overall Indirect Effect (OIE) and Overall Total Effect (OTE). Error bars correspond to standard deviation (ODE and OTE) or confidence interval (OIE).

  • D Indirect effect sizes for the selected mediators. Black corresponds to the ACME, violet to the effect of exposure XX on mediator M, and blue corresponds to the effect of mediator M on outcome YY.

In the plot_hdmax2 function it is possible to produce the 4-plots set or each individual plot with plot_type argument.

Helper Functions

In this vignette, we also provide a series of helper function to process the data.

How to analyse agregated methylated regions ?

We simulated 100 samples and 200 potential DNA methylation mediators, with various a binary exposure (smoking status of mothers) and continuous outcomes (birth weight).

Identifying aggregated mediator regions (AMR)

Identify Aggregated Methylated Regions (AMR) with AMR_search function which uses from the PP-values from max-squared test compute in run_AS function and using a adapted comb-p method (comb-p is a tool that manipulates BED files of possibly irregularly spaced PP-values and calculates auto-correlation, combines adjacent PP-values, performs false discovery adjustment, finds regions of enrichment and assigns significance to those regions). AMR identification could be useful in Epigenomic Wide Association Studies (EWAS) when single CpG mediation is unsuccessful, or could be complementary analysis.

helper_ex = hdmax2::helper_ex

pc <- prcomp(helper_ex$methylation)
plot((pc$sdev^2/sum(pc$sdev^2))[1:15],
     xlab = 'Principal Component',
     ylab = "Explained variance",
     col = c(rep(1, 3), 2, rep(1, 16)))

# chosen number of dimension
K=5

## run hdmax2 step1
hdmax2_step1 = hdmax2::run_AS(exposure = helper_ex$exposure,
                              outcome = helper_ex$phenotype,
                              M = helper_ex$methylation, 
                              K = K)

##Detecting AMR
seed = 0.7 #Careful to change this parameter when working with real data
res.amr_search = hdmax2::AMR_search(chr = helper_ex$annotation$chr,
                                    start = helper_ex$annotation$start,
                                    end = helper_ex$annotation$end,
                                    pval = hdmax2_step1$max2_pvalues,
                                    cpg = helper_ex$annotation$cpg,
                                    seed = seed, 
                                    nCores = 2)

res.amr_search$res
#>   chr  start    end          p        fdr
#> 3   1 864893 865027 0.03604853 0.05184658
#> 2   1 858380 858381 0.03901609 0.05184658
#> 4   1 886431 886553 0.04077365 0.05184658
#> 5   1 896773 896774 0.05000351 0.05184658
#> 1   1 854778 855056 0.05184658 0.05184658

res.arm_build = hdmax2::AMR_build(res.amr_search, 
                                  methylation = helper_ex$methylation, 
                                  nb_cpg = 2)

#List of DMR selected
head(res.arm_build$res)
#>    DMR chr  start    end          p        fdr nb
#> 1 DMR1   1 886431 886553 0.04077365 0.05184658  3
#> 2 DMR2   1 854778 855056 0.05184658 0.05184658  6

##CpG in the DMR
res.arm_build$CpG_for_each_AMR
#> $DMR1
#> [1] "cg20322444" "cg07983659" "cg11733071"
#> 
#> $DMR2
#> [1] "cg08029603" "cg16347928" "cg22699361" "cg03890988" "cg21786289"
#> [6] "cg13897241"

Quantifying indirect effects

Like with single mediators analysis esimate_effect function could be use to estimate different effects of AMR. Also plot_hdmax2 can be applied to step2 results.

## run hdmax2 step2
object = hdmax2_step1
m = as.matrix(res.arm_build$AMR_mean)

## selected mediators effects estimation

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = m)
library(ggplot2)
hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")
#> [1] "hdmax2 plot for univariate exposome"

#> TableGrob (2 x 2) "arrange": 4 grobs
#>   z     cells    name           grob
#> 1 1 (1-1,1-1) arrange gtable[layout]
#> 2 2 (1-1,2-2) arrange gtable[layout]
#> 3 3 (2-2,1-1) arrange gtable[layout]
#> 4 4 (2-2,2-2) arrange gtable[layout]

How to select potential mediators using q-values and FDR control ?

Several methods can be use to select potential mediators from step1 results of our method, in our main use cases we simply select top 10 mediators to simplify the narrative. Among available methods to select mediators from mediation test PP-values, we can use FDR (False Discovery Rate) control. QQ-value is obtained from max-squared test PP-values from step 1 with hdmax2::hdmax2_qvalue function then bounded by chosen threshold.

## run hdmax2 step1
hdmax2_step1 = hdmax2::run_AS(exposure = simu_data$X_binary,
                 outcome = simu_data$Y_continuous,
                 M = simu_data$M1, 
                 K = K)

## Select candidate mediator  
q_values = hdmax2::hdmax2_qvalue(hdmax2_step1$max2_pvalues)
candidate_mediator <- q_values$qv[q_values$qv<= 0.2] # In this example, we will consider FDR levels <20%.
candidate_mediator
#>   cg00019093   cg00025981 
#> 2.175678e-06 1.143605e-02

How to transform a categorial variable into an ordered factor?

When categorical variables are used as exposure variables, hdmax2 uses the first category as a reference (intercept) to calculate the effects associated with the variable’s other categories. The functions HDMAX2::run_AS and HDMAX2::estimate_effect will transform the character vector you have used (if any) into a factor, with an arbitrary ordering of the categories. If order is important to you, here’s a simple way to turn your character vector into a factor and order the categories as you wish. You can then use this variable as input to the hdmax2 functions.

# inverse alphabetical order for example
X = simu_data$X_categorial
X = factor(X, ordered = TRUE, levels = c("D","C","B","A"))

How to handle adding an additional set of covariates to the second association study?

It is possible to add a second adjustment factors set in association study between potential mediators and outcome if it makes sense from a biological standpoint. Nevertheless, this additional set of adjustment factors for the second association study must include the adjustment factors from the first association study as design in run_AS function.

# First adjustment factors set
covar = as.data.frame(helper_ex$covariables[,1:2])
# Second adjustment factors set
suppl_covar = as.data.frame(helper_ex$covariables[,3:4])

hdmax2_step1 = hdmax2::run_AS(exposure = helper_ex$exposure,
                              outcome = helper_ex$phenotype,
                              M = helper_ex$methylation,
                              K = K,
                              covar = covar,
                              suppl_covar = suppl_covar)