Skip to contents

Introduction

When a statistical association is observed between an external exposure (XX) and an individual outcome (YY), one or more intermediate variables (M) (such as gene expression or epigenetic changes) may mediate this effect. Identifying and assessing the mediating role of these variables in the effect of XX on YY is crucial for deciphering underlying causal mechanisms in epidemiological and clinical research. This process, known as mediation analysis, involves studying mediator variables to define the causal structure between XX and YY. The mediated effect, termed the indirect effect, is equal to the effect of XX on YY mediated through (M), to distinguish from the direct effect of XX on YY unexplained by (M).

The R package hdmax2 is dedicated to high-dimensional mediation analyses. For mediation analyses, the necessary data includes one (or more) exposure variable(s), a matrix of potential mediators, and an outcome variable.

The term “high dimensional” refers to the large quantity of potential mediators among which we seek to identify the actual mediators.

The fundamental concept behind HDMAX2 methods is to use a latent factor mixed regression model for estimating unobserved latent factors while conducting high-dimensional association analysis. HDMAX2 also implements a novel procedure known as the max-squared test to assess the statistical significance of potential mediators. Finally, HDMAX2 enables the calculation of an overall indirect effect from a single model that includes all mediators simultaneously. This approach therefore takes into account correlations between mediators.

The hdmax2 program has been enriched with numerous features, including the ability to manage various types of variables in the exposure (continuous, binary, categorical, and multivariate), as well as the capability to incorporate binary outcomes. This package enables users to:

  • investigate associations between the variables XX, (M), and YY,

  • compute the mediated effect for each potential mediator,

  • assess the overall indirect effect for the total model,

  • and visualize these results.

The package hdmax2 is designed to accept :

  • One or several Exposure variables XX consisting of univariate data, which can be continuous, binary or categorical as well as multivariate exposure. We define nn as the number of samples within the tested cohort. Therefore, each exposure and outcome variable consists of nn measurements. In the R language, categorical variables are encoded as factor objects. The function as.factor() can be used to encode categorical variables. The functions levels() and ordered() can be used to define the order of the modalities of categorical variables. By convention, hdmax2 uses the first modality as a reference to calculate the effects associated with the other modalities of the variable, as encoded by default in statistical regression models in R.

  • Continuous intermediary variables, denoted as M, are represented as a matrix encompassing potential mediators, such as methylome or transcriptome molecular features. The matrix M should be entered as a separated input, without missing values. The intermediary variable matrix M is of dimension n×pn \times p, with pp the total number of intermediate variables.

  • The outcome variable, denoted as YY, corresponds to a vector of dimension nn, which supports both continuous and binary formats. Continuous and binary variables must be encoded in numeric format.

  • Optional covariates, ZZ, can be included as observed adjustment factors in the model. Please refer to the helper function section for insight into how to interpret the additional set of covariates.

  • KK, the number of latent factors to be estimated.

Example (simulation) : multivariate exposure and continuous outcome

For this example we use a two variables exposure (continuous and binary), a continuous outcome and two adjustment factors (age and gender).

We simulated data for 100 samples and 1000 potential mediators.

The matrix of potential mediators is an extract of real methylation data (TCGA PDAC samples).

We define the KK number of estimated latent factors by performing a PCA on potential mediators matrix. Cattell’s rule is applied to the eigenvalues of PCA. The scree plot criterion looks for the “elbow” in the curve and selects all components just before the line flattens out, KK is chosen with this procedure.

simu_data = hdmax2::simu_data
## Number of Latent factor estimation
pc <- prcomp(simu_data$M1)
plot((pc$sdev^2/sum(pc$sdev^2))[1:10],
     type = "b",
     xlab = 'Principal Component',
     ylab = "Explained variance")


K=4 #pca conclusion : it is better to select too many factors than too few

STEP 1: Run association studies

The run_AS function is applied:

  • First to estimate latent factors with lfmmlfmm algorithms.

  • Then to identify significant effects of exposure XX on potential mediators in M matrix, and significant effect of potential mediators on outcome YY.

  • And eventually to compute mediation test: max-squared test.

The run_AS function takes as inputs:

  • XX for exposure: can be a vector, a factor with nn (samples) elements or a data frame with nn rows and 1 column if univariate or xx columns or different exposure variables if multivariate.
  • YY for outcome: can be a vector or a matrix with nn rows and 1 column.
  • M for the potential mediators: must be a matrix with nn rows and pp columns (potential mediators).
  • KK for the number of latent factors defined earlier: must be integer.
  • Adjustment factors covar can be included: must be numeric (data frame or matrix) with nn rows.
## multivariate Exposures (continuous + binary)
X = data.frame(X_continuous = simu_data$X_continuous, X_binary = simu_data$X_binary)
covar = cbind(simu_data$age, simu_data$gender)
covar = as.data.frame(covar)

hdmax2_step1 = hdmax2::run_AS(exposure = X ,
                              outcome = simu_data$Y_continuous,
                              M = simu_data$M2,
                              K = K,
                              covar = covar)

The run_AS function provides an object containing:

  • Results from the two association studies (PP-values, fscores or zscores),
head(hdmax2_step1$AS_1$pval)
#>    cg00005740    cg00006787    cg00007032    cg00008612    cg00009871 
#> 6.385434e-110  7.557479e-86 3.515170e-127  5.140365e-98 8.322935e-110 
#>    cg00009922 
#>  7.946406e-11
head(hdmax2_step1$AS_2$pval)
#>   cg00005740   cg00006787   cg00007032   cg00008612   cg00009871   cg00009922 
#> 6.540743e-04 1.186044e-01 7.450322e-08 1.549392e-07 1.007703e-04 5.130813e-01
head(hdmax2_step1$max2_pvalues)
#>   cg00005740   cg00006787   cg00007032   cg00008612   cg00009871   cg00009922 
#> 4.278131e-07 1.406700e-02 5.550730e-15 2.400616e-14 1.015465e-08 2.632524e-01

At this stage, we can use the figure below to assess two assumptions: panel A evaluates the independence of the two p-value vectors from two association studies (showing weak correlation), while panel B examines the uniformity of max2 test p-values (except for values near 0).

pval1 = hdmax2_step1$AS_1$pval
pval2 = hdmax2_step1$AS_2$pval
cor_value = cor(pval1,pval2)
p1 = ggplot(data = data.frame(x = pval1, y = pval2), aes(x = pval1, y = pval2)) +
  geom_bin2d(bins = 50) + 
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      guide = guide_colorbar(title = paste("correlation\n = ",round(cor_value,3),"\n \n Count"), 
                                             title.position = "top")) + 
  theme_minimal() +
  labs(title = "A",
       x = "1srt regresssion (M ~ X) pvalues", y = "2nd regresssion (Y ~ M + X) pvalues")

data_hdmax2_step1 <- data.frame(pvalues = hdmax2_step1$max2_pvalues)

p2 = ggplot(data_hdmax2_step1, aes(x = pvalues)) + geom_histogram(bins = 100, color = "black", fill = "gray")  + labs(title = "B", x = "Max2 test pvalues distribution")

gridExtra::grid.arrange( p1, p2, ncol = 2, top = grid::textGrob("Hypothesis verification", gp = grid::gpar(fontsize = 14, fontface = "bold")))

  • Latent factors estimation matrix UU (nn rows and KK columns) from first regression,
head(hdmax2_step1$AS_1$U)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,] -12.585133 -1.1510431  0.3444928  0.1350501
#> [2,]   3.406759  0.5755978 -0.7455638 -0.5796212
#> [3,]   7.942628  1.7442171 -1.0279125 -0.5658921
#> [4,]   2.884806  0.8904630 -0.3638173 -0.5666794
#> [5,]  -9.200969 -0.9469759  0.3270254  0.1069010
#> [6,]  14.865324  1.9505182 -1.3129179 -0.9547634
  • run_AS function’s inputs,

  • And max-squared test PP-values results.

head(hdmax2_step1$max2_pvalues)
#>   cg00005740   cg00006787   cg00007032   cg00008612   cg00009871   cg00009922 
#> 4.278131e-07 1.406700e-02 5.550730e-15 2.400616e-14 1.015465e-08 2.632524e-01

Selection of a subset of mediators

max-squared PP-values are used in the selection of mediators for the user’s chosen method.

Numerous selection methods are available, such as FDR control. Also in the context of methylation data, it’s feasible to to identify aggregated mediator regions (AMR) based on the paired PP-values frome mediation test. Refer to the helper_functions vignette for assistance FDR control and AMR researching procedure.

For the example, we opted for the top ten most significant PP-values from the max-squared test.

## Selecting top 10 mediators
mediators_top10 = simu_data$M2[,names(sort(hdmax2_step1$max2_pvalues)[1:10])]
head(mediators_top10)
#>            cg00025981 cg00075768 cg00019093 cg00066729 cg00031759 cg00084432
#> GSM1051525   1.685693   2.218030   1.557068   2.430756   1.691808   3.780360
#> GSM1051526   6.161752   5.784076   5.691395   5.489877   6.089739   8.069034
#> GSM1051527   7.679215   6.889730   7.079877   6.397163   7.563050   9.386660
#> GSM1051528   6.431278   5.836811   5.913431   5.450084   6.318493   8.108019
#> GSM1051529   2.266971   2.807172   2.098831   3.032556   2.256564   4.550265
#> GSM1051530   7.782111   7.576316   7.183111   7.398095   7.658832  10.399671
#>            cg00076854 cg00007032 cg00008612 cg00022633
#> GSM1051525   1.966973   1.431224   2.318326   1.459519
#> GSM1051526   4.500903   5.076067   6.277443   5.195897
#> GSM1051527   5.262907   6.293432   7.585078   6.470071
#> GSM1051528   4.485116   5.290542   6.481324   5.401831
#> GSM1051529   2.456586   1.900969   2.779885   1.940046
#> GSM1051530   5.998859   6.406738   7.633106   6.566717

STEP 2

The function estimate_effect estimate the individual indirect effect of mediators, but also overall effects of selected mediators.

The function estimate_effect takes as inputs, step 1 object and selected mediators matrix MSM^S from chosen selection method apply on max-squared test PP-values.

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = mediators_top10)

The function estimate_effect use mediation::mediate function to obtain several effects estimation with uncertainty:

  • ACME Average Causal Mediation Effect: corresponding to the indirect effect
# For first exposure variable
head(hdmax2_step2$effects$X_continuous$ACME)
#>           est      CI_2.5    CI_97.5  pval       feat
#> 1 -0.01304356 -0.15316146 0.12248337 0.852 cg00025981
#> 2  0.10258765  0.04528690 0.15940566 0.000 cg00075768
#> 3 -0.08542741 -0.20593479 0.04302218 0.190 cg00019093
#> 4  0.06638600  0.04034929 0.09045835 0.000 cg00066729
#> 5 -0.02710839 -0.12544108 0.07477326 0.622 cg00031759
#> 6  0.06127123  0.03241881 0.09050522 0.000 cg00084432

# For second exposure variable
head(hdmax2_step2$effects$X_binary$ACME)
#>           est      CI_2.5    CI_97.5  pval       feat
#> 1 -0.01370498 -0.14358649 0.12674401 0.810 cg00025981
#> 2  0.10317822  0.04470301 0.15996852 0.000 cg00075768
#> 3 -0.08300970 -0.20413818 0.02548981 0.142 cg00019093
#> 4  0.06761704  0.04402782 0.09198780 0.000 cg00066729
#> 5 -0.02891196 -0.11859991 0.07052789 0.526 cg00031759
#> 6  0.06123107  0.03364058 0.08993683 0.000 cg00084432
  • PM Proportion Mediate: corresponding to the proportion of the total effect that is mediated by the mediator
# For first exposure variable
head(hdmax2_step2$effects$X_continuous$PM)
#>           est      CI_2.5     CI_97.5  pval       feat
#> 1  0.02057927 -0.20167979  0.25053008 0.852 cg00025981
#> 2 -0.17020911 -0.26168036 -0.07453482 0.000 cg00075768
#> 3  0.14124496 -0.07076453  0.33918020 0.190 cg00019093
#> 4 -0.10935662 -0.14857499 -0.06645359 0.000 cg00066729
#> 5  0.04263660 -0.12300266  0.20691685 0.622 cg00031759
#> 6 -0.10116310 -0.14832064 -0.05340975 0.000 cg00084432
head(hdmax2_step2$effects$X_binary$PM)
#>          est     CI_2.5     CI_97.5  pval       feat
#> 1  0.0261049 -0.2091815  0.23730024 0.810 cg00025981
#> 2 -0.1705253 -0.2638407 -0.07376784 0.000 cg00075768
#> 3  0.1337503 -0.0420255  0.33654578 0.142 cg00019093
#> 4 -0.1112690 -0.1518981 -0.07258813 0.000 cg00066729
#> 5  0.0490334 -0.1162687  0.19556855 0.526 cg00031759
#> 6 -0.1009399 -0.1484695 -0.05547531 0.000 cg00084432
  • TE total effect: which is equal to the sum of direct and indirect effect
# For first exposure variable
head(hdmax2_step2$effects$X_continuous$TE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.6080202 -0.6109257 -0.6053362    0 cg00025981
#> 2 -0.6079681 -0.6107435 -0.6052486    0 cg00075768
#> 3 -0.6079451 -0.6108141 -0.6050691    0 cg00019093
#> 4 -0.6081180 -0.6109370 -0.6052852    0 cg00066729
#> 5 -0.6078970 -0.6107084 -0.6049778    0 cg00031759
#> 6 -0.6080090 -0.6107905 -0.6052484    0 cg00084432
head(hdmax2_step2$effects$X_binary$TE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.6066297 -0.6088581 -0.6043464    0 cg00025981
#> 2 -0.6066024 -0.6089089 -0.6043313    0 cg00075768
#> 3 -0.6065822 -0.6087380 -0.6041756    0 cg00019093
#> 4 -0.6065998 -0.6087729 -0.6041996    0 cg00066729
#> 5 -0.6066375 -0.6088579 -0.6043354    0 cg00031759
#> 6 -0.6066125 -0.6089220 -0.6044071    0 cg00084432
  • ADE Average Direct Effect: which represents the unmediated effect.
# For first exposure variable
head(hdmax2_step2$effects$X_continuous$ADE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.5949766 -0.7302751 -0.4551744    0 cg00025981
#> 2 -0.7105557 -0.7670859 -0.6528410    0 cg00075768
#> 3 -0.5225177 -0.6508185 -0.4018631    0 cg00019093
#> 4 -0.6745040 -0.6983064 -0.6485477    0 cg00066729
#> 5 -0.5807886 -0.6830120 -0.4813271    0 cg00031759
#> 6 -0.6692802 -0.6988777 -0.6400616    0 cg00084432
head(hdmax2_step2$effects$X_binary$ADE)
#>          est     CI_2.5    CI_97.5 pval       feat
#> 1 -0.5929247 -0.7332082 -0.4614971    0 cg00025981
#> 2 -0.7097806 -0.7661822 -0.6506865    0 cg00075768
#> 3 -0.5235725 -0.6319352 -0.4024419    0 cg00019093
#> 4 -0.6742169 -0.6986354 -0.6506313    0 cg00066729
#> 5 -0.5777255 -0.6770343 -0.4877155    0 cg00031759
#> 6 -0.6678435 -0.6967012 -0.6400349    0 cg00084432

This step also compute Overall effects :

  • OIE (Indirect effect): corresponding to the sum of the indirect effect associated with all mediators.
# Estimate OIE for first exposure variable
head(hdmax2_step2$effects$X_continuous$oie)
#> [1] -0.108760741 -0.307064060 -0.006280289 -0.110997644 -0.026805343
#> [6] -0.175713553
# OIE median for first exposure variable
hdmax2_step2$effects$X_continuous$oie_med
#> [1] -0.1693495
# OIE standard deviation for first exposure variable
hdmax2_step2$effects$X_continuous$oie_sd
#> [1] 0.09119682
  • ODE (Direct Effect): corresponding to the effect of exposure variables on the outcome variable.
hdmax2_step2$effects$X_continuous$ode
#>      Estimate    Std. Error       t value      Pr(>|t|) 
#> -4.377037e-01  8.379034e-02 -5.223797e+00  1.331761e-06
  • OTE (Total Effect): corresponding to the effect of exposure variables on the outcome variable when the mediators MSM^S are included in the model.
hdmax2_step2$effects$X_continuous$ote
#>       Estimate     Std. Error        t value       Pr(>|t|) 
#>  -6.079726e-01   1.445786e-03  -4.205137e+02  1.921157e-151

In addition, function estimate_effect estimates the intermediary effect sizes aja_j and bjb_j and their standard deviations.

head(hdmax2_step2$effects$X_continuous$xm)
#>    Estimate   Std.Error   t.Value        pValue       feat
#> 1 4.2259065 0.009426571 448.29732 5.704808e-154 cg00025981
#> 2 1.9188493 0.009315837 205.97713 2.872086e-123 cg00075768
#> 3 3.8615528 0.009380363 411.66348 1.329727e-150 cg00019093
#> 4 0.7093047 0.007568283  93.72068  2.624847e-92 cg00066729
#> 5 4.1366585 0.012411325 333.29708 2.910289e-142 cg00031759
#> 6 1.7521157 0.017207734 101.82141  1.490810e-95 cg00084432
head(hdmax2_step2$effects$X_continuous$my)
#>       Estimate   Std.Error   t.Value       pValue       feat
#> 1 -0.003192755 0.016163477 -0.197529 8.438590e-01 cg00025981
#> 2  0.053059523 0.015373373  3.451391 8.513707e-04 cg00075768
#> 3 -0.022233686 0.016076691 -1.382976 1.700936e-01 cg00019093
#> 4  0.093697599 0.017548046  5.339489 6.935596e-07 cg00066729
#> 5 -0.006766110 0.012258311 -0.551961 5.823431e-01 cg00031759
#> 6  0.035020765 0.008050415  4.350182 3.579826e-05 cg00084432

Vizualisation of results

We propose a set of plots including:

  • Mediators ACME Forest plot

  • Mediators PM Forest plot

  • Comparison of ODE, OIE and OTE

  • Mediators effect size representation

hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")

  • A Estimates of indirect effect (ACME) and B proportions of mediated effect (PM) for the top 10 mediators. The effect estimate is represented by a dot and its 95% CI by the bar. Symbols correspond to the significance cut off of 5% (square for p-value 0.05\geq 0.05, circle p-value <0.05< 0.05). Colors correspond to the sign of the effect (green for estimated effect 0\leq 0 , red for estimated effect >0> 0).

  • C Effect sizes of Overall Direct Effect (ODE), Overall Indirect Effect (OIE) and Overall Total Effect (OTE). Error bars correspond to standard deviation (ODE and OTE) or confidence interval (OIE).

  • D Indirect effect sizes for the selected mediators. Black corresponds to the ACME, violet to the effect of exposure XX on mediator M, and blue corresponds to the effect of mediator M on outcome YY.

In the plot_hdmax2 function it is possible to produce the 4-plots set or each individual plot with plot_type argument.

1rst Use Case : Breast cancer

tcga_brca was downloaded and processed from the Cancer Genome Atlas TCGA-BRCA dataset (. 2019 version of the data.)

It is a list of 4 matrices : X (exposure), M (potential mediators), Y (outcome) and CF (confounding factor)) :

  • X : exposure variable, a matrix of 904 rows (patients) and 1 column (her-status : positive = 1 and negative = 0)
  • M : potential mediators, a matrix of 19959 rows (beta-values of CpG probes) and 904 columns (patients)
  • Y : outcome variable, a matrix of 904 rows (patients) and 1 column (risk-score, a continuous variable)
  • CF : confounding factor, a matrix of 904 rows (patients) and 1 column (age)

if (!(file.exists("tcga_brca.RData"))) {
  url_file <- "https://zenodo.org/records/14716548/files/tcga_brca.RData?download=1"
  response <- httr::GET(url_file, httr::write_disk("tcga_brca.RData", overwrite = TRUE))
  
  if (response$status_code == 200) {
    load("tcga_brca.RData")
  } else {
    stop("Download failed")
  }
}
## Number of Latent factore estimation
pc <- prcomp(tcga_brca$M)
plot((pc$sdev^2/sum(pc$sdev^2))[1:10],
     type = "b",
     xlab = 'Principal Component',
     ylab = "Explained variance")


K = 2 #The screeplot indicates around 2 main components in the data. 

STEP 1: Run association studies

The run_AS function is applied to estimate the effects of exposure XX on a matrix MM of potential mediators, and the effect of each potential mediators on outcome YY .We also add the age as a known counfouding factor CFCF.


hdmax2_step1 = hdmax2::run_AS(exposure = data.frame(tcga_brca$X),
                              outcome = as.vector(tcga_brca$Y),
                              M = t(tcga_brca$M), 
                              K = K,
                              covar = tcga_brca$CF)

max-squared test PP-values results.

head(hdmax2_step1$AS_1$pval)
#>   cg00000292   cg00002426   cg00003994   cg00005847   cg00007981   cg00008493 
#> 0.0009873141 0.7773962442 0.2813067905 0.0944946938 0.1972223151 0.0626456118

pval1 = hdmax2_step1$AS_1$pval
pval2 = hdmax2_step1$AS_2$pval
cor_value = cor(pval1,pval2)

p1 = ggplot2::ggplot(data = data.frame(x = pval1, y = pval2), aes(x = pval1, y = pval2)) +
  geom_bin2d(bins = 50) + 
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      guide = guide_colorbar(title = paste("correlation\n = ",round(cor_value,3),"\n \n Count"), 
                                             title.position = "top")) + 
  theme_minimal() +
  labs(title = "A",
       x = "1srt regresssion (M ~ X) pvalues", y = "2nd regresssion (Y ~ M + X) pvalues")

data_hdmax2_step1 <- data.frame(pvalues = hdmax2_step1$max2_pvalues)

p2 = ggplot2::ggplot(data_hdmax2_step1, aes(x = pvalues)) + geom_histogram(bins = 100, color = "black", fill = "gray")  + labs(title = "B", x = "Max2 test pvalues distribution")

gridExtra::grid.arrange( p1, p2, ncol = 2, top = grid::textGrob("Hypothesis verification", gp = grid::gpar(fontsize = 14, fontface = "bold")))

Selection of a subset of mediators

We selected the top 10 mediators.

## Selecting top 10 mediators
top10_max2 = sort(hdmax2_step1$max2_pvalues)[1:10]
top10_max2
#>   cg11911951   cg26530341   cg00138126   cg11203041   cg03070194   cg01804429 
#> 1.156282e-08 5.207120e-08 7.359065e-07 7.819724e-07 5.101811e-06 5.404395e-06 
#>   cg15852891   cg13760253   cg03684977   cg00400263 
#> 6.103286e-06 1.420400e-05 1.948391e-05 3.468113e-05
mediators_top10 = (t(tcga_brca$M))[,names(sort(hdmax2_step1$max2_pvalues)[1:10])]

STEP 2

The function estimate_effect estimate the individual indirect effect of mediators, but also overall effects of selected mediators.

The function estimate_effect takes as inputs, step 1 object and selected mediators matrix MSM^S from chosen selection method apply on max-squared test PP-values.

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = mediators_top10)

The function estimate_effect use mediation::mediate function to obtain several effects estimation with uncertainty:

  • ACME Average Causal Mediation Effect: corresponding to the indirect effect
  • PM Proportion Mediate: corresponding to the proportion of the total effect that is mediated by the mediator
  • TE total effect: which is equal to the sum of direct and indirect effect
  • ADE Average Direct Effect: which represents the unmediated effect
head(hdmax2_step2$effects$her2_status$ACME)
#>         est    CI_2.5   CI_97.5 pval       feat
#> 1 0.2777776 0.2002591 0.3698010    0 cg11911951
#> 2 0.2300377 0.1476274 0.3196329    0 cg26530341
#> 3 0.1844932 0.1141724 0.2598899    0 cg00138126
#> 4 0.2150545 0.1322434 0.3029459    0 cg11203041
#> 5 0.1985895 0.1306884 0.2745838    0 cg03070194
#> 6 0.1688528 0.1044416 0.2381085    0 cg01804429
head(hdmax2_step2$effects$her2_status$PM)
#>         est    CI_2.5  CI_97.5  pval       feat
#> 1 1.1328288 0.5729764 6.086671 0.022 cg11911951
#> 2 0.9382485 0.4366347 4.753406 0.030 cg26530341
#> 3 0.7548266 0.3812159 3.317571 0.018 cg00138126
#> 4 0.8754547 0.4401678 3.941255 0.030 cg11203041
#> 5 0.8502728 0.4005929 3.516775 0.022 cg03070194
#> 6 0.7000365 0.3054582 3.327866 0.036 cg01804429
head(hdmax2_step2$effects$her2_status$TE)
#>         est     CI_2.5   CI_97.5  pval       feat
#> 1 0.2396293 0.03093309 0.4377093 0.022 cg11911951
#> 2 0.2393738 0.01721429 0.4495655 0.030 cg26530341
#> 3 0.2382278 0.03970348 0.4438876 0.018 cg00138126
#> 4 0.2389713 0.02608946 0.4545398 0.030 cg11203041
#> 5 0.2318520 0.03681107 0.4275658 0.022 cg03070194
#> 6 0.2354500 0.01459732 0.4396714 0.036 cg01804429
head(hdmax2_step2$effects$her2_status$ADE)
#>            est     CI_2.5   CI_97.5  pval       feat
#> 1 -0.038148279 -0.2412501 0.1486566 0.738 cg11911951
#> 2  0.009336078 -0.1979560 0.2025614 0.930 cg26530341
#> 3  0.053734580 -0.1436135 0.2492398 0.576 cg00138126
#> 4  0.023916803 -0.1865651 0.2187343 0.804 cg11203041
#> 5  0.033262513 -0.1578930 0.2306731 0.718 cg03070194
#> 6  0.066597216 -0.1488057 0.2704019 0.514 cg01804429

This step also compute Overall effects :

  • OIE (Indirect effect): corresponding to the sum of the indirect effect associated with all mediators.
  • ODE (Direct Effect): corresponding to the effect of exposure variables on the outcome variable.
  • OTE (Total Effect): corresponding to the effect of exposure variables on the outcome variable when the mediators MSM^S are included in the model.

hdmax2_step2$effects$her2_status$oie_med # OIE median for first exposure variable
#> [1] 0.6012288
hdmax2_step2$effects$her2_status$ode
#>      Estimate    Std. Error       t value      Pr(>|t|) 
#> -0.3668303938  0.1029342201 -3.5637360789  0.0003850724
hdmax2_step2$effects$her2_status$ote
#>   Estimate Std. Error    t value   Pr(>|t|) 
#> 0.23456885 0.10567773 2.21966201 0.02669056

Vizualisation of results

hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")

  • A Estimates of indirect effect (ACME) and B proportions of mediated effect (PM) for the top 10 mediators. The effect estimate is represented by a dot and its 95% CI by the bar. Symbols correspond to the significance cut off of 5% (square for p-value 0.05\geq 0.05, circle p-value <0.05< 0.05). Colors correspond to the sign of the effect (green for estimated effect 0\leq 0 , red for estimated effect >0> 0).

  • C Effect sizes of Overall Direct Effect (ODE), Overall Indirect Effect (OIE) and Overall Total Effect (OTE). Error bars correspond to standard deviation (ODE and OTE) or confidence interval (OIE).

  • D Indirect effect sizes for the selected mediators. Black corresponds to the ACME, violet to the effect of exposure XX on mediator M, and blue corresponds to the effect of mediator M on outcome YY.

In the plot_hdmax2 function it is possible to produce the 4-plots set or each individual plot with plot_type argument.

2nd Use Case : Multiple sclerosis

ms_GSE137143 was downloaded and processed from the GEO repository (. 2024 version of the data.)

It is a list of 3 matrices : X (exposure), M (potential mediators), Y (outcome) :

  • X : exposure variable, a matrix of 104 rows (samples) and 1 column (gender : female = 1 and male = 0)
  • M : potential mediators, a matrix of 17283 rows (gene expression) and 104 columns (samples)
  • Y : outcome variable, a matrix of 104 rows (samples) and 1 column (subtype : RR = 1, CIS = 0)
if (!(file.exists("ms_GSE137143.RData"))) {
  url_file <- "https://zenodo.org/records/14716548/files/ms_GSE137143.RData?download=1"
  response <- httr::GET(url_file, httr::write_disk("ms_GSE137143.RData", overwrite = TRUE))
  
  if (response$status_code == 200) {
    load("ms_GSE137143.RData")
  } else {
    stop("Download failed")
  }
}
## Number of Latent factore estimation
pc <- prcomp(ms_GSE137143$M)
plot((pc$sdev^2/sum(pc$sdev^2))[1:10],
     type = "b",
     xlab = 'Principal Component',
     ylab = "Explained variance")


K = 2 #The screeplot indicates around 2 main components in the data. 

STEP 1: Run association studies

The run_AS function is applied to estimate the effects of exposure XX on a matrix MM of potential mediators, and the effect of each potential mediators on outcome YY .We also add the age as a known counfouding factor CFCF.


hdmax2_step1 = hdmax2::run_AS(exposure = data.frame(ms_GSE137143$X),
                              outcome = as.vector(ms_GSE137143$Y),
                              M = t(ms_GSE137143$M), 
                              K = K)

max-squared test PP-values results.

head(hdmax2_step1$AS_1$pval)
#>    TSPAN6      DPM1     SCYL3  C1orf112       FGR       CFH 
#> 0.2875623 0.3091126 0.9806292 0.1870420 0.3149651 0.6262667


pval1 = hdmax2_step1$AS_1$pval
pval2 = hdmax2_step1$AS_2$pval
cor_value = cor(pval1,pval2)

p1 = ggplot2::ggplot(data = data.frame(x = pval1, y = pval2), aes(x = pval1, y = pval2)) +
  geom_bin2d(bins = 50) + 
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                      guide = guide_colorbar(title = paste("correlation\n = ",round(cor_value,3),"\n \n Count"), 
                                             title.position = "top")) + 
  theme_minimal() +
  labs(title = "A",
       x = "1srt regresssion (M ~ X) pvalues", y = "2nd regresssion (Y ~ M + X) pvalues")


data_hdmax2_step1 <- data.frame(pvalues = hdmax2_step1$max2_pvalues)

p2 = ggplot2::ggplot(data_hdmax2_step1, aes(x = pvalues)) + geom_histogram(bins = 100, color = "black", fill = "gray")  + labs(title = "B", x = "Max2 test pvalues distribution")

gridExtra::grid.arrange( p1, p2, ncol = 2, top = grid::textGrob("Hypothesis verification", gp = grid::gpar(fontsize = 14, fontface = "bold")))

Selection of a subset of mediators

We selected the top 10 mediators.

## Selecting top 10 mediators
top2_max2 = sort(hdmax2_step1$max2_pvalues)[1:2]
top2_max2
#>       MVB12B         NT5E 
#> 2.035825e-06 4.526574e-06
mediators_top2 = (t(ms_GSE137143$M))[,names(sort(hdmax2_step1$max2_pvalues)[1:2])]

STEP 2

The function estimate_effect estimate the individual indirect effect of mediators, but also overall effects of selected mediators.

The function estimate_effect takes as inputs, step 1 object and selected mediators matrix MSM^S from chosen selection method apply on max-squared test PP-values.

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = mediators_top2)

The function estimate_effect use mediation::mediate function to obtain several effects estimation with uncertainty:

  • ACME Average Causal Mediation Effect: corresponding to the indirect effect
  • PM Proportion Mediate: corresponding to the proportion of the total effect that is mediated by the mediator
  • TE total effect: which is equal to the sum of direct and indirect effect
  • ADE Average Direct Effect: which represents the unmediated effect
head(hdmax2_step2$effects$gender$ACME)
#>           est      CI_2.5     CI_97.5  pval   feat
#> 1 -0.08841351 -0.18974415 -0.02025732 0.002 MVB12B
#> 2  0.06462609  0.01593701  0.14019367 0.002   NT5E
head(hdmax2_step2$effects$gender$PM)
#>          est      CI_2.5     CI_97.5  pval   feat
#> 1  0.2813908  0.06261491  1.11418073 0.004 MVB12B
#> 2 -0.1907163 -0.85971848 -0.04296857 0.002   NT5E
head(hdmax2_step2$effects$gender$TE)
#>          est     CI_2.5    CI_97.5  pval   feat
#> 1 -0.2972814 -0.4580537 -0.1009035 0.002 MVB12B
#> 2 -0.3103675 -0.4695157 -0.1356918 0.000   NT5E
head(hdmax2_step2$effects$gender$ADE)
#>          est     CI_2.5     CI_97.5  pval   feat
#> 1 -0.1739796 -0.3465319  0.01285077 0.078 MVB12B
#> 2 -0.4430841 -0.6056380 -0.24470431 0.000   NT5E

This step also compute Overall effects :

  • OIE (Indirect effect): corresponding to the sum of the indirect effect associated with all mediators.
  • ODE (Direct Effect): corresponding to the effect of exposure variables on the outcome variable.
  • OTE (Total Effect): corresponding to the effect of exposure variables on the outcome variable when the mediators MSM^S are included in the model.
hdmax2_step2$effects$gender$oie_med # OIE median for first exposure variable
#> [1] -0.008105571
hdmax2_step2$effects$gender$ode
#>     Estimate   Std. Error      z value     Pr(>|z|) 
#> -1.824871656  0.686797321 -2.657074510  0.007882202
hdmax2_step2$effects$gender$ote
#>     Estimate   Std. Error      z value     Pr(>|z|) 
#> -1.604244218  0.548823596 -2.923059847  0.003466099

Vizualisation of results

We propose a set of plots including:

  • Mediators ACME Forest plot

  • Mediators PM Forest plot

  • Comparison of ODE, OIE and OTE

  • Mediators effect size representation

hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")

  • A Estimates of indirect effect (ACME) and B proportions of mediated effect (PM) for the top 10 mediators. The effect estimate is represented by a dot and its 95% CI by the bar. Symbols correspond to the significance cut off of 5% (square for p-value 0.05\geq 0.05, circle p-value <0.05< 0.05). Colors correspond to the sign of the effect (green for estimated effect 0\leq 0 , red for estimated effect >0> 0).

  • C Effect sizes of Overall Direct Effect (ODE), Overall Indirect Effect (OIE) and Overall Total Effect (OTE). Error bars correspond to standard deviation (ODE and OTE) or confidence interval (OIE).

  • D Indirect effect sizes for the selected mediators. Black corresponds to the ACME, violet to the effect of exposure XX on mediator M, and blue corresponds to the effect of mediator M on outcome YY.

In the plot_hdmax2 function it is possible to produce the 4-plots set or each individual plot with plot_type argument.

Helper Functions

In this vignette, we also provide a series of helper function to process the data.

How to analyse agregated methylated regions ?

We simulated 100 samples and 200 potential DNA methylation mediators, with various a binary exposure (smoking status of mothers) and continuous outcomes (birth weight).

Identifying aggregated mediator regions (AMR)

Identify Aggregated Methylated Regions (AMR) with AMR_search function which uses from the PP-values from max-squared test compute in run_AS function and using a adapted comb-p method (comb-p is a tool that manipulates BED files of possibly irregularly spaced PP-values and calculates auto-correlation, combines adjacent PP-values, performs false discovery adjustment, finds regions of enrichment and assigns significance to those regions). AMR identification could be useful in Epigenomic Wide Association Studies (EWAS) when single CpG mediation is unsuccessful, or could be complementary analysis.

helper_ex = hdmax2::helper_ex

pc <- prcomp(helper_ex$methylation)
plot((pc$sdev^2/sum(pc$sdev^2))[1:10],
     type = "b",
     xlab = 'Principal Component',
     ylab = "Explained variance")

# chosen number of dimension
K=5
STEP 1: Run association studies
## run hdmax2 step1
hdmax2_step1 = hdmax2::run_AS(exposure = helper_ex$exposure,
                              outcome = helper_ex$phenotype,
                              M = helper_ex$methylation, 
                              K = K)
Detecting AMR

seed = 0.7 #Careful to change this parameter when working with real data
res.amr_search = hdmax2::AMR_search(chr = helper_ex$annotation$chr,
                                    start = helper_ex$annotation$start,
                                    end = helper_ex$annotation$end,
                                    pval = hdmax2_step1$max2_pvalues,
                                    cpg = helper_ex$annotation$cpg,
                                    seed = seed, 
                                    nCores = 2)

res.amr_search$res
#>   chr  start    end          p        fdr
#> 3   1 864893 865027 0.03604853 0.05184658
#> 2   1 858380 858381 0.03901609 0.05184658
#> 4   1 886431 886553 0.04077365 0.05184658
#> 5   1 896773 896774 0.05000351 0.05184658
#> 1   1 854778 855056 0.05184658 0.05184658

res.arm_build = hdmax2::AMR_build(res.amr_search, 
                                  methylation = helper_ex$methylation, 
                                  nb_cpg = 2)

#List of DMR selected
head(res.arm_build$res)
#>    DMR chr  start    end          p        fdr nb
#> 1 DMR1   1 886431 886553 0.04077365 0.05184658  3
#> 2 DMR2   1 854778 855056 0.05184658 0.05184658  6

##CpG in the DMR
res.arm_build$CpG_for_each_AMR
#> $DMR1
#> [1] "cg20322444" "cg07983659" "cg11733071"
#> 
#> $DMR2
#> [1] "cg08029603" "cg16347928" "cg22699361" "cg03890988" "cg21786289"
#> [6] "cg13897241"

Quantifying indirect effects (Step 2)

Like with single mediators analysis esimate_effect function could be use to estimate different effects of AMR. Also plot_hdmax2 can be applied to step2 results.

## run hdmax2 step2
object = hdmax2_step1
m = as.matrix(res.arm_build$AMR_mean)

## selected mediators effects estimation

hdmax2_step2 = hdmax2::estimate_effect(object = hdmax2_step1,
                                       m = m)

hdmax2::plot_hdmax2(hdmax2_step2, plot_type= "all_plot")

How to select potential mediators using q-values and FDR control ?

Several methods can be use to select potential mediators from step1 results of our method, in our main use cases we simply select top 10 mediators to simplify the narrative. Among available methods to select mediators from mediation test PP-values, we can use FDR (False Discovery Rate) control. QQ-value is obtained from max-squared test PP-values from step 1 with hdmax2::hdmax2_qvalue function then bounded by chosen threshold.

## run hdmax2 step1
hdmax2_step1 = hdmax2::run_AS(exposure = simu_data$X_binary,
                 outcome = simu_data$Y_continuous,
                 M = simu_data$M1, 
                 K = K)

## Select candidate mediator  
q_values = hdmax2::hdmax2_qvalue(hdmax2_step1$max2_pvalues)
candidate_mediator <- q_values$qv[q_values$qv<= 0.2] # In this example, we will consider FDR levels <20%.
candidate_mediator
#>   cg00026919   cg00035636   cg00049102   cg00049616   cg00050792   cg00055679 
#> 1.434951e-01 1.748547e-03 2.445138e-03 6.103922e-02 3.891147e-05 6.756850e-02 
#>   cg00056767   cg00066729   cg00071026   cg00071658   cg00071727   cg00075768 
#> 5.357341e-02 4.286252e-11 9.843811e-09 1.073321e-04 3.527028e-03 3.897026e-05 
#>   cg00076854   cg00084432   cg00089444   cg00092518   cg00093478   cg00098553 
#> 1.909772e-06 4.403319e-06 6.582410e-12 1.581844e-01 2.279497e-05 3.023057e-07 
#>   cg00102591   cg00119011   cg00121843   cg00124836   cg00126948   cg00127894 
#> 4.846930e-22 1.376804e-14 3.505372e-03 5.728461e-10 1.650267e-09 1.299045e-05 
#>   cg01165817 
#> 6.346643e-02

How to transform a categorial variable into an ordered factor?

When categorical variables are used as exposure variables, hdmax2 uses the first category as a reference (intercept) to calculate the effects associated with the variable’s other categories. The functions HDMAX2::run_AS and HDMAX2::estimate_effect will transform the character vector you have used (if any) into a factor, with an arbitrary ordering of the categories. If order is important to you, here’s a simple way to turn your character vector into a factor and order the categories as you wish. You can then use this variable as input to the hdmax2 functions.

# inverse alphabetical order for example
X = simu_data$X_categorial
X = factor(X, ordered = TRUE, levels = c("D","C","B","A"))

How to handle adding an additional set of covariates to the second association study?

It is possible to add a second adjustment factors set in association study between potential mediators and outcome if it makes sense from a biological standpoint. Nevertheless, this additional set of adjustment factors for the second association study must include the adjustment factors from the first association study as design in run_AS function.

# First adjustment factors set
covar = as.data.frame(helper_ex$covariables[,1:2])
# Second adjustment factors set
suppl_covar = as.data.frame(helper_ex$covariables[,3:4])

hdmax2_step1 = hdmax2::run_AS(exposure = helper_ex$exposure,
                              outcome = helper_ex$phenotype,
                              M = helper_ex$methylation,
                              K = K,
                              covar = covar,
                              suppl_covar = suppl_covar)