Estimating cell-type composition using Gene Expression data

S Karkar, M Richard, Y Blum

2021-10-01

Based on Gene Expression values, the vignette shows how to estimate cell-type composition and cell type-specific expression profiles.

Data

We provide access to a matrix of Gene Expression values D from a pseudo-bulk, that is, a simulation of complex tissue comprised of \(k=9\) cell types.

D = readRDS(url("https://figshare.com/ndownloader/files/30587814"))
dim(D)
## [1] 26485    50
head(D)[1:5, 1:5]
##                 M1           M2          M3       M4       M5
## DDX11L1   1.001393  0.004283769 0.001109519 4.001646 1.000574
## WASH7P    8.967564 11.902306950 6.890561585 9.200173 7.142057
## MIR6859-3 2.000000  1.000000000 0.000000000 0.000000 0.000000
## MIR6859-2 1.000000  1.000000000 3.000000000 0.000000 1.000000
## MIR6859-1 0.000000  0.000000000 0.000000000 0.000000 0.000000

Step 1: Normalizing data

In addition to cell-type composition, Gene Expression values can vary because of other technical variables, such as, for RNA-Seq, read depth. We use the function run_norm to account for this effect. By default, the function assumes that the normalizing method is “RPM”, In addition we provide access to 3 other methods “TPM”, “DESeq2” (and its in-house rewriting : “MR” that do do required the DESeq2 package), and “edgeR”. Note that “TPM” requires the gene length (in bp) in form of a vector of appropriate length.

library(gedepir)
 D_norm=run_norm(mix_matrix = D,method = "RPM")

Step 2: Transforming data

We provide the function run_trans, which performs a transformation of the data. By default, the transformation method is “linear”, which simply return the input data. Additionally, we provide 2 transformations : “log2” and “pseudolog”. “log2” transformation apply the mathematical function \(f(x)=log2(x+1)\) where \(x\) is the input data, and “pseudolog” corresponds to the function \(f(x)=asinh(x)\)

 D_trans= run_trans(mix_matrix = D_norm,method = "log2")

Step 3: Feature selection

The function run_featsel select vataiables (genes) with the largest Coefficient of Variation. We propose to select 5000 genes (“cv5000”) or 1000 genes (“cv1000”). By removing genes that carry little or no information, deconvolution routines run much faster.

 D_fsel= run_featsel(mix_matrix = D_trans, method = "cv1000")

Step 4: Running deconvolution methods

By default, we propose 2 methods of deconvolution: “NMF” to run NMF::snmf, “ICA” from the deconica package. Additionally, when available on the system, “CDSeq” and “PREDE” methods can be used.

To select the number of cell types, Cattell’s rule uses PCA analysis and recommends to keep principal components to the left of the elbow in the eigenvalue plot. In practice Cattell rule suggests to keep the number of principal components at the observed elbow plus one.

Here we show the results obtained with NMF.

results_NMF = run_deconv(mix_matrix = D_fsel,k= 9, method = "NMF")

Step 5: Enrichment analysis

gedepir includes a wrapper for fgsea to analyze the estimated cell type profiles.

We provide several databases available for enrichment analysis available on figShare

Database FileName url Description
CellMarker ‘cellmarkerDB.RDS’ link CellMarker (Zhang et al., 2019, DOI: 10.1093/nar/gky900), human restricted marker file
CEllMatch ‘cellmatchDB.RDS’ link CEllMatch (Shao et al, 2020,DOI: 10.1016/j.isci.2020.100882, included in the scCATCH tool
NCI-60 ‘Cancer_Cell_Lines.RDS’ link NCI-60 Human Tumor Cell Lines Screen
GTEx ‘GTEx.RDS’ link Genotype-Tissue Expression project, specific expression (up and down) for 2918 tissue samples
Reactome ‘Reactome_2016.RDS’ link Reactome db, free, curated and peer-reviewed pathway database. v.2016
GO db ‘GO.DB.RDS’ link Gene Ontology database, annotation for functions of genes
Kegg ‘KEGG_2019_Human.RDS’ link Kyoto Encyclopedia of Genes and Genomes database, high-level annotations from molecular-level information, v.2019
library(fgsea)
## Loading required package: Rcpp
database= readRDS(url("https://figshare.com/ndownloader/files/30593259"))
gedepir::enrich(results_NMF$T_matrix,pathways = database,ICAbased = FALSE, fdr = FALSE)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
##                         pathway      padj       pval        ES
## 7 Kidney.Normal_cell.Neutrophil 0.2837631 0.03547038 0.9544087
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
##                                    pathway      padj       pval        ES
## 8 Brain.Normal_cell.Lake_et_al.Science.Ex8 0.1506063 0.01882578 0.9107322

Step 6: Build your pipeline

gedepir simplifies the definition of an end-to-end analysis pipeline with a set of base functions that are connected through the pipes syntax %>% used in magrittr, tidyr or dplyr packages.

library(magrittr)
results_NMF= D %>%
  run_norm(method = "RPM") %>%
  run_trans(method = "linear") %>%
  run_featsel(method = "cv1000") %>%
  run_deconv(method = "ICA", ,k= 9)
gedepir::enrich(results_NMF$T_matrix,
                database,
                ICAbased = FALSE,
                fdr = FALSE)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
##                                        pathway       padj        pval        ES
## 2     Blood.Normal_cell.Circulating_fetal_cell 0.08887257 0.002693108 0.9962321
## 32 Blood.Normal_cell.CD1Cplus_B_dendritic_cell 0.23834007 0.014444852 0.9849699
## 
## [[4]]
##                                       pathway      padj       pval        ES
## 19 Embryo.Normal_cell.Primitive_endoderm_cell 0.4041927 0.01224826 0.9614836
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL