Estimating cell-type composition using Gene Expression data

Based on Gene Expression values, the vignette shows how to estimate cell-type composition and cell type-specific expression profiles.

Data

We provide access to a matrix of Gene Expression values D from a pseudo-bulk, that is, a simulation of complex tissue comprised of \(k=9\) cell types.

D = readRDS(url("https://figshare.com/ndownloader/files/30587814"))
dim(D)

## [1] 26485    50

head(D)[1:5, 1:5]

##                 M1           M2          M3       M4       M5
## DDX11L1   1.001393  0.004283769 0.001109519 4.001646 1.000574
## WASH7P    8.967564 11.902306950 6.890561585 9.200173 7.142057
## MIR6859-3 2.000000  1.000000000 0.000000000 0.000000 0.000000
## MIR6859-2 1.000000  1.000000000 3.000000000 0.000000 1.000000
## MIR6859-1 0.000000  0.000000000 0.000000000 0.000000 0.000000

Step 1: Normalizing data

In addition to cell-type composition, Gene Expression values can vary because of other technical variables, such as, for RNA-Seq, read depth. We use the function run_norm to account for this effect. By default, the function assumes that the normalizing method is “RPM”, In addition we provide access to 3 other methods “TPM”, “DESeq2” (and its in-house rewriting : “MR” that do do required the DESeq2 package), and “edgeR”. Note that “TPM” requires the gene length (in bp) in form of a vector of appropriate length.

library(gedepir)
 D_norm=run_norm(mix_matrix = D,method = "RPM")

Step 2: Transforming data

We provide the function run_trans, which performs a transformation of the data. By default, the transformation method is “linear”, which simply return the input data. Additionally, we provide 2 transformations : “log2” and “pseudolog”. “log2” transformation apply the mathematical function \(f(x)=log2(x+1)\) where \(x\) is the input data, and “pseudolog” corresponds to the function \(f(x)=asinh(x)\)

 D_trans= run_trans(mix_matrix = D_norm,method = "log2")

Step 3: Feature selection

The function run_featsel select vataiables (genes) with the largest Coefficient of Variation. We propose to select 5000 genes (“cv5000”) or 1000 genes (“cv1000”). By removing genes that carry little or no information, deconvolution routines run much faster.

 D_fsel= run_featsel(mix_matrix = D_trans, method = "cv1000")

Step 4: Running deconvolution methods

By default, we propose 2 methods of deconvolution: “NMF” to run NMF::snmf, “ICA” from the deconica package. Additionally, when available on the system, “CDSeq” and “PREDE” methods can be used.

To select the number of cell types, Cattell’s rule uses PCA analysis and recommends to keep principal components to the left of the elbow in the eigenvalue plot. In practice Cattell rule suggests to keep the number of principal components at the observed elbow plus one.

Here we show the results obtained with NMF.

results_NMF = run_deconv(mix_matrix = D_fsel,k= 9, method = "NMF")

Step 5: Enrichment analysis

gedepir includes a wrapper for fgsea to analyze the estimated cell type profiles.

We provide several databases available for enrichment analysis available on figShare

Database	FileName	url	Description
CellMarker	‘cellmarkerDB.RDS’	link	CellMarker (Zhang et al., 2019, DOI: 10.1093/nar/gky900), human restricted marker file
CEllMatch	‘cellmatchDB.RDS’	link	CEllMatch (Shao et al, 2020,DOI: 10.1016/j.isci.2020.100882, included in the scCATCH tool
NCI-60	‘Cancer_Cell_Lines.RDS’	link	NCI-60 Human Tumor Cell Lines Screen
GTEx	‘GTEx.RDS’	link	Genotype-Tissue Expression project, specific expression (up and down) for 2918 tissue samples
Reactome	‘Reactome_2016.RDS’	link	Reactome db, free, curated and peer-reviewed pathway database. v.2016
GO db	‘GO.DB.RDS’	link	Gene Ontology database, annotation for functions of genes
Kegg	‘KEGG_2019_Human.RDS’	link	Kyoto Encyclopedia of Genes and Genomes database, high-level annotations from molecular-level information, v.2019

library(fgsea)

## Loading required package: Rcpp

database= readRDS(url("https://figshare.com/ndownloader/files/30593259"))
gedepir::enrich(results_NMF$T_matrix,pathways = database,ICAbased = FALSE, fdr = FALSE)

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
##                         pathway      padj       pval        ES
## 7 Kidney.Normal_cell.Neutrophil 0.2837631 0.03547038 0.9544087
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
##                                    pathway      padj       pval        ES
## 8 Brain.Normal_cell.Lake_et_al.Science.Ex8 0.1506063 0.01882578 0.9107322

Step 6: Build your pipeline

gedepir simplifies the definition of an end-to-end analysis pipeline with a set of base functions that are connected through the pipes syntax %>% used in magrittr, tidyr or dplyr packages.

library(magrittr)
results_NMF= D %>%
  run_norm(method = "RPM") %>%
  run_trans(method = "linear") %>%
  run_featsel(method = "cv1000") %>%
  run_deconv(method = "ICA", ,k= 9)
gedepir::enrich(results_NMF$T_matrix,
                database,
                ICAbased = FALSE,
                fdr = FALSE)

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
##                                        pathway       padj        pval        ES
## 2     Blood.Normal_cell.Circulating_fetal_cell 0.08887257 0.002693108 0.9962321
## 32 Blood.Normal_cell.CD1Cplus_B_dendritic_cell 0.23834007 0.014444852 0.9849699
## 
## [[4]]
##                                       pathway      padj       pval        ES
## 19 Embryo.Normal_cell.Primitive_endoderm_cell 0.4041927 0.01224826 0.9614836
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
## 
## [[7]]
## NULL
## 
## [[8]]
## NULL
## 
## [[9]]
## NULL