Based on Gene Expression values, the vignette shows how to estimate cell-type composition and cell type-specific expression profiles.
We provide access to a matrix of Gene Expression values D
from a pseudo-bulk, that is, a simulation of complex tissue comprised of \(k=9\) cell types.
= readRDS(url("https://figshare.com/ndownloader/files/30587814"))
D dim(D)
## [1] 26485 50
head(D)[1:5, 1:5]
## M1 M2 M3 M4 M5
## DDX11L1 1.001393 0.004283769 0.001109519 4.001646 1.000574
## WASH7P 8.967564 11.902306950 6.890561585 9.200173 7.142057
## MIR6859-3 2.000000 1.000000000 0.000000000 0.000000 0.000000
## MIR6859-2 1.000000 1.000000000 3.000000000 0.000000 1.000000
## MIR6859-1 0.000000 0.000000000 0.000000000 0.000000 0.000000
In addition to cell-type composition, Gene Expression values can vary because of other technical variables, such as, for RNA-Seq, read depth. We use the function run_norm
to account for this effect. By default, the function assumes that the normalizing method is “RPM”, In addition we provide access to 3 other methods “TPM”, “DESeq2” (and its in-house rewriting : “MR” that do do required the DESeq2 package), and “edgeR”. Note that “TPM” requires the gene length (in bp) in form of a vector of appropriate length.
library(gedepir)
=run_norm(mix_matrix = D,method = "RPM") D_norm
We provide the function run_trans
, which performs a transformation of the data. By default, the transformation method is “linear”, which simply return the input data. Additionally, we provide 2 transformations : “log2” and “pseudolog”. “log2” transformation apply the mathematical function \(f(x)=log2(x+1)\) where \(x\) is the input data, and “pseudolog” corresponds to the function \(f(x)=asinh(x)\)
= run_trans(mix_matrix = D_norm,method = "log2") D_trans
The function run_featsel
select vataiables (genes) with the largest Coefficient of Variation. We propose to select 5000 genes (“cv5000”) or 1000 genes (“cv1000”). By removing genes that carry little or no information, deconvolution routines run much faster.
= run_featsel(mix_matrix = D_trans, method = "cv1000") D_fsel
By default, we propose 2 methods of deconvolution: “NMF” to run NMF::snmf
, “ICA” from the deconica package. Additionally, when available on the system, “CDSeq” and “PREDE” methods can be used.
To select the number of cell types, Cattell’s rule uses PCA analysis and recommends to keep principal components to the left of the elbow in the eigenvalue plot. In practice Cattell rule suggests to keep the number of principal components at the observed elbow plus one.
Here we show the results obtained with NMF
.
= run_deconv(mix_matrix = D_fsel,k= 9, method = "NMF") results_NMF
gedepir
includes a wrapper for fgsea
to analyze the estimated cell type profiles.
We provide several databases available for enrichment analysis available on figShare
Database | FileName | url | Description |
---|---|---|---|
CellMarker | ‘cellmarkerDB.RDS’ | link | CellMarker (Zhang et al., 2019, DOI: 10.1093/nar/gky900), human restricted marker file |
CEllMatch | ‘cellmatchDB.RDS’ | link | CEllMatch (Shao et al, 2020,DOI: 10.1016/j.isci.2020.100882, included in the scCATCH tool |
NCI-60 | ‘Cancer_Cell_Lines.RDS’ | link | NCI-60 Human Tumor Cell Lines Screen |
GTEx | ‘GTEx.RDS’ | link | Genotype-Tissue Expression project, specific expression (up and down) for 2918 tissue samples |
Reactome | ‘Reactome_2016.RDS’ | link | Reactome db, free, curated and peer-reviewed pathway database. v.2016 |
GO db | ‘GO.DB.RDS’ | link | Gene Ontology database, annotation for functions of genes |
Kegg | ‘KEGG_2019_Human.RDS’ | link | Kyoto Encyclopedia of Genes and Genomes database, high-level annotations from molecular-level information, v.2019 |
library(fgsea)
## Loading required package: Rcpp
= readRDS(url("https://figshare.com/ndownloader/files/30593259"))
database::enrich(results_NMF$T_matrix,pathways = database,ICAbased = FALSE, fdr = FALSE) gedepir
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## pathway padj pval ES
## 7 Kidney.Normal_cell.Neutrophil 0.2837631 0.03547038 0.9544087
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## pathway padj pval ES
## 8 Brain.Normal_cell.Lake_et_al.Science.Ex8 0.1506063 0.01882578 0.9107322
gedepir simplifies the definition of an end-to-end analysis pipeline with a set of base functions that are connected through the pipes syntax %>%
used in magrittr, tidyr or dplyr packages.
library(magrittr)
= D %>%
results_NMFrun_norm(method = "RPM") %>%
run_trans(method = "linear") %>%
run_featsel(method = "cv1000") %>%
run_deconv(method = "ICA", ,k= 9)
::enrich(results_NMF$T_matrix,
gedepir
database,ICAbased = FALSE,
fdr = FALSE)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## pathway padj pval ES
## 2 Blood.Normal_cell.Circulating_fetal_cell 0.08887257 0.002693108 0.9962321
## 32 Blood.Normal_cell.CD1Cplus_B_dendritic_cell 0.23834007 0.014444852 0.9849699
##
## [[4]]
## pathway padj pval ES
## 19 Embryo.Normal_cell.Primitive_endoderm_cell 0.4041927 0.01224826 0.9614836
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL