Tutorial 2 - Inferring Protein Activity from scRNA-seq data from multiple cell populations with the meta-VIPER approach
This tutorial explores the use of metaVIPER to assess protein activity by integrative analysis using multiple gene regulatory networks. metaVIPER is particularly useful in real-world scenarios to infer protein activity in a tissue-independent manner by using multiple, non-tissue-matched interactomes, assuming that transcriptional targets of each protein can be recapitulated by one or more available interactomes. The viper
function allows running metaVIPER using both
aREA and NaRnEA enrichment methods. For additional details, please refer to Ding et al., 2016. After a brief description of the installation procedure and the modules needed, this notebook is organized in the following sections.
Table of Contents
Install PyVIPER
Install pyviper
from PyPI using pip. Alternatively, refer to the README in the current GitHub to install from the local directory.
[1]:
# !pip install viper-in-python
Import modules
Load pyviper
and additional modules required used in this tutorial.
[2]:
import pyviper
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*") # for jit decorator issue with sc.pp.neighbors (09/30/2023)
Setup the path to the data.
[3]:
data_location = "https://zenodo.org/records/10059791/files/"
We will import the data directly from the above link. Optionally, users are allowed to download the files from here and to load them from their local computer. If the second option is preferred, the following lines can be uncommented and run to download the data on Unix systems and set the working directory to the path where the data will be stored.
[4]:
#!mkdir Tutorial_2_data_dir
#%cd Tutorial_2_data_dir
#!curl -O "{data_location}Tutorial_2_counts_mixed_4632.tsv.gz" -O "{data_location}Tutorial_2_metadata_mixed_4632.tsv.gz"
#!curl -O "{data_location}B-cell-net.tsv" -O "{data_location}ductal-2-net.tsv" -O "{data_location}fibroblast-net.tsv" -O "{data_location}stellate-net.tsv"
#data_location = !pwd
#data_location = data_location[0]+"/" # new value for data if working on data downloaded locally
Step 1. Load a gene expression matrix and associated metadata
Load the gene expression matrix (UMIs) and store it into an AnnData object to enable interoperability with scanpy. Cells used in this tutorial were sampled from scRNA-seq data published in Peng et al., 2019.
[5]:
gExpr_path = data_location + "Tutorial_2_counts_mixed_4632.tsv.gz" # path to gene expression matrix (UMI counts)
adata_gExpr = pd.read_csv(gExpr_path, sep="\t") # read from remote
adata_gExpr = sc.AnnData(adata_gExpr) # convert to AnnData object
Display matrix dimensions (cells x genes)
[6]:
adata_gExpr
[6]:
AnnData object with n_obs × n_vars = 4632 × 24005
Load cell-associated metadata.
[7]:
metadata_path = data_location + "Tutorial_2_metadata_mixed_4632.tsv.gz" # path to cells metadata
cells_metadata = pd.read_csv(metadata_path, sep="\t") # load it
Store the metadata in the adata_gExpr
object as observation annotation.
[8]:
adata_gExpr.obs = pd.merge(adata_gExpr.obs, cells_metadata, how="left",left_index=True, right_index=True) # store cell-specific metadata as annotation observation
Display the observation annotation from the AnnData object.
[9]:
adata_gExpr.obs.head()
[9]:
Cell_Type | |
---|---|
T1_AACCATGCACAACTGT | Ductal cell type 2 |
T1_AAGACCTAGTCATGCT | Ductal cell type 2 |
T1_ACATACGAGACTCGGA | Ductal cell type 2 |
T1_ACCAGTATCTTGCAAG | Ductal cell type 2 |
T1_ACGATGTTCACGAAGG | Ductal cell type 2 |
The observation annotations include the annotated cell type for each single-cell. List the cell types.
[10]:
adata_gExpr.obs.groupby('Cell_Type').size().reset_index(name='n') # show cell types and number of cells for each type in AnnData
[10]:
Cell_Type | n | |
---|---|---|
0 | B cell | 1000 |
1 | Ductal cell type 2 | 1455 |
2 | Fibroblast cell | 1277 |
3 | Stellate cell | 900 |
Data contain 4 cells types: B cell, (malignant) ductal cells of type 2, fibroblasts, and stellate cells (myofibroblast-like cells).
Step 2. Preprocess and generate a gene expression signature at the single-cell level
The UMI matrix can be processed using the standard scanpy
preprocessing workflow. For a more detailed explanation of quality control steps, refer to the preprocessing tutorials by scanpy or Seurat.
[11]:
sc.pp.filter_cells(adata_gExpr, min_genes=200) # filter out cells with <200 genes expressed
sc.pp.filter_genes(adata_gExpr, min_cells=3) # filter out genes that are detected in <3 cells
Display matrix dimensions post-filtering (cells x genes)
[12]:
adata_gExpr
[12]:
AnnData object with n_obs × n_vars = 4632 × 20197
obs: 'Cell_Type', 'n_genes'
var: 'n_cells'
Normalize the gene expression matrix by the total number of UMIs to have 10,000 UMIs per cells and log-transform the normalized counts to make each gene expression comparable across cells. Then, store the log-normalized counts in the .raw
attribute of the AnnData
object
[13]:
sc.pp.normalize_total(adata_gExpr, target_sum=1e4) # normalize counts
sc.pp.log1p(adata_gExpr) # log-transform the counts
sc.pp.calculate_qc_metrics(adata_gExpr, percent_top=None, log1p=False, inplace=True) # compute some statistics, e.g. total_counts, n_cells_by_counts etc
adata_gExpr.raw = adata_gExpr # store log-normalized counts to .raw Attribute for potential future use
Generate a gene expression signature on a single-cell basis by scaling each gene to unit variance. This scaled matrix will be used as an input to viper
to compute the protein activity matrix. Several approaches exist for the purpose. These include the calculation of a signature with respect to an external reference or the use of models such as SCTransform. We will employ the simplest approach for demonstrative
purposes.
[14]:
# sc.pp.regress_out(adata_gExpr, 'total_counts') # regress out the effect of the total number of UMIs
sc.pp.scale(adata_gExpr, max_value=10) # standardize gene expression and clip max values to 10
[15]:
adata_gExpr
[15]:
AnnData object with n_obs × n_vars = 4632 × 20197
obs: 'Cell_Type', 'n_genes', 'n_genes_by_counts', 'total_counts'
var: 'n_cells', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'mean', 'std'
uns: 'log1p'
Display a chunk of the scaled (i.e. the signature) matrix.
[16]:
adata_gExpr[1:5,1:7].to_df()
[16]:
AP006222.2 | RP11-206L10.3 | RP11-206L10.2 | RP11-206L10.9 | LINC00115 | FAM41C | |
---|---|---|---|---|---|---|
T1_AAGACCTAGTCATGCT | -0.373427 | -0.106789 | -0.06304 | -0.082578 | -0.187788 | -0.126722 |
T1_ACATACGAGACTCGGA | -0.373427 | -0.106789 | -0.06304 | -0.082578 | -0.187788 | -0.126722 |
T1_ACCAGTATCTTGCAAG | -0.373427 | -0.106789 | -0.06304 | -0.082578 | -0.187788 | -0.126722 |
T1_ACGATGTTCACGAAGG | -0.373427 | -0.106789 | -0.06304 | -0.082578 | -0.187788 | -0.126722 |
Step 3. Load multiple ARACNe-inferred gene regulatory networks
Load 4 lineage-specific ARACNe3-inferred gene regulatory networks. These networks were generated from B cells, malignant ductal, fibroblasts and stellate cells, respectively. Refer to the ARACNe3 and ARACNe-AP manuscripts for additional information. The time required is approximately 1 s per network (when loading from local) and 40 s per network (when loading from the remote url).
[17]:
b_net_path = data_location + "B-cell-net.tsv" # path to ARACNe network for B cells
d2_net_path = data_location + "ductal-2-net.tsv" # path to ARACNe network for ductal 2
f_net_path = data_location + "fibroblast-net.tsv" # path to ARACNe network for fibroblasts
s_net_path = data_location + "stellate-net.tsv" # path to ARACNe network for stellate cells
bcell_net = pyviper.Interactome('b_cell',net_table=b_net_path) # load interactome for B cells
ductal_2_net = pyviper.Interactome('ductal_2',net_table=d2_net_path) # load interactome for ductal cell type 2
fibroblast_net = pyviper.Interactome('fibroblast',net_table=f_net_path) # load interactome for fibroblasts
stellate_net = pyviper.Interactome('stellate',net_table=s_net_path) # load interactome for stellate cells
Each regulatory network loaded as an instance of class Interactome
:
Due to the similarity between regulatory networks of fibroblasts and stellate cells, we integrate their corresponding interactomes, thus generating a consensus network.
[18]:
fibroblast_net.integrate(stellate_net, normalize_likelihoods=False)
Optionally, save the integrated interactome as a .tsv file if needed for later use, by uncommenting the following line. Possible output formats are: “csv”, “tsv” and “pkl”.
[19]:
#fibroblast_net.save("fibroblast_integrated.tsv", output_type="tsv")
Filter out targets in the Interactomes
that are not present in the gene expression matrix
[20]:
bcell_net.filter_targets(adata_gExpr.var_names)
ductal_2_net.filter_targets(adata_gExpr.var_names)
fibroblast_net.filter_targets(adata_gExpr.var_names)
Removed 38885 targets.
Removed 11073 targets.
Removed 84211 targets.
Prune each regulon to have exactly 100 targets (this step can be omitted when setting enrichment=narnea
in pyviper
)
[21]:
bcell_net.prune(max_targets=200,eliminate=True)
ductal_2_net.prune(max_targets=200,eliminate=True)
fibroblast_net.prune(max_targets=200,eliminate=True)
Removed 2837239 targets.
Removed 283 regulators.
Removed 2118839 targets.
Removed 10 regulators.
Removed 6427448 targets.
Removed 64 regulators.
Convert the gene expression signature into a protein activity matrix using pyviper (metaVIPER approach)
Convert previously computed gene expression signature into protein activity by using multiple interactomes simultaneousely via the metaVIPER approach. Use enrichment="narnea"
with 1 core and store the result in an AnnData
object (default). See Tutorial 1 for details on the available modalities to compute enrichment.
[22]:
adata_PA = pyviper.viper(gex_data=adata_gExpr, # gene expression signature
interactome=[bcell_net, ductal_2_net, fibroblast_net], # list of interactomes
enrichment = "narnea",
eset_filter=False,
njobs=1,
verbose=True)
Preparing the association scores
Computing regulons enrichment with NaRnEa
0/3 networks complete.
reordering genes
Calculating DES...
Calculating UES...
Calculating NES...
Calculating PES...
1/3 networks complete.
reordering genes
Calculating DES...
Calculating UES...
Calculating NES...
Calculating PES...
2/3 networks complete.
reordering genes
Calculating DES...
Calculating UES...
Calculating NES...
Calculating PES...
3/3 networks complete.
Integrating results
The output AnnData
object stores a matrix collecting the Normalized Enrichment Scores (NES) and a matrix of Proportional Enrichment Scores (PES) with cells on rows and regulatory proteins on columns. See Tutorial 1 and Griffin et al., 2022 for more details .
[23]:
adata_PA
[23]:
AnnData object with n_obs × n_vars = 4632 × 6659
obs: 'Cell_Type', 'n_genes', 'n_genes_by_counts', 'total_counts'
uns: 'gex_data'
layers: 'pes'
Step 4. Analyze single-cells at the Protein Activity level
We present some instructive analyses at the protein activity level. Start by running Principal Components Analysis (PCA) on the PES matrix, to reduce the dimensionality of the protein activity space. Use the PCA function avaiable through pyviper.tl
, a module that provides several wrappers to scanpy.tl
to simplify the transformation of protein activity data. In contrast to Tutorial 1, all recovered transcription factors, cofactors, signaling and surface proteins will be employed in the
analysis.
[24]:
pyviper.tl.pca(adata_PA, layer="pes", filter_by_feature_groups=None, svd_solver='arpack', random_state=0)
Compute the neighbors graph of cells using the PCA representation of the protein activity matrix. For sake of simplicity, we set 10 nearest neightbors and 30 principal components. Typically, these parameters need tuning.
[25]:
sc.pp.neighbors(adata_PA, n_neighbors=10, n_pcs=30, metric="correlation", random_state=0)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Generate a UMAP embedding of the previously computed graph.
[26]:
sc.tl.umap(adata_PA, random_state=0)
Display the 2-dimensional UMAP embedding and color by cell type.
[27]:
sc.set_figure_params(dpi=90)
pyviper.pl.umap(adata_PA,color="Cell_Type", palette="Set3", add_outline=True)
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:369: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:379: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
As an example, show the activity of selected markers for of ductal cells and fibroblasts in a dotplot.
[28]:
protein_markers_dict = {
'B cell': 'CD19',
'Ductal cell type 2': ['KRT19', 'TSPAN8'],
'Fibroblast cell': 'VIM',
'Stellate cell': 'GEM'
}
sc.set_figure_params(dpi=100)
pyviper.pl.dotplot(adata_PA, var_names=protein_markers_dict, layer="pes", groupby='Cell_Type', cmap="RdBu_r", vcenter=0)
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_dotplot.py:749: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap', 'norm' will be ignored
dot_ax.scatter(x, y, **kwds)
Identify differentially active proteins on a cell type basis using the Wilcoxon Rank sum test and display a heatmap.
[29]:
sc.settings.verbosity = 2 # reduce the verbosity
sc.tl.rank_genes_groups(adata=adata_PA, groupby='Cell_Type', use_raw=False, layer="pes", n_genes=20, method='wilcoxon')
sc.pl.rank_genes_groups_heatmap(adata_PA, vmin=-25, vcenter=0, vmax=25, cmap='RdBu_r')
ranking genes
finished (0:00:03)
WARNING: dendrogram data not found (using key=dendrogram_Cell_Type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
using 'X_pca' with n_pcs = 50
Storing dendrogram info using `.uns['dendrogram_Cell_Type']`
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2
self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2
self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2
self.stats[group_name, 'logfoldchanges'] = np.log2(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/tools/_rank_genes_groups.py:420: RuntimeWarning: invalid value encountered in log2
self.stats[group_name, 'logfoldchanges'] = np.log2(
Pathway enrichment analysis
Perform an enrichment analysis to identify biological functions associated to malignant cells by leveraging functional gene sets from the MSigDB collection. Through its load
module,pyviper
allows the user to choose between several MSigDB collections, including curated gene sets from online pathway databases (‘c2’), ontology (‘c5’) and oncogenic signature (‘c6’) gene sets and immunologic signature gene sets (‘c7’).
Load the ‘hallmark gene sets’ as an object of class Interactome
and retain only the targets in common with the protein activity matrix.
[30]:
hallmarks_MSigDB = pyviper.load.msigdb_regulon('h') # load hallmark of cancer MSigDB as an interactome
hallmarks_MSigDB.filter_targets(adata_PA.var_names)
Removed 3587 targets.
Perform the enrichment analysis using the path_enr
function from pyviper.tl
.
[31]:
pathways_enrichment = pyviper.tl.path_enr(adata_PA, pathway_interactome=hallmarks_MSigDB, enrichment="narnea", verbose="False", store_input_data=False)
Removed 0 targets.
Preparing the association scores
Computing regulons enrichment with NaRnEa
reordering genes
Calculating DES...
Calculating UES...
Calculating NES...
Calculating PES...
Focus on malignant ductal cells and compute the mean NES for each gene set.
[32]:
pathways_enrichment = pathways_enrichment[pathways_enrichment.obs['Cell_Type'] == 'Ductal cell type 2'].copy()
Integrated_hallmarks_NES = pathways_enrichment.to_df().mean(axis=0)
Convert NES for each hallmark into p values and adjust for multiple hypothesis testing with the Benjamini-Hochberg procedure.
[33]:
Integrated_hallmarks_pvals = pyviper._pp._nes_to_pval_df(Integrated_hallmarks_NES, lower_tail=False, adjust=True)
Compute the -log10(P value) and display the top 5 enriched pathways.
[34]:
Integrated_hallmarks_mlog10 = -np.log10(Integrated_hallmarks_pvals) # convert to -log10(P adjust)
top_pathways = Integrated_hallmarks_mlog10.sort_values(ascending=True)[-5:] # sort by -log10(P value)
ax = top_pathways.plot(kind='barh', color='#649dcd', legend=False, fontsize=10, hatch='///', edgecolor='black')
plt.title("Top hallmark gene sets")
plt.xlabel("-log$_{10}$(P adjusted)")
plt.show()
Key takeaways
This Tutorial describes how to generate a gene expression signature by exploiting the inter-operatibility between pyviper
and scanpy
and showed how to infer protein activity in single cells from multiple populations with the metaVIPER approach, that leverages multiple lineage-specific regulatory networks. We conclude this Tutorial by presenting another key module used to perform pathway enrichment analysis.