Tutorial 3 - Generating metacells for reverse-engineering of ARACNe gene regulatory networks
This tutorial explores different approaches to generate metacells from scRNA-seq data. Metacells are used as a means to counteract sparsity in single-cell transcriptomics and improve gene regulatory network (GRN) inference, for instance using the ARACNe3 and ARACNe-AP algorithms for GRN reconstruction by mutual information estimation. After a brief description of the pyVIPER installation procedure and the modules needed, this notebook is organized in the following sections.
Table of Contents
Install Pyviper
Install pyviper
from PyPI using pip. Alternatively, refer to the README in the current GitHub to install from the local directory.
[1]:
#!pip install viper-in-python
Import modules
Load pyviper
and additional modules required used in this tutorial.
[2]:
import pyviper
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rc_context
from sklearn.metrics import silhouette_samples
import warnings
warnings.filterwarnings("ignore", message=".*The 'nopython' keyword.*") # for jit decorator issue with sc.pp.neighbors (09/30/2023)
Step 1. Load a gene expression matrix and associated metadata
Load the same gene expression matrix (UMIs) used in Tutorial 2 and store it into an AnnData object to enable interoperability with scanpy. Cells used in this tutorial were sampled from scRNA-seq data published in Peng et al., 2019.
Display matrix dimensions (cells x genes)
[3]:
gExpr_url = "https://zenodo.org/records/10059791/files/Tutorial_2_counts_mixed_4632.tsv.gz" # path to gene expression matrix (UMI counts)
adata_gExpr = pd.read_csv(gExpr_url, sep="\t") # read from remote
adata_gExpr = sc.AnnData(adata_gExpr) # convert to AnnData object
adata_gExpr
[3]:
AnnData object with n_obs × n_vars = 4632 × 24005
Load cell-associated metadata.
[4]:
metadata_url = "https://zenodo.org/records/10059791/files/Tutorial_2_metadata_mixed_4632.tsv.gz" # path to cells metadata
cells_metadata = pd.read_csv(metadata_url, sep="\t") # load it
Store the metadata in the adata_gExpr
object and display the observation annotation.
[5]:
adata_gExpr.obs = pd.merge(adata_gExpr.obs, cells_metadata, how="left",left_index=True, right_index=True) # store cell-specific metadata as annotation observation
adata_gExpr.obs.head()
[5]:
Cell_Type | |
---|---|
T1_AACCATGCACAACTGT | Ductal cell type 2 |
T1_AAGACCTAGTCATGCT | Ductal cell type 2 |
T1_ACATACGAGACTCGGA | Ductal cell type 2 |
T1_ACCAGTATCTTGCAAG | Ductal cell type 2 |
T1_ACGATGTTCACGAAGG | Ductal cell type 2 |
The observation annotations include the patient from which each cell was sequenced and the annotated cell type. List the number of cells for each cell types.
[6]:
adata_gExpr.obs.groupby('Cell_Type').size().reset_index(name='n') # show cell types and number of cells for each type in AnnData
[6]:
Cell_Type | n | |
---|---|---|
0 | B cell | 1000 |
1 | Ductal cell type 2 | 1455 |
2 | Fibroblast cell | 1277 |
3 | Stellate cell | 900 |
Step 2. Perform basic preprocessing and clustering
Perform some standard preprocessing. The UMI matrix can be processed using the standard Scanpy preprocessing workflow. Since the quality of the provided data was pre-assessed and found to be high, cell filtering will be minimal. For a more detailed explanation of quality control steps, refer to the preprocessing tutorials by Scanpy or Seurat.
[7]:
sc.pp.calculate_qc_metrics(adata_gExpr, inplace=True)
sc.pp.filter_cells(adata_gExpr, min_genes=200) # filter out cells with <200 genes expressed
sc.pp.filter_genes(adata_gExpr, min_counts=10) # filter out genes that are present with <10 UMIs
#adata_gExpr = calcMitoRiboPercent(adata_gExpr)
#adata_gExpr = filterCellsHighMito(adata_gExpr,0.2)
# Store UMI in .raw
adata_gExpr.raw = adata_gExpr
sc.pp.normalize_total(adata_gExpr, inplace=True,target_sum=1e4)
sc.pp.log1p(adata_gExpr)
sc.pp.highly_variable_genes(adata_gExpr, flavor="seurat", n_top_genes=3000, inplace=True)
sc.pp.scale(adata_gExpr)
scRNA-seq data are known to exhibit high sparsity, typically >80-90%. Compute the sparsity of cells labelled as Fibroblasts as the percentage of zero-entries in the dataset.
[8]:
Fibroblast_matrix = adata_gExpr[adata_gExpr.obs['Cell_Type'] == 'Fibroblast cell']
sparsity = np.mean(Fibroblast_matrix.raw.X == 0)
del Fibroblast_matrix
print("Sparsity (fibroblasts, single cells): " + str(round(sparsity,2)*100) + "%")
Sparsity (fibroblasts, single cells): 85.0%
3. Generate Metacells
Compute a distance matrix between each pair of cells on the PCA-projected space in the dataset using the corr_distance
function. Any distance matrix can be used for metacell generation by setting ‘dist_slot’ to the corresponding value in the repr_metacells
function (see below).
[9]:
sc.tl.pca(adata_gExpr, svd_solver='arpack', random_state=0)
pyviper.pp.corr_distance(adata_gExpr) # compute correlation distance
Generate a set of metacell matrices for each cellular population (cluster), as represented by the ‘Cell_Type’ annotation in .obs. The metacell generation function allows several combination of inputs. Specifically, exactly two of the following parameters must be set: 1. ‘size’, 2a. ‘min_median_depth’ or 2b. ‘n_cells_per_metacell’, 3a. ‘perc_data_to_use’ or 3b. ‘perc_incl_data_reused’.
Please notice that 2a-2b and 3a-3b are mutually exclusive, i.e. they cannot be set together.
Example 1: generate 400 metacells per cell population requiring an attempt minimum depth of 12,000 UMI/metacell.
[10]:
n_metacells=400
min_target_depth=12000
pyviper.pp.repr_metacells(adata_gExpr,
counts=None,
pca_slot='X_pca',
dist_slot='corr_dist',
size=n_metacells,
min_median_depth=min_target_depth,
clusters_slot="Cell_Type",
key_added="metacells_approach_1") # size=n_meta, n_cells_per_metacell=k, clusters_slot="putative_tumor_infercnv_output"
cluster metacells: 25%|██▌ | 1/4 [00:00<00:02, 1.17it/s]/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
arrmean = um.true_divide(arrmean, div, out=arrmean,
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
ret = ret.dtype.type(ret / rcount)
cluster metacells: 100%|██████████| 4/4 [00:03<00:00, 1.14it/s]
While the generation of the minimum target depth cannot be guaranteed in all cases, pyVIPER attempts to satisfy this request by tuning the number of nearest neighbors used in metacell generation.
Display adata_gExpr
.
[11]:
adata_gExpr
[11]:
AnnData object with n_obs × n_vars = 4632 × 18160
obs: 'Cell_Type', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_genes'
var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'log1p', 'hvg', 'pca', 'metacells_approach_1_B cell', 'metacells_approach_1_Ductal cell type 2', 'metacells_approach_1_Fibroblast cell', 'metacells_approach_1_Stellate cell'
obsm: 'X_pca'
varm: 'PCs'
obsp: 'corr_dist'
4 new matrices, one for each cell type, were stored as Pandas dataframe in adata.uns
.
Display the metacell matrix for the population of Ductal cell type 2.
[12]:
adata_gExpr.uns["metacells_approach_1_Ductal cell type 2"]
[12]:
AL627309.1 | AP006222.2 | RP11-206L10.3 | RP11-206L10.2 | RP11-206L10.9 | LINC00115 | FAM41C | RP11-54O7.3 | SAMD11 | NOC2L | ... | RP11-332K15.1 | RP11-157E21.1 | DMRT2 | RP11-309M23.1 | NLRP7 | IGLL1 | EGFR-AS1 | RSPO2 | KCNC2 | AGTR2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
metacell_0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
metacell_395 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_396 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_397 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_398 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
metacell_399 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
400 rows × 18160 columns
Each metacell matrix can be used as input to ARACNe3 to reverse engineer the corresponding GRN which can then be provided as an input to VIPER. Each metacell matrix is associated with a dictionary collecting relevant metacell statistics.
Display metacell-associated statistics for the population of fibroblasts.
[13]:
attrs_dict = adata_gExpr.uns["metacells_approach_1_Fibroblast cell"].attrs
attrs_df = pd.DataFrame.from_dict(attrs_dict, orient='index', columns=['Values']) # convert to df
attrs_df
[13]:
Values | |
---|---|
size | 400.000000 |
n_cells_per_metacell | 2.000000 |
min_median_depth | 12000.000000 |
median_depth | 18104.000000 |
perc_total_samples_used | 58.418168 |
perc_included_samples_used_mult_times | 6.970509 |
mean_n_times_samples_used | 1.072386 |
stdev_n_times_samples_used | 0.269274 |
The report is highly informative. The metacell matrix contains 400 metacells, each generated by aggregation of 2 nearest neighbors after requesting an “attempt” minimum target depth of 12,000 UMI/metacell. The median depth for the inferred metacells is ~18,000 UMI/metacell. 58% of the original single-cells from the fibroblasts population was used for metacell generation, with <7% of cells being aggregated in more than 1 metacell.
The data sparsity is expected to be decreased in the metacells.
[14]:
sparsity = np.mean(np.array(adata_gExpr.uns["metacells_approach_1_Fibroblast cell"]) == 0)
print("Sparsity (fibroblasts, metacells example 1): " + str(round(sparsity,2)*100) + "%")
Sparsity (fibroblasts, metacells example 1): 77.0%
Display the statistics for metacells of another cluster (B cells).
[15]:
attrs_dict = adata_gExpr.uns["metacells_approach_1_B cell"].attrs
attrs_df = pd.DataFrame.from_dict(attrs_dict, orient='index', columns=['Values']) # convert to df
attrs_df
[15]:
Values | |
---|---|
size | 400.000000 |
n_cells_per_metacell | 3.000000 |
min_median_depth | 12000.000000 |
median_depth | 11097.500000 |
perc_total_samples_used | 81.800000 |
perc_included_samples_used_mult_times | 30.929095 |
mean_n_times_samples_used | 1.466993 |
stdev_n_times_samples_used | 0.907459 |
As mentioned before, repr_metacells
allows several combinations of parameters. Let’s consider another combination.
Example 2: generate 400 metacells per cluster such that each metacell should incorporate 20 single-cells (
n_cells_per_metacell=20
;min_median_depth=None
).
[16]:
n_cells_per_metacell=20
pyviper.pp.repr_metacells(adata_gExpr,
counts=None,
pca_slot='X_pca',
dist_slot='corr_dist',
size=400,
n_cells_per_metacell=n_cells_per_metacell,
min_median_depth=None,
clusters_slot="Cell_Type",
key_added="metacells_approach_2")
cluster metacells: 100%|██████████| 4/4 [00:07<00:00, 1.79s/it]
Display metacell-associated statistics for the Fibroblast population. The median depth is expected to be much higher than before, due to the larger number of cells per metacells specified.
[17]:
attrs_dict = adata_gExpr.uns["metacells_approach_2_Fibroblast cell"].attrs
attrs_df = pd.DataFrame.from_dict(attrs_dict, orient='index', columns=['Values']) # convert to df
attrs_df
[17]:
Values | |
---|---|
size | 400.000000 |
n_cells_per_metacell | 20.000000 |
min_median_depth | NaN |
median_depth | 196668.500000 |
perc_total_samples_used | 97.963978 |
perc_included_samples_used_mult_times | 89.928058 |
mean_n_times_samples_used | 6.394884 |
stdev_n_times_samples_used | 4.219301 |
Data sparsity is also strongly decreased due to the higher number of nearest neighbors being aggregated.
[18]:
sparsity = np.mean(np.array(adata_gExpr.uns["metacells_approach_2_Fibroblast cell"]) == 0)
print("Sparsity (fibroblasts, metacells example 2): " + str(round(sparsity,2)*100) + "%")
Sparsity (fibroblasts, metacells example 2): 40.0%
Example 3: Generate 400 metacells and allow for only the 20% of the cells to be incorporated in more than 1 metacell (
perc_incl_data_reused=20
)
[19]:
perc_incl_data_reused=20
pyviper.pp.repr_metacells(adata_gExpr,
counts=None,
pca_slot='X_pca',
dist_slot='corr_dist',
size=400,
perc_incl_data_reused=perc_incl_data_reused,
min_median_depth=None,
clusters_slot="Cell_Type",
key_added="metacells_approach_3")
cluster metacells: 100%|██████████| 4/4 [00:04<00:00, 1.08s/it]
Display metacell statistics for the B cell population: the percentage of included cells is now lower than in the example 1. This approach can be useful if the goal is to generate metacells that are independent.
[20]:
attrs_dict = adata_gExpr.uns["metacells_approach_3_B cell"].attrs
attrs_df = pd.DataFrame.from_dict(attrs_dict, orient='index', columns=['Values']) # convert to df
attrs_df
[20]:
Values | |
---|---|
size | 400.000000 |
n_cells_per_metacell | 2.000000 |
min_median_depth | NaN |
median_depth | 7523.000000 |
perc_total_samples_used | 69.800000 |
perc_included_samples_used_mult_times | 12.034384 |
mean_n_times_samples_used | 1.146132 |
stdev_n_times_samples_used | 0.436669 |
Comparing metacell and original single-cells
As a simple and quick sanity check, we compare the original cells and the derived cell metacells. Extract all the metacell matrices generated with the first approach.
[21]:
# Extract and concatenate all population-specific metacell matrices
B_meta=adata_gExpr.uns["metacells_approach_1_B cell"]
Ductal_meta=adata_gExpr.uns["metacells_approach_1_Ductal cell type 2"]
Fibroblast_meta=adata_gExpr.uns["metacells_approach_1_Fibroblast cell"]
Stellate_meta=adata_gExpr.uns["metacells_approach_1_Stellate cell"]
# rename all indexes before merging
B_meta.index = "B_"+B_meta.index
Ductal_meta.index = "Ductal_"+Ductal_meta.index
Fibroblast_meta.index = "Fibroblast_"+Fibroblast_meta.index
Stellate_meta.index = "Stellate_"+Stellate_meta.index
# concatenate all dataframes
metacells_df = pd.concat([B_meta, Ductal_meta, Fibroblast_meta, Stellate_meta], ignore_index=False)
Annotate each metacell with its cell type.
[22]:
# Dictionary mapping index prefixes to cell types
cell_type_map = {
'B': 'B cell',
'Ductal': 'Ductal cell type 2',
'Fibroblast': 'Fibroblast cell',
'Stellate': 'Stellate cell'
}
# Generate the "Cell_Type" column based on the index
cell_types = [cell_type_map[index.split('_')[0]] for index in metacells_df.index]
# metacell observation dataframe
observation_df = pd.DataFrame({'Cell_Type': cell_types}, index=metacells_df.index)
Generate an AnnData object from metacells.
[23]:
adata_metacell = anndata.AnnData(X=metacells_df, obs=observation_df)
Preprocess the concatenated metacells for PCA and UMAP visualization.
[24]:
adata_metacell.raw = adata_metacell
sc.pp.normalize_total(adata_metacell, inplace=True,target_sum=1e4)
sc.pp.log1p(adata_metacell)
sc.pp.highly_variable_genes(adata_metacell, flavor="seurat", n_top_genes=3000, inplace=True)
sc.pp.scale(adata_metacell)
sc.tl.pca(adata_metacell, svd_solver='arpack', random_state=0)
sc.pp.neighbors(adata_metacell, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata_metacell)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Visualize the original single-cells in PCA and UMAP embeddings.
[25]:
sc.pp.neighbors(adata_gExpr, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata_gExpr)
fig, axs = plt.subplots(1, 2, figsize=(10, 3.5))
with rc_context({'figure.figsize': (3.5, 3.5)}):
sc.pl.pca(adata_gExpr, color="Cell_Type", use_raw=False, palette="Set2", title="PCA by Cell Types", add_outline=True, ax=axs[0], show=False)
with rc_context({'figure.figsize': (3.5, 3.5)}):
sc.pl.umap(adata_gExpr, color="Cell_Type", use_raw=False, palette="Set2", title="UMAP by Cell Types", add_outline=True, ax=axs[1], show=False)
plt.suptitle("Gene expression - single cells")
plt.tight_layout()
plt.show()
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:369: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:379: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:369: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:379: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
Visualize the cell population-specific metacells in PCA and UMAP embeddings.
[26]:
fig, axs = plt.subplots(1, 2, figsize=(10, 3.5))
with rc_context({'figure.figsize': (3.5, 3.5)}):
sc.pl.pca(adata_metacell, color="Cell_Type", use_raw=False, palette="Set2", title="PCA by Cell Types", add_outline=True, ax=axs[0], show=False)
with rc_context({'figure.figsize': (3.5, 3.5)}):
sc.pl.umap(adata_metacell, color="Cell_Type", use_raw=False, palette="Set2", title="UMAP by Cell Types", add_outline=True, ax=axs[1], show=False)
plt.suptitle("Gene expression - metacells")
plt.tight_layout()
plt.show()
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:369: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:379: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:369: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:379: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
ax.scatter(
/Users/lucazanella7/mambaforge/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
cax = scatter(
The cell type distribution is consistent between single-cells and metacells, with stellate cells and fibroblasts partially overlapping in the PCA as expected, and all the three lineages been separated along the directions of maximal variance.
Key takeaways
This Tutorial describes how to generate metacells with pyVIPER. Metacells can be used for pseudo-bulk analysis of specific cellular populations or as input to ARACNe, when the original cell depth is not adequate to robustly infer regulatory networks (typically below 10,000 UMI/cell). The repr_metacell
function computes several metacell-associated statistics and allows the users to specify several parameter combinations, among which the current Tutorial provides three examples.