pyviper package
pyviper.aREA
- pyviper.aREA(gex_data, interactome, layer=None, eset_filter=False, min_targets=30, mvws=1, verbose=True)
Infers normalized enrichment scores (NES) from gene expression data using the Analytical Ranked Enrichment Analysis (aREA)[1] function.
It is the original basis of the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm.
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unPruned network and then prune to ensure the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data. A consistent number of targets allows regulators to have NES scores that are comparable to one another. A regulator that has more targets than others will have “boosted” NES scores, such that they cannot be compared to those with fewer targets.
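The effect of the order of operations can be sketched in plain Python. This is only an illustration under stated assumptions: the network is modeled as a dict of regulator to {target: likelihood}, and filter_targets_dict and prune_dict are hypothetical stand-ins, not the Interactome methods themselves.

```python
def filter_targets_dict(network, features):
    """Drop targets that are not among the measured features."""
    allowed = set(features)
    return {
        regulator: {t: w for t, w in targets.items() if t in allowed}
        for regulator, targets in network.items()
    }


def prune_dict(network, max_targets):
    """Keep only the max_targets highest-likelihood targets per regulator."""
    pruned = {}
    for regulator, targets in network.items():
        ranked = sorted(targets.items(), key=lambda kv: kv[1], reverse=True)
        pruned[regulator] = dict(ranked[:max_targets])
    return pruned


# Toy network (hypothetical values): regulator -> {target: likelihood}
network = {
    "TF1": {"G1": 0.9, "G2": 0.8, "G3": 0.7, "G4": 0.6},
    "TF2": {"G1": 0.5, "G3": 0.4, "G5": 0.3},
}
features = ["G1", "G3", "G4", "G5"]  # G2 is not measured

# Pruning first can leave regulators with unequal numbers of usable targets.
prune_then_filter = filter_targets_dict(prune_dict(network, 2), features)
# Filtering first guarantees every kept target exists in the data.
filter_then_prune = prune_dict(filter_targets_dict(network, features), 2)

sizes_bad = {r: len(t) for r, t in prune_then_filter.items()}
sizes_good = {r: len(t) for r, t in filter_then_prune.items()}
print(sizes_bad, sizes_good)
```

In this toy case, pruning before filtering leaves TF1 with fewer targets than TF2, while filtering before pruning gives both regulators the same regulon size.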
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy) or in a pd.DataFrame.
interactome – An object of class Interactome.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: False) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). The reason users may choose to use this threshold is because adequate targets are needed to accurately predict enrichment.
mvws (default: 1) – One of three options. (A) A number giving the exponent for the metaViper weights. These weights are only applicable when enrichment = ‘area’ and are not used when enrichment = ‘narnea’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same celltype with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different celltypes with different epigenetics). (B) The name of a column in gex_data that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
- Return type:
A pd.DataFrame containing NES values.
References
[1] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature genetics, 48(8), 838-847.
pyviper.NaRnEA
- pyviper.NaRnEA(gex_data, interactome, layer=None, eset_filter=False, min_targets=30, verbose=True)
Infers normalized enrichment scores (NES) and proportional enrichment scores (PES) from gene expression data using the Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA)[1] function.
NaRnEA is an updated basis for the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm.
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unPruned network and then prune to ensure the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data. A regulator that has more targets than others will have “boosted” NES scores, such that they cannot be compared to those with fewer targets.
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy) or in a pd.DataFrame.
interactome – An object of class Interactome.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: False) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). The reason users may choose to use this threshold is because adequate targets are needed to accurately predict enrichment.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
- Returns:
A dictionary of numpy.ndarray objects containing NES values (key: ‘nes’) and PES values (key: ‘pes’).
References
[1] Griffin, A. T., Vlahos, L. J., Chiuzan, C., & Califano, A. (2023). NaRnEA: An Information Theoretic Framework for Gene Set Analysis. Entropy, 25(3), 542.
pyviper.config
- pyviper.config.set_regulators_filepath(group, species, new_filepath)
Allows the user to use a custom list of regulatory proteins instead of the default ones within pyVIPER’s data folder.
- Parameters:
group – The group of regulatory proteins: one of “tfs”, “cotfs”, “sig” or “surf”.
species – The species to which the group of proteins belongs: “human” or “mouse”.
new_filepath – The new filepath that should be used to retrieve these sets of proteins.
- Return type:
None
- pyviper.config.set_regulators_species_to_use(species)
Allows the user to specify which species they are currently studying, so the correct sets of regulatory proteins will be used during analysis.
- Parameters:
species – The species to which the group of proteins belongs: “human” or “mouse”.
- Return type:
None
pyviper.Interactome
Create an Interactome object to contain the results of ARACNe. This object describes the relationship between regulator proteins (e.g. TFs and CoTFs) and their downstream target genes with mor (Mode Of Regulation, e.g. spearman correlation) indicating directionality and likelihood (e.g. mutual information) indicating weight of association. An Interactome object can be given to pyviper.viper along with a gene expression signature to generate a protein activity matrix with the VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm[1].
- Parameters:
name – A filepath to one’s disk to store the Interactome.
net_table (default: None) – Either (1) a pd.DataFrame containing four columns in this order: “regulator”, “target”, “mor”, “likelihood”; (2) a filepath to this pd.DataFrame stored as a .csv, .tsv or .pkl; or (3) a filepath to an Interactome object stored as a .pkl.
input_type (default: None) – Only relevant when net_table is a filepath. If None, the input_type will be inferred from the net_table. Otherwise, specify “csv”, “tsv” or “pkl”.
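As a rough illustration of the four-column net_table layout, the sketch below parses a small CSV with the standard library and groups the rows by regulator. The values are made up and read_net_table is a hypothetical helper for illustration; it is not how the Interactome class loads data.

```python
import csv
import io
from collections import defaultdict

# A tiny, made-up network table with the four expected columns, in order.
NET_TABLE_CSV = """regulator,target,mor,likelihood
TF1,G1,0.8,0.9
TF1,G2,-0.6,0.7
TF2,G1,0.3,0.5
"""


def read_net_table(handle):
    """Group rows into regulator -> {target: (mor, likelihood)}."""
    regulons = defaultdict(dict)
    for row in csv.DictReader(handle):
        regulons[row["regulator"]][row["target"]] = (
            float(row["mor"]),         # sign carries the direction of regulation
            float(row["likelihood"]),  # weight of the association
        )
    return dict(regulons)


regulons = read_net_table(io.StringIO(NET_TABLE_CSV))
for regulator, targets in regulons.items():
    print(regulator, targets)
```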
References
[1] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature genetics, 48(8), 838-847.
pyviper.load
- pyviper.load.TFs(species=None, path_to_tfs=None)
Retrieves a list of transcription factors (TFs).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_tfs (default: None) – When left as None, the path to TFs setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing TFs, one on each line.
- Return type:
A list containing transcription factors.
- pyviper.load.coTFs(species=None, path_to_cotfs=None)
Retrieves a list of co-transcription factors (coTFs).
- Parameters:
species (default: None) – When left as None, the species setting in pyviper.config will be used. Otherwise, manually specify “human” or “mouse”.
path_to_cotfs (default: None) – When left as None, the path to coTFs setting in pyviper.config will be used. Otherwise, manually specify a filepath to a .txt file containing coTFs, one on each line.
- Return type:
A list containing co-transcription factors.
- pyviper.load.human2mouse()
Retrieves the human to mouse translation pd.DataFrame from pyVIPER’s data folder. This dataframe contains six columns: human_symbol, mouse_symbol, human_ensembl, mouse_ensembl, human_entrez, mouse_entrez
- Return type:
A pd.DataFrame.
- pyviper.load.msigdb_regulon(collection)
Retrieves an object or a list of objects of class Interactome from pyviper’s data folder containing a set of pathways from the Molecular Signatures Database (MSigDB), downloaded from https://www.gsea-msigdb.org/gsea/msigdb. These collections can be from one of the following:
‘h’ for Hallmark gene sets. Coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes. ‘c2’ for curated gene sets. From online pathway databases, publications in PubMed, and knowledge of domain experts. ‘c5’ for ontology gene sets. Consists of genes annotated by the same ontology term. ‘c6’ for oncogenic signature gene sets. Defined directly from microarray gene expression data from cancer gene perturbations. ‘c7’ for immunologic signature gene sets. Represents cell states and perturbations within the immune system.
- Parameters:
collection – An individual string or a list of strings drawn from the following: [“h”, “c2”, “c5”, “c6”, “c7”], corresponding to the collections above.
- Return type:
An individual object or list of objects of class pyviper.interactome.Interactome.
pyviper.pl
- pyviper.pl.__get_stored_uns_data_and_prep_to_plot(adata, uns_data_slot, obsm_slot=None, uns_slot=None)
- pyviper.pl.pca(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.pca.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[‘X_pca’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[‘X_pca’].
**kwargs – Arguments to provide to the sc.pl.pca function.
- Return type:
A plot of Axes.
- pyviper.pl.umap(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.umap.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[‘X_umap’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[‘X_umap’].
**kwargs – Arguments to provide to the sc.pl.umap function.
- Return type:
A plot of Axes.
- pyviper.pl.tsne(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.tsne.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[‘X_tsne’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[‘X_tsne’].
**kwargs – Arguments to provide to the sc.pl.tsne function.
- Return type:
A plot of Axes.
- pyviper.pl.diffmap(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.diffmap.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[‘X_diffmap’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[‘X_diffmap’].
**kwargs – Arguments to provide to the sc.pl.diffmap function.
- Return type:
A plot of Axes.
- pyviper.pl.draw_graph(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.draw_graph.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[‘X_draw_graph_fa’] or adata.obsm[‘X_draw_graph_fr’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[‘X_draw_graph_fa’] or adata.obsm[‘X_draw_graph_fr’].
**kwargs – Arguments to provide to the sc.pl.draw_graph function.
- Return type:
A plot of Axes.
- pyviper.pl.spatial(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.spatial.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.uns[‘spatial’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.uns[‘spatial’].
**kwargs – Arguments to provide to the sc.pl.spatial function.
- Return type:
A plot of Axes.
- pyviper.pl.embedding(adata, *, basis, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.embedding.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
basis – The name of the representation in adata.obsm that should be used for plotting.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[basis].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[basis].
**kwargs – Arguments to provide to the sc.pl.embedding function.
- Return type:
A plot of Axes.
- pyviper.pl.embedding_density(adata, *, basis='umap', plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.embedding_density.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
basis (default: 'umap') – The name of the representation in adata.obsm that should be used for plotting.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’] on adata.obsm[basis].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’] on adata.obsm[basis].
**kwargs – Arguments to provide to the sc.pl.embedding_density function.
- Return type:
A plot of Axes.
- pyviper.pl.heatmap(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.heatmap.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.heatmap function.
- Return type:
A plot of Axes.
- pyviper.pl.dotplot(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.dotplot.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.dotplot function.
- Return type:
A plot of Axes.
- pyviper.pl.tracksplot(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.tracksplot.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.tracksplot function.
- Return type:
A plot of Axes.
- pyviper.pl.violin(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.violin.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.violin function.
- Return type:
A plot of Axes.
- pyviper.pl.stacked_violin(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.stacked_violin.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.stacked_violin function.
- Return type:
A plot of Axes.
- pyviper.pl.matrixplot(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.matrixplot.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.matrixplot function.
- Return type:
A plot of Axes.
- pyviper.pl.clustermap(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.clustermap.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.clustermap function.
- Return type:
A plot of Axes.
- pyviper.pl.ranking(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.ranking.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.ranking function.
- Return type:
A plot of Axes.
- pyviper.pl.dendrogram(adata, *, plot_stored_gex_data=False, plot_stored_pax_data=False, **kwargs)
A wrapper for the scanpy function sc.pl.dendrogram.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
plot_stored_gex_data (default: False) – Plot adata.uns[‘gex_data’].
plot_stored_pax_data (default: False) – Plot adata.uns[‘pax_data’].
**kwargs – Arguments to provide to the sc.pl.dendrogram function.
- Return type:
A plot of Axes.
pyviper.pp
- pyviper.pp.rank_norm(adata, NUM_FUN=<function _median>, DEM_FUN=<function _mad_from_R>, layer=None, key_added=None, copy=False)
Compute a double rank normalization on an anndata, np.array, or pd.DataFrame.
- Parameters:
adata – Data stored in an anndata object, np.array or pd.DataFrame.
NUM_FUN (default: np.median) – The first function to be applied across each column.
DEM_FUN (default: _mad_from_R) – The second function to be applied across each column.
layer (default: None) – For an anndata input, the layer to use. When None, the input layer is anndata.X.
key_added (default: None) – For an anndata input, the name of the layer in which to store the result. When None, the result is stored in anndata.X.
copy (default: False) – Whether to return a rank-transformed copy (True) or to instead transform the original input (False).
- Returns:
When copy = False, saves the input data as a double rank transformed version.
When copy = True, return a double rank transformed version of the input data.
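One plausible reading of a double rank normalization can be sketched with the standard library: values are first ranked within each sample, then each feature's ranks are centered by the median and scaled by the median absolute deviation (MAD) across samples. The exact steps used by rank_norm are not spelled out here, so the order of operations, the helper names and the 1.4826 scaling constant (the default of R's mad function) are assumptions of this sketch.

```python
from statistics import median


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def scaled_mad(values, constant=1.4826):
    """Median absolute deviation, scaled as in R's mad()."""
    center = median(values)
    return constant * median([abs(v - center) for v in values])


def rank_norm_sketch(matrix):
    """matrix: list of samples (rows), each a list of feature values."""
    ranked = [average_ranks(row) for row in matrix]  # first rank: within sample
    n_features = len(matrix[0])
    out = [[0.0] * n_features for _ in matrix]
    for j in range(n_features):
        column = [row[j] for row in ranked]
        center, scale = median(column), scaled_mad(column)
        for i, rank in enumerate(column):
            # A zero MAD would need special handling; the toy data avoids it.
            out[i][j] = (rank - center) / scale
    return out


samples = [
    [0.1, 0.5, 0.9],
    [4.0, 7.0, 1.0],
    [30.0, 10.0, 20.0],
]
normalized = rank_norm_sketch(samples)
print(normalized)
```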
- pyviper.pp.stouffer(adata, obs_column_name=None, layer=None, filter_by_feature_groups=None, key_added='stouffer', compute_pvals=True, null_iters=1000, verbose=True, return_as_df=False, copy=False)
Compute a stouffer signature on each of your clusters in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
obs_column_name – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the signatures.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'stouffer') – The slot in adata.uns to store the stouffer signatures.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
null_iters (default: 1000) – The number of iterations to use to compute a null model to assess the p-values of each of the stouffer scores.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the stouffer signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster stouffer signatures to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
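Stouffer's method combines z-like scores by summing them and dividing by the square root of their count. The sketch below applies that formula per cluster and per feature in plain Python. It assumes the inputs are already z-like scores (for example NES values), omits the null model used for p-values, and uses a hypothetical helper name.

```python
from collections import defaultdict
from math import sqrt


def stouffer_by_cluster(scores, clusters):
    """scores: one list of feature scores per sample; clusters: one label per sample."""
    grouped = defaultdict(list)
    for row, label in zip(scores, clusters):
        grouped[label].append(row)
    # For each cluster, combine each feature column as sum(z) / sqrt(n).
    return {
        label: [sum(column) / sqrt(len(rows)) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


scores = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 1.0],
    [2.0, -3.0],
    [-1.5, 2.5],
]
clusters = ["a", "a", "a", "a", "b"]
signatures = stouffer_by_cluster(scores, clusters)
print(signatures)
```

Because the sum is divided by the square root of the cluster size, a consistent signal across many samples yields a larger combined score than the same signal in a few samples.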
- pyviper.pp.mwu(adata, obs_column_name=None, layer=None, filter_by_feature_groups=None, key_added='mwu', compute_pvals=True, verbose=True, return_as_df=False, copy=False)
Compute a Mann-Whitney U-Test signature on each of your clusters in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
obs_column_name – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the signatures.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'mwu') – The slot in adata.uns to store the MWU signatures.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the MWU signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster MWU signatures to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
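For intuition, the Mann-Whitney U statistic for one feature compares the values of samples inside a cluster with those outside it. The sketch below counts, over all (inside, outside) pairs, how often the inside value is larger, with ties counted as one half. How pyviper turns this statistic into a signature and a p-value is not shown; the helper name is hypothetical.

```python
def mann_whitney_u(in_cluster, rest):
    """U statistic for in_cluster vs rest: wins count 1, ties count 0.5."""
    u = 0.0
    for x in in_cluster:
        for y in rest:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


in_cluster = [3, 4, 5]
rest = [1, 2, 4]
u_stat = mann_whitney_u(in_cluster, rest)
# U ranges from 0 to len(in_cluster) * len(rest); dividing by that product
# gives the fraction of pairs in which the cluster value is the larger one.
u_fraction = u_stat / (len(in_cluster) * len(rest))
print(u_stat, u_fraction)
```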
- pyviper.pp.spearman(adata, pca_slot='X_pca', obs_column_name=None, layer=None, filter_by_feature_groups=None, key_added='stouffer', compute_pvals=True, null_iters=1000, verbose=True, return_as_df=False, copy=False)
Compute spearman correlation between each gene product and the cluster centroids along with the statistical significance for each of your clusters in an anndata object.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data.
pca_slot – The slot in adata.obsm where a PCA is stored.
obs_column_name – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations.
layer (default: None) – The layer to use as input data to compute the correlation.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: 'spearman') – The slot in adata.uns to store the spearman correlation.
compute_pvals (default: True) – Whether to compute a p-value for each score to return in the results.
null_iters (default: 1000) – The number of iterations to use to compute a null model to assess the p-values of each of the spearman scores.
verbose (default: True) – Whether to provide additional output during the execution of the function.
return_as_df (default: False) – If True, returns the spearman signature in a pd.DataFrame. If False, stores it in adata.var[key_added].
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When return_as_df is False, adds the cluster spearman correlation to adata.var[key_added]. When return_as_df is True, returns as pd.DataFrame.
- pyviper.pp.viper_similarity(adata, nn=None, ws=[4, 2], alternative=['two-sided', 'greater', 'less'], layer=None, filter_by_feature_groups=None, key_added='viper_similarity', copy=False)
Compute the similarity between the columns of a VIPER-predicted activity or gene expression matrix. While following the same concept as the two-tail Gene Set Enrichment Analysis (GSEA)[1], it is based on the aREA algorithm[2].
If ws is a single number, weighting is performed using an exponential function. If ws is a vector of two numbers, weighting is performed with a symmetric sigmoid function using the first element as inflection point and the second as trend.
- Parameters:
adata – An anndata.AnnData containing protein activity (NES), where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
nn (default: None) – Optional number of top regulators to consider for computing the similarity
ws (default: [4, 2]) – Number indicating the weighting exponent for the signature, or vector of 2 numbers indicating the inflection point and the value corresponding to a weighting score of .1 for a sigmoid transformation. Only used if nn is omitted.
alternative (default: 'two-sided') – Character string indicating whether the most active (‘greater’), the least active (‘less’) or both tails (‘two-sided’) of the signature should be used for computing the similarity.
layer (default: None) – The layer to use as input data to compute the signatures.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
key_added (default: "viper_similarity") – The name of the slot in the adata.obsp to store the output.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
Saves a signature-based distance numpy.ndarray in adata.obsp[key_added].
References
[1] Julio, M. K. -d. et al. Regulation of extra-embryonic endoderm stem cell differentiation by Nodal and Cripto signaling. Development 138, 3885-3895 (2011).
[2] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature genetics, 48(8), 838-847.
- pyviper.pp.aracne3_to_regulon(net_file, net_df=None, anno=None, MI_thres=0, regul_size=50, normalize_MI_per_regulon=True)
Process an output from ARACNe3 to return a pd.DataFrame describing a gene regulatory network with suitable columns for conversion to an object of the Interactome class.
- Parameters:
net_file – A string containing the path to the ARACNe3 output
net_df (default: None) – A pd.DataFrame to pass instead of the path
anno (default: None) – Gene ID annotation
MI_thres (default: 0) – Threshold on Mutual Information (MI) to select the regulators and target pairs
regul_size (default: 50) – Number of (top) targets to include in each regulon
normalize_MI_per_regulon (default: True) – Whether to normalize the MI values within each regulon by that regulon’s maximum value
- Returns:
A pd.DataFrame containing an ARACNe3-inferred gene regulatory network with the following 4 columns: “regulator”, “target”, “mor” (mode of regulation) and “likelihood”.
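The roles of MI_thres, regul_size and normalize_MI_per_regulon can be sketched in plain Python. The input layout (tuples of regulator, target, MI and Spearman correlation) and the function name are assumptions for illustration; the real ARACNe3 output columns and the library's processing may differ.

```python
from collections import defaultdict


def edges_to_regulon(edges, mi_thres=0.0, regul_size=50, normalize=True):
    """edges: iterable of (regulator, target, mi, scc) tuples (hypothetical layout)."""
    by_regulator = defaultdict(list)
    for regulator, target, mi, scc in edges:
        if mi > mi_thres:  # keep only pairs above the MI threshold
            by_regulator[regulator].append((target, mi, scc))

    rows = []
    for regulator, targets in by_regulator.items():
        # Keep the regul_size targets with the highest MI.
        top = sorted(targets, key=lambda t: t[1], reverse=True)[:regul_size]
        max_mi = top[0][1]
        for target, mi, scc in top:
            rows.append({
                "regulator": regulator,
                "target": target,
                "mor": scc,  # correlation used as mode of regulation
                "likelihood": mi / max_mi if normalize else mi,
            })
    return rows


edges = [
    ("TF1", "G1", 0.8, 0.5),
    ("TF1", "G2", 0.4, -0.3),
    ("TF1", "G3", 0.2, 0.1),
    ("TF2", "G1", 0.05, 0.9),
]
regulon_rows = edges_to_regulon(edges, mi_thres=0.1, regul_size=2)
for row in regulon_rows:
    print(row)
```

With per-regulon normalization, the strongest target of every regulator receives a likelihood of 1, so likelihoods are comparable within a regulon rather than across regulons.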
- pyviper.pp.nes_to_pval(adata, layer=None, key_added=None, lower_tail=True, adjust=True, axs=1, neg_log=False, copy=False)
Transform VIPER-computed NES into p-values.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object, or a pandas dataframe containing input data, where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
layer (default: None) – Entry of layers to transform.
key_added (default: None) – Name of a new layer in which to save the result instead of adata.X.
lower_tail (default: True) – If True (default), probabilities are P(X <= x). If False, probabilities are P(X > x).
adjust (default: True) – If True, returns adjusted p-values using the Benjamini-Hochberg FDR procedure. If False, does not adjust p-values.
axs (default: 1) – axis along which to perform the p-value correction (Used only if the input is a pd.DataFrame). Possible values are 0 or 1.
neg_log (default: False) – Whether to transform VIPER-computed NES into -log10(p-value).
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
Saves the input data as a transformed version. If key_added is specified, saves the results in adata.layers[key_added].
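The parameters above map onto a short standard-library sketch: a standard normal tail probability for each NES, an optional Benjamini-Hochberg adjustment, and an optional -log10 transform. This is an illustration under the assumption that NES values follow a standard normal under the null; the function names are hypothetical and the library's exact handling may differ.

```python
from math import log10
from statistics import NormalDist


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def nes_to_pval_sketch(nes_values, lower_tail=True, adjust=True, neg_log=False):
    std_normal = NormalDist()
    if lower_tail:
        pvals = [std_normal.cdf(x) for x in nes_values]        # P(X <= x)
    else:
        pvals = [1.0 - std_normal.cdf(x) for x in nes_values]  # P(X > x)
    if adjust:
        pvals = benjamini_hochberg(pvals)
    if neg_log:
        # Extremely large NES can underflow to p = 0, which log10 cannot take.
        pvals = [-log10(p) for p in pvals]
    return pvals


upper_tail = nes_to_pval_sketch([-3.0, 0.0, 2.5], lower_tail=False)
print(upper_tail)
```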
- pyviper.pp.repr_subsample(adata, pca_slot='X_pca', size=1000, seed=0, key_added='repr_subsample', eliminate=False, verbose=True, njobs=1, copy=False)
A tool for creating a subsample of the input data that is representative of all the populations within the input data, rather than a random sample. This is accomplished by pairing samples together in an iterative fashion until the desired sample size is reached.
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
pca_slot (default: "X_pca") – The slot in adata.obsm where the PCA object is stored. One way of generating this object is with sc.pp.pca.
size (default: 1000) – The size of the representative subsample
eliminate (default: False) – Whether to trim down adata to the subsample (True) or leave the subsample as an annotation in adata.obs[key_added].
seed (default: 0) – The random seed used when taking samples of the data.
verbose (default: True) – Whether to provide runtime information.
njobs (default: 1) – The number of cores to use for the analysis. Using more than 1 core (multicore) speeds up the analysis.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Returns:
When copy is False, saves the subsample annotation in adata.var[key_added].
When copy is True, return an anndata with this annotation.
When eliminate is True, modify the adata by subsetting it down to the subsample.
- pyviper.pp.repr_metacells(adata, counts=None, pca_slot='X_pca', dist_slot='corr_dist', clusters_slot=None, score_slot=None, score_min_thresh=None, size=500, n_cells_per_metacell=None, min_median_depth=10000, perc_data_to_use=None, perc_incl_data_reused=None, seed=0, key_added='metacells', verbose=True, njobs=1, copy=False)
A tool for creating a representative selection of metacells from the data that aims to maximize reusing samples from the data, while simultaneously ensuring that all neighbors are close to the metacell they construct. When using this function, exactly two of the following three parameter groups must be set: (1) size, (2) min_median_depth or n_cells_per_metacell, (3) perc_data_to_use or perc_incl_data_reused. Note that min_median_depth and n_cells_per_metacell cannot both be set, since they are directly related (e.g. a higher n_cells_per_metacell means more neighbors are used to construct a single metacell, so each metacell will have more counts, resulting in a higher median depth). Likewise, perc_data_to_use and perc_incl_data_reused cannot both be set, since they are directly related (e.g. a higher perc_data_to_use means more data is included, which makes reuse more likely, resulting in a higher perc_incl_data_reused).
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
counts (default: None) – A pandas DataFrame or AnnData object of unnormalized gene expression counts that has the same samples in the same order as that of adata. If counts are left as None, adata must have counts stored in adata.raw.
pca_slot (default: "X_pca") – The slot in adata.obsm where the PCA object is stored. One way of generating this object is with sc.pp.pca.
dist_slot (default: "corr_dist") – The slot in adata.obsp where the distance object is stored. One way of generating this object is with pyviper.pp.corr_distance.
clusters_slot (default: None) – The slot in adata.obs where cluster labels are stored. Cluster-specific metacells will be generated using the same parameters with the results for each cluster being stored separately in adata.uns.
score_slot (default: None) – The slot in adata.obs where a score used to determine and filter cell quality is stored (e.g. silhouette score).
score_min_thresh (default: None) – The score from adata.obs[score_slot] that a cell must have at minimum to be used for metacell construction (e.g. 0.25 is the rule of thumb for silhouette score).
size (default: 500) – A specific number of metacells to generate. If set to None, perc_data_to_use or perc_incl_data_reused can be used to specify the size when n_cells_per_metacell or min_median_depth is given.
n_cells_per_metacell (default: None) – The number of cells that should be used to generate a single metacell. Note that this parameter and min_median_depth cannot both be set as they directly relate: e.g. higher n_cells_per_metacell leads to higher min_median_depth. If left as None, perc_data_to_use or perc_incl_data_reused can be used to specify n_cells_per_metacell when size is given.
min_median_depth (default: 10000) – The desired minimum median depth for the metacells (indirectly specifies n_cells_per_metacell). The default is set to 10000 as this is recommended by PISCES[1]. Note that this parameter and n_cells_per_metacell cannot both be set as they directly relate: e.g. higher min_median_depth leads to higher n_cells_per_metacell.
perc_data_to_use (default: None) – The percent of the total amount of provided samples that will be used in the creation of metacells. Note that this parameter and perc_incl_data_reused cannot both be set as they directly relate: e.g. higher perc_data_to_use leads to higher perc_incl_data_reused.
perc_incl_data_reused (default: None) – The percent of samples that are included in the creation of metacells that will be reused (i.e. used in more than one metacell). Note that this parameter and perc_data_to_use cannot both be set as they directly relate: e.g. higher perc_incl_data_reused leads to higher perc_data_to_use.
seed (default: 0) – The random seed used when taking samples of the data.
key_added (default: "metacells") – The name of the slot in the adata.uns to store the output.
verbose (default: True) – Whether to provide runtime information and quality statistics.
njobs (default: 1) – The number of cores to use for the analysis. Using more than 1 core (multicore) speeds up the analysis.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
Saves the metacells as a pandas dataframe in adata.uns[key_added]. Attributes that contain parameters for and statistics about the construction of the metacells are stored in adata.uns[key_added].attrs. Set copy = True to return a new AnnData object.
References
[1] Obradovic, A., Vlahos, L., Laise, P., Worley, J., Tan, X., Wang, A., & Califano, A. (2021). PISCES: A pipeline for the systematic, protein activity-based analysis of single cell RNA sequencing data. bioRxiv, 6, 22.
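The rule on which repr_metacells parameters may be combined can be sketched as a small check. This is only an illustration of the rule as stated above; the helper name check_metacell_params is hypothetical, and pyviper's own validation may differ in detail.

```python
def check_metacell_params(size=None, n_cells_per_metacell=None,
                          min_median_depth=None, perc_data_to_use=None,
                          perc_incl_data_reused=None):
    """Return True when the combination follows the documented rule.

    Hypothetical helper for illustration; not a pyviper function.
    """
    # These two pairs are directly related, so at most one of each may be set.
    if n_cells_per_metacell is not None and min_median_depth is not None:
        return False
    if perc_data_to_use is not None and perc_incl_data_reused is not None:
        return False
    # Exactly two of the three parameter groups must be specified.
    groups_set = [
        size is not None,
        n_cells_per_metacell is not None or min_median_depth is not None,
        perc_data_to_use is not None or perc_incl_data_reused is not None,
    ]
    return sum(groups_set) == 2

# The defaults in the signature (size=500, min_median_depth=10000) pass.
defaults_ok = check_metacell_params(size=500, min_median_depth=10000)
```

Note that because size and min_median_depth both have non-None defaults, switching to a percentage-based specification requires explicitly setting one of them to None.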
pyviper.tl
- pyviper.tl.pca(adata, *, layer=None, filter_by_feature_groups=None, **kwargs)
A wrapper for the scanpy function sc.tl.pca.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
layer (default: None) – The layer to use as input data.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
**kwargs – Arguments to provide to the sc.tl.pca function.
- pyviper.tl.dendrogram(adata, *, groupby, key_added=None, layer=None, filter_by_feature_groups=None, **kwargs)
A wrapper for the scanpy function sc.tl.dendrogram.
- Parameters:
adata – Gene expression, protein activity or pathways stored in an anndata object.
key_added (default: None) – The key in adata.uns where the dendrogram should be stored.
layer (default: None) – The layer to use as input data.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
**kwargs – Arguments to provide to the sc.tl.dendrogram function.
- pyviper.tl.oncomatch(pax_data_to_test, pax_data_for_cMRs, tcm_size=50, both_ways=False, om_max_NES_threshold=30, om_min_logp_threshold=0, enrichment='aREA', key_added='om', return_as_df=False, copy=False)
The OncoMatch algorithm[1] assesses the overlap in differentially active MR proteins between two sets of samples (e.g. to validate GEMMs as effective models of human tumor samples). It does so by computing -log10 p-values for each sample in pax_data_to_test of the MRs of each sample in pax_data_for_cMRs.
- Parameters:
pax_data_to_test – An anndata.AnnData or pd.DataFrame containing protein activity (NES), where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
pax_data_for_cMRs – An anndata.AnnData or pd.DataFrame containing protein activity (NES), where rows are observations/samples (e.g. cells or groups) and columns are features (e.g. proteins or pathways).
tcm_size (default: 50) – Number of top MRs from each sample to use to compute regulators.
both_ways (default: False) – Whether to also use the candidate MRs of pax_data_to_test to compute NES for the samples in pax_data_for_cMRs, and then average.
om_max_NES_threshold (default: 30) – The maximum NES value allowed; scores beyond this threshold are cut off.
om_min_logp_threshold (default: 0) – The minimum logp value threshold, such that all logp values smaller than this value are set to 0.
enrichment (default: 'aREA') – The method used to compute enrichment: ‘aREA’ or ‘NaRnEA’.
key_added (default: 'om') – The slot in pax_data_to_test.obsm to store the oncomatch results.
return_as_df (default: False) – Instead of adding the OncoMatch DataFrame to pax_data_to_test.obsm, return it directly.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
- Return type:
When copy is False, stores a pd.DataFrame of -log10 p-values with shape (n_samples in pax_data_to_test, n_samples in pax_data_for_cMRs) in pax_data_to_test.obsm[key_added]. When copy is True, a copy of the AnnData is returned with this pd.DataFrame stored. When return_as_df is True, the OncoMatch DataFrame alone is directly returned by the function.
References
[1] Alvarez, M. J. et al. A precision oncology approach to the pharmacological targeting of mechanistic dependencies in neuroendocrine tumors. Nat Genet 50, 979–989, doi:10.1038/s41588-018-0138-4 (2018).
[2] Alvarez, M. J. et al. Reply to ’H-STS, L-STS and KRJ-I are not authentic GEPNET cell lines’. Nat Genet 51, 1427–1428, doi:10.1038/s41588-019-0509-5 (2019).
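The effect of om_min_logp_threshold on the OncoMatch output can be sketched on a plain nested list. The matrix values below are made up for illustration, the helper name apply_min_logp_threshold is hypothetical, and the sketch covers only the thresholding step described above, not the enrichment computation itself.

```python
def apply_min_logp_threshold(logp_rows, threshold):
    """Set every -log10 p-value below `threshold` to 0.

    Rows stand for samples in pax_data_to_test and columns for samples in
    pax_data_for_cMRs. Hypothetical helper; not a pyviper function.
    """
    return [[v if v >= threshold else 0.0 for v in row] for row in logp_rows]

# Made-up -log10 p-values for 2 test samples against 2 reference samples.
logp = [[0.5, 2.0],
        [3.1, 1.0]]
thresholded = apply_min_logp_threshold(logp, threshold=1.3)
```

With the default threshold of 0, no value is changed, since -log10 p-values are non-negative.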
- pyviper.tl.find_top_mrs(adata, pca_slot='X_pca', obs_column_name=None, layer=None, N=50, both=True, method='stouffer', key_added='mr', filter_by_feature_groups=None, rank=False, filter_by_top_mrs=False, return_as_df=False, copy=False, verbose=True)
Identify the top N master regulator (MR) proteins in a VIPER AnnData object.
- Parameters:
adata – An anndata object containing a distance object in adata.obsp.
pca_slot – The slot in adata.obsm where a PCA is stored. Only required when method is “spearman”.
obs_column_name – The name of the column of observations in adata to use as clusters, or a cluster vector corresponding to observations. Required when method is “mwu” or “spearman”.
N (default: 50) – The number of MRs to return.
both (default: True) – Whether to return both the top N and bottom N MRs (True) or just the top N (False).
method (default: "stouffer") – The method used to compute a signature to identify the top candidate master regulators (MRs). The options come from functions in pyviper.pp. Choose between “stouffer”, “mwu”, or “spearman”.
key_added (default: "mr") – The name of the slot in adata.var to store the output.
filter_by_feature_groups (default: None) – The selected regulators, such that all other regulators are filtered out from the input data. If None, all regulators will be included. Regulator sets must be from one of the following: “tfs”, “cotfs”, “sig”, “surf”.
rank (default: False) – When False, a column is added to var with identified MRs labeled as “True”, while all other proteins are labeled as “False”. When True, top MRs are labeled N, N-1, N-2, …, 1, bottom MRs are labeled -N, -(N-1), -(N-2), …, -1, and all other proteins are labeled 0. Higher rank means greater activity, while lower rank means less.
filter_by_top_mrs (default: False) – Whether to filter var to only the top MRs in adata.
return_as_df (default: False) – Whether to return a pd.DataFrame of the top MRs per cluster.
copy (default: False) – Determines whether a copy of the input AnnData is returned.
verbose (default: True) – Whether extended output about the progress of the algorithm is given.
- Return type:
Adds a column adata.var[key_added] or, when clusters are given, adds multiple columns (e.g. key_added_clust1name, key_added_clust2name, etc.) to adata.var. If copy is True, returns a new adata transformed by this function. If return_as_df is True, returns a DataFrame.
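The labeling produced by the rank and both options can be sketched with plain dictionaries. The sketch assumes that the “stouffer” signature is the sum of per-sample NES divided by the square root of the number of samples, and that the bottom MRs receive labels from -N (least active) up to -1; both are assumptions for illustration. The function names and the toy NES values are hypothetical.

```python
import math

def stouffer_signature(nes_by_protein):
    """Integrate per-sample NES into one score per protein: sum(z) / sqrt(n)."""
    return {p: sum(z) / math.sqrt(len(z)) for p, z in nes_by_protein.items()}

def label_top_mrs(signature, n, both=True, rank=False):
    """Label the top (and optionally bottom) n proteins of a signature.

    Assumes 2 * n <= number of proteins so top and bottom do not overlap.
    """
    ordered = sorted(signature, key=signature.get, reverse=True)
    top = ordered[:n]
    bottom = ordered[-n:] if both else []
    if not rank:
        chosen = set(top) | set(bottom)
        return {p: p in chosen for p in signature}
    labels = {p: 0 for p in signature}
    for i, p in enumerate(top):
        labels[p] = n - i          # most active protein gets N
    for i, p in enumerate(reversed(bottom)):
        labels[p] = -(n - i)       # least active protein gets -N
    return labels

# Toy NES values for five proteins across four samples.
nes = {"A": [3.0] * 4, "B": [1.0] * 4, "C": [0.0] * 4,
       "D": [-1.0] * 4, "E": [-2.5] * 4}
sig = stouffer_signature(nes)
ranked = label_top_mrs(sig, n=2, rank=True)
flagged = label_top_mrs(sig, n=2, rank=False)
```

With rank=True, the sign of the label separates activated from inactivated MRs while its magnitude orders them within each group.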
- pyviper.tl.path_enr(gex_data, pathway_interactome, layer=None, eset_filter=True, method=None, enrichment='aREA', mvws=1, njobs=1, batch_size=10000, verbose=True, output_as_anndata=True, transfer_obs=True, store_input_data=True)
Run the variation of VIPER that is specific to pathway enrichment analysis: a single interactome is used and min_targets is set to 0.
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy).
pathway_interactome – An object of class Interactome or one of the following strings that corresponds to msigdb regulons: “c2”, “c5”, “c6”, “c7”, “h”.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: True) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
method (default: None) – A method used to create a gene expression signature from gex_data.X. The default of None is used when gex_data.X is already a gene expression signature. Alternative inputs include “scale”, “rank”, “doublerank”, “mad”, and “ttest”.
enrichment (default: 'aREA') – The algorithm used to calculate the enrichment. Choose between Analytical Ranked Enrichment Analysis (‘aREA’, the default) and Nonparametric Analytical Rank-based Enrichment Analysis (‘NaRnEA’).
mvws (default: 1) – (A) A number indicating the exponent for the metaViper weights. These are only applicable when enrichment = ‘aREA’ and are not used when enrichment = ‘NaRnEA’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same cell type with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different cell types with different epigenetics). (B) The name of a column in gex_data that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
njobs (default: 1) – Number of cores to distribute sample batches into.
batch_size (default: 10000) – Maximum number of samples to process at once. Set to None to split all samples across provided njobs.
verbose (default: True) – Whether extended output about the progress of the algorithm is given.
output_as_anndata (default: True) – Way of delivering output.
transfer_obs (default: True) – Whether to transfer the observation metadata from the input anndata to the output anndata. Thus, not applicable when output_as_anndata==False.
store_input_data (default: True) – Whether to store the input anndata in an unstructured data slot (.uns) of the output anndata. Thus, not applicable when output_as_anndata==False. If the input anndata already contains ‘gex_data’ in .uns, the input will be assumed to be protein activity and will be stored in .uns as ‘pax_data’. Otherwise, the data will be stored as ‘gex_data’ in .uns.
- Return type:
Returns an AnnData object containing the pathways. When store_input_data is True, the input gex_data AnnData is stored in .uns of the returned AnnData.
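Why eset_filter affects gene rankings can be seen in a small sketch: ranking a signature over all genes gives different ranks than ranking only the genes that appear in the interactome. The gene names, signature values and the rank_genes helper are hypothetical, and the plain ascending rank used here is only a stand-in for whatever ranking the enrichment method applies internally.

```python
def rank_genes(signature, keep=None):
    """Rank genes by signature value (1 = lowest), optionally restricted to `keep`.

    Hypothetical helper for illustration; not a pyviper function.
    """
    genes = [g for g in signature if keep is None or g in keep]
    ordered = sorted(genes, key=signature.get)
    return {g: i + 1 for i, g in enumerate(ordered)}

# Toy gene expression signature for one sample.
signature = {"G1": -2.0, "G2": -0.5, "G3": 0.1, "G4": 1.2, "G5": 3.0}
# Genes that appear in the (hypothetical) interactome.
interactome_genes = {"G1", "G4", "G5"}

unfiltered = rank_genes(signature)                        # like eset_filter=False
filtered = rank_genes(signature, keep=interactome_genes)  # like eset_filter=True
```

Here G4 is ranked 4 of 5 without filtering but 2 of 3 with filtering, so the relative position of each gene in the ranked list changes once genes outside the interactome are dropped.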
pyviper.viper
- pyviper.viper(gex_data, interactome, layer=None, eset_filter=True, method=None, enrichment='aREA', mvws=1, min_targets=30, njobs=1, batch_size=10000, verbose=True, output_as_anndata=True, transfer_obs=True, store_input_data=True)
The VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) algorithm[1] allows individuals to compute protein activity using a gene expression signature and an Interactome object that describes the relationship between regulators and their downstream targets. Users can infer normalized enrichment scores (NES) using Analytical Ranked Enrichment Analysis (aREA)[1] or Nonparametric Analytical Rank-based Enrichment Analysis (NaRnEA)[2]. NaRnEA also computes proportional enrichment scores (PES).
The Interactome object must not contain any targets that are not in the features of gex_data. This can be accomplished by running:
interactome.filter_targets(gex_data.var_names)
It is highly recommended to do this on the unPruned network and then prune to ensure the pruned network contains a consistent number of targets per regulator, all of which exist within gex_data.
- Parameters:
gex_data – Gene expression stored in an anndata object (e.g. from Scanpy).
interactome – An object of class Interactome or a list of Interactome objects.
layer (default: None) – The layer in the anndata object to use as the gene expression input.
eset_filter (default: True) – Whether to filter out genes not present in the interactome (True) or to keep this biological context (False). This will affect gene rankings.
method (default: None) – A method used to create a gene expression signature from gex_data.X. The default of None is used when gex_data.X is already a gene expression signature. Alternative inputs include “scale”, “rank”, “doublerank”, “mad”, and “ttest”.
enrichment (default: 'aREA') – The algorithm used to calculate the enrichment. Choose between Analytical Ranked Enrichment Analysis (‘aREA’, the default) and Nonparametric Analytical Rank-based Enrichment Analysis (‘NaRnEA’).
mvws (default: 1) – (A) A number indicating the exponent for the metaViper weights. These are only applicable when enrichment = ‘aREA’ and are not used when enrichment = ‘NaRnEA’. Roughly, a lower number (e.g. 1) results in networks being treated as a consensus network (useful for multiple networks of the same cell type with the same epigenetics), while a higher number (e.g. 10) results in networks being treated as separate (useful for multiple networks of different cell types with different epigenetics). (B) The name of a column in gex_data that contains the manual assignments of samples to networks using list position or network names. (C) “auto”: assign samples to networks based on how well each network allows for sample enrichment.
min_targets (default: 30) – The minimum number of targets that each regulator in the interactome should contain. Regulators that contain fewer targets than this minimum will be culled from the network (via the Interactome.cull method). Users may choose to apply this threshold because an adequate number of targets is needed to accurately predict enrichment.
njobs (default: 1) – Number of cores to distribute sample batches into.
batch_size (default: 10000) – Maximum number of samples to process at once. Set to None to split all samples across provided njobs.
verbose (default: True) – Whether extended output about the progress of the algorithm should be given.
output_as_anndata (default: True) – Way of delivering output.
transfer_obs (default: True) – Whether to transfer the observation metadata from the input anndata to the output anndata. Thus, not applicable when output_as_anndata==False.
store_input_data (default: True) – Whether to store the input anndata in an unstructured data slot (.uns) of the output anndata. Thus, not applicable when output_as_anndata==False. If the input anndata already contains ‘gex_data’ in .uns, the input will be assumed to be protein activity and will be stored in .uns as ‘pax_data’. Otherwise, the data will be stored as ‘gex_data’ in .uns.
- Returns:
A dictionary of numpy.ndarray objects containing NES values (key: ‘nes’) and PES values (key: ‘pes’) when output_as_anndata=False and enrichment = ‘NaRnEA’.
A pd.DataFrame containing NES values when output_as_anndata=False and enrichment = ‘aREA’.
An anndata object containing NES values in .X when output_as_anndata=True (default). Will contain PES values in the layer ‘pes’ when enrichment = ‘NaRnEA’. Will contain ‘gex_data’ and/or ‘pax_data’ in the unstructured data slot (.uns) when store_input_data = True. Will contain identical .obs to the input anndata when transfer_obs = True.
References
[1] Alvarez, M. J., Shen, Y., Giorgi, F. M., Lachmann, A., Ding, B. B., Ye, B. H., & Califano, A. (2016). Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature genetics, 48(8), 838-847.
[2] Griffin, A. T., Vlahos, L. J., Chiuzan, C., & Califano, A. (2023). NaRnEA: An Information Theoretic Framework for Gene Set Analysis. Entropy, 25(3), 542.
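The combined effect of filtering targets to the features of gex_data and then applying min_targets can be sketched with a plain dictionary standing in for an Interactome. This is a conceptual illustration only: the filter_and_cull helper, the regulator names and the gene names are hypothetical, and the sketch does not reproduce how the Interactome class stores or processes regulons.

```python
def filter_and_cull(regulons, available_genes, min_targets):
    """Drop targets missing from the data, then drop regulators left with
    fewer than `min_targets` targets.

    `regulons` maps each regulator to a list of its targets. Hypothetical
    helper for illustration; not a pyviper function.
    """
    available = set(available_genes)
    # Step 1: keep only targets that exist among the dataset's features.
    filtered = {reg: [t for t in targets if t in available]
                for reg, targets in regulons.items()}
    # Step 2: remove regulators that no longer have enough targets.
    return {reg: targets for reg, targets in filtered.items()
            if len(targets) >= min_targets}

# Toy network: three regulators with a few targets each.
regulons = {"TF1": ["a", "b", "c", "x"],
            "TF2": ["a", "y", "z"],
            "TF3": ["b", "c"]}
# Stand-in for gex_data.var_names.
var_names = ["a", "b", "c", "d"]

kept = filter_and_cull(regulons, var_names, min_targets=2)
```

In this toy case TF2 has three targets in the network but only one in the data, so it falls below the threshold once targets are filtered, which is why the order of the two steps matters.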