Tools¶

Functions for analysis tools.

Tools take as input some annotated data and either return some computed results, write the results as new annotations within the same data, or return a new annotated data object containing the results. Tools are meant to be composable into a complete processing pipeline (see metacells.pipeline), typically by having one tool create annotation(s) with “agreed upon” name(s) and another tool process them further. While this is flexible and convenient, there is no static typing or analysis to ensure that the input of the next tool was actually created by the previous tool(s).

All the functions included here are exported under metacells.tl.

General Tools¶

Apply¶

class metacells.tools.apply.DefaultValues(slice: Any, full: Any)[source]¶

Default values to use in apply_obs_annotations() and apply_var_annotations().

slice: Any¶: The default value to use for the slice data.

full: Any¶: The default value to use for the full data.

class metacells.tools.apply.Skip[source]¶: A special value indicating to skip the annotation if it does not exist.

class metacells.tools.apply.Raise[source]¶: A special value indicating to raise a KeyError if an annotation does not exist.

metacells.tools.apply.apply_obs_annotations(adata: AnnData, sdata: AnnData, annotations: Dict[str, DefaultValues], *, indices: str | ndarray | Collection[int] | Collection[float] | PandasSeries) → None[source]¶

Apply per-observation (cell) annotations of a slice sdata to the full adata.

Input

Annotated adata, and a slice of it sdata, where the indices is either the vector of full indices of the slice observations, or the name of a per-observation annotation of sdata that contains this vector.

Computation Parameters

Loop on each of the named annotations, where the value associated with the name is used as the default value (see below).
If the slice data does not contain a per-observation (cell) annotation of this name, consider the DefaultValues.slice:
- If it is Raise, raise a KeyError.
- If it is Skip, do not apply the annotation to the full data.
- Otherwise, behave as if the annotation’s value was a vector containing the DefaultValues.slice value.
If the full data does not contain a per-observation (cell) annotation of this name, consider the DefaultValues.full:
- If it is Raise, raise a KeyError.
- If it is Skip, do not apply the annotation to the full data.
- Otherwise, initialize the annotation to a vector containing the DefaultValues.full value.
Apply the slice data values to the entries of the full data identified by the indices.
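
The steps above can be sketched in plain Python. This is only an illustration of the described logic, not the library's implementation: the helper apply_annotation, the use of plain dictionaries of lists in place of AnnData annotations, and the local Skip and Raise marker classes are all assumptions made for the example.

```python
from typing import Any, Dict, List


class Skip:
    """Local stand-in marker: skip the annotation if it is missing."""


class Raise:
    """Local stand-in marker: raise a KeyError if the annotation is missing."""


def apply_annotation(
    full: Dict[str, List[Any]],
    sliced: Dict[str, List[Any]],
    name: str,
    indices: List[int],
    full_size: int,
    slice_default: Any,
    full_default: Any,
) -> None:
    """Copy one per-observation annotation from the slice into the full data."""
    # Step 1: obtain the slice values, honoring the slice default.
    if name in sliced:
        slice_values = sliced[name]
    elif slice_default is Raise:
        raise KeyError(name)
    elif slice_default is Skip:
        return
    else:
        slice_values = [slice_default] * len(indices)

    # Step 2: make sure the full annotation exists, honoring the full default.
    if name not in full:
        if full_default is Raise:
            raise KeyError(name)
        if full_default is Skip:
            return
        full[name] = [full_default] * full_size

    # Step 3: write the slice values into the entries identified by the indices.
    for position, index in enumerate(indices):
        full[name][index] = slice_values[position]


full_obs: Dict[str, List[Any]] = {}
slice_obs = {"cluster": [7, 9]}
apply_annotation(full_obs, slice_obs, "cluster", [1, 3], 5, Raise, -1)
print(full_obs["cluster"])  # [-1, 7, -1, 9, -1]
```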

metacells.tools.apply.apply_var_annotations(adata: AnnData, sdata: AnnData, annotations: Dict[str, DefaultValues], *, indices: str | ndarray | Collection[int] | Collection[float] | PandasSeries) → None[source]¶

Apply per-variable (gene) annotations of a slice sdata to the full adata.

Input

Annotated adata, and a slice of it sdata, where the indices is either the vector of full indices of the slice variables, or the name of a per-variable annotation of sdata that contains this vector.

Computation Parameters

Loop on each of the named annotations, where the value associated with the name is used as the default value (see below).
If the slice data does not contain a per-variable (gene) annotation of this name, consider the DefaultValues.slice:
- If it is Raise, raise a KeyError.
- If it is Skip, do not apply the annotation to the full data.
- Otherwise, behave as if the annotation’s value was a vector containing the DefaultValues.slice value.
If the full data does not contain a per-variable (gene) annotation of this name, consider the DefaultValues.full:
- If it is Raise, raise a KeyError.
- If it is Skip, do not apply the annotation to the full data.
- Otherwise, initialize the annotation to a vector containing the DefaultValues.full value.
Apply the slice data values to the entries of the full data identified by the indices.

Convey¶

metacells.tools.convey.convey_group_to_obs(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, default: Any = None) → None[source]¶

Project the value of a property from per-group data to per-observation data.

The input annotated gdata is expected to contain a per-observation (group) annotation named property_name. The input annotated adata is expected to contain a per-observation annotation named group which identifies the group each observation (cell) belongs to.

This will generate a new per-observation (cell) annotation in adata, named to_property_name (by default, the same as property_name), containing the value of the property for the group it belongs to. If the group annotation contains a negative number instead of a valid group index, the default value is used.
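
As a rough illustration of this projection, the sketch below uses plain lists instead of AnnData annotations. The helper name project_group_to_obs and the sample values are hypothetical.

```python
from typing import Any, List


def project_group_to_obs(
    group_of_obs: List[int], value_of_group: List[Any], default: Any = None
) -> List[Any]:
    """Give each observation the value of its group, or the default if ungrouped."""
    return [
        value_of_group[group] if group >= 0 else default
        for group in group_of_obs
    ]


cell_metacell = [0, 2, -1, 1, 2]    # per-cell group index; -1 means "no group"
metacell_type = ["T", "B", "NK"]    # per-group property
cell_type = project_group_to_obs(cell_metacell, metacell_type, default="Outlier")
print(cell_type)  # ['T', 'NK', 'Outlier', 'B', 'NK']
```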

metacells.tools.convey.convey_obs_to_obs(*, adata: AnnData, bdata: AnnData, property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, default: Any = None) → None[source]¶

Project the value of a property from one annotated data to another.

The observation names are expected to be compatible between adata and bdata. The annotated adata is expected to contain a per-observation (cell) annotation named property_name.

This will generate a new per-observation (cell) annotation in bdata, named to_property_name (by default, the same as property_name), containing the value of the observation with the same name in adata. If no such observation exists, the default value is used.

metacells.tools.convey.convey_obs_to_group(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, method: Callable[[ndarray | Collection[int] | Collection[float] | PandasSeries], Any] = <function most_frequent>) → None[source]¶

Project the value of a property from per-observation data to per-group data.

The input annotated adata is expected to contain a per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.

This will generate a new per-observation (group) annotation in gdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.

The aggregation method (by default, metacells.utilities.computation.most_frequent()) is any function taking an array of values and returning a single value.
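
A minimal sketch of this aggregation, using plain lists. The helper names are hypothetical, and most_common_value is only a stand-in for the library's most_frequent function; the sketch also assumes every group has at least one member and that cells with a negative group index are ignored.

```python
from collections import Counter
from typing import Any, Callable, List, Sequence


def most_common_value(values: Sequence[Any]) -> Any:
    """Stand-in aggregation: the value that appears most often."""
    return Counter(values).most_common(1)[0][0]


def aggregate_obs_to_group(
    group_of_obs: Sequence[int],
    value_of_obs: Sequence[Any],
    groups_count: int,
    method: Callable[[Sequence[Any]], Any] = most_common_value,
) -> List[Any]:
    """Aggregate the per-observation values of each group into a single value."""
    aggregated = []
    for group in range(groups_count):
        members = [
            value for g, value in zip(group_of_obs, value_of_obs) if g == group
        ]
        aggregated.append(method(members))
    return aggregated


cell_metacell = [0, 0, 1, -1, 1, 1]
cell_batch = ["a", "a", "b", "a", "c", "b"]
metacell_batch = aggregate_obs_to_group(cell_metacell, cell_batch, groups_count=2)
print(metacell_batch)  # ['a', 'b']
```

Any other function of a sequence of values (for example max or a mean) could be passed as the method in the same way.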

metacells.tools.convey.convey_obs_fractions_to_group(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None) → None[source]¶

Similar to convey_obs_to_group, but create a per-metacell property for each value of the per-cell property, storing the fraction of cells of the metacell that had that value.

The input annotated adata is expected to contain a per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.

This will generate multiple new per-observation (group) annotations in gdata, named <to_property_name>_fraction_of_<value> (by default, the to_property_name is the same as property_name), each containing the fraction of the metacell's cells that have the specific property value.
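
The naming scheme and the fractions can be illustrated with plain Python. The helper fractions_per_group is hypothetical, and the sketch assumes every group contains at least one cell.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def fractions_per_group(
    group_of_obs: Sequence[int],
    value_of_obs: Sequence[Any],
    groups_count: int,
    to_property_name: str,
) -> Dict[str, List[float]]:
    """Return one per-group vector of fractions for each distinct value."""
    counts = [Counter() for _ in range(groups_count)]
    for group, value in zip(group_of_obs, value_of_obs):
        if group >= 0:  # ignore observations that belong to no group
            counts[group][value] += 1
    sizes = [sum(counter.values()) for counter in counts]
    values = sorted({value for counter in counts for value in counter})
    return {
        f"{to_property_name}_fraction_of_{value}": [
            counts[group][value] / sizes[group] for group in range(groups_count)
        ]
        for value in values
    }


cell_metacell = [0, 0, 1, 1, 1, 1]
cell_batch = ["a", "b", "a", "a", "a", "b"]
batch_fractions = fractions_per_group(cell_metacell, cell_batch, 2, "batch")
print(batch_fractions)
# {'batch_fraction_of_a': [0.5, 0.75], 'batch_fraction_of_b': [0.5, 0.25]}
```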

metacells.tools.convey.convey_obs_obs_to_group_group(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, method: Callable[[ndarray | CompressedMatrix | PandasFrame | SparseMatrix], Any] = <function nanmean_matrix>) → None[source]¶

Project the value of a property from per-observation-per-observation data to per-group-per-group data.

The input annotated adata is expected to contain a per-observation-per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.

This will generate a new per-observation-per-observation (group) annotation in gdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.

The aggregation method (by default, metacells.utilities.computation.nanmean_matrix()) is any function taking a matrix of values and returning a single value.

Filtering the Data¶

Filter¶

metacells.tools.filter.filter_data(adata: AnnData, obs_masks: List[str] = [], var_masks: List[str] = [], *, mask_obs: str | None = None, mask_var: str | None = None, invert_obs: bool = False, invert_var: bool = False, track_obs: str | None = None, track_var: str | None = None, name: str | None = None, top_level: bool = True) → Tuple[AnnData, PandasSeries, PandasSeries] | None[source]¶

Filter (slice) the data based on previously-computed masks.

For example, it is useful to discard cell-cycle genes, cells which have too few UMIs for meaningful analysis, etc. In general, the “best” filter depends on the data set.

This function makes it easy to combine different pre-computed per-observation (cell) and per-variable (gene) boolean mask annotations into a final overall inclusion mask, and slice the data accordingly, while tracking the base index of the cells and genes in the filtered data.

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

An annotated data containing a subset of the observations (cells) and variables (genes).

If no observations and/or no variables were selected by the filter, returns None.

If name is not specified, the returned data will be unnamed. Otherwise, if the name starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.

If mask_obs and/or mask_var are specified, store the mask of the selected data as a per-observation and/or per-variable annotation of the full adata.

If track_obs and/or track_var are specified, store the original indices of the selected data as a per-observation and/or per-variable annotation of the result data.

Computation Parameters

Combine the masks in obs_masks and/or var_masks using metacells.tools.mask.combine_masks() passing it invert_obs and invert_var, and mask_obs and mask_var as the to parameter. If either list of masks is empty, use the full mask.
If the obtained mask for either the observations or the variables is empty, return None. Otherwise, return a slice of the full data containing just the observations and variables specified by the final masks.

Mask¶

metacells.tools.mask.combine_masks(adata: AnnData, masks: Collection[str], *, invert: bool = False, to: str | None = None) → PandasSeries | None[source]¶

Combine different pre-computed masks into a final overall mask.

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

If to (default: None) is None, returns the computed mask. Otherwise, sets the mask as an annotation (per-variable or per-observation depending on the type of the combined masks).

Computation Parameters

Fetch each of the masks, in order (left to right). Silently ignore missing masks if the name has a ? suffix. If the first character of the mask name is &, restrict the current mask; otherwise the first character must be |, and the mask expands the current mask (the first mask becomes the current mask regardless of its first character). If the following character is ~, invert the mask before applying it.

If invert (default: False), invert the final result mask.
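
The mask name syntax can be illustrated with a small interpreter over plain lists of booleans. This is a sketch of the rules as described here, not the library's code: the helper combine_named_masks is hypothetical, and returning None when every mask is missing is an assumption.

```python
from typing import Dict, List, Optional, Sequence


def combine_named_masks(
    available: Dict[str, List[bool]],
    names: Sequence[str],
    invert: bool = False,
) -> Optional[List[bool]]:
    """Combine masks named like '&name', '|name', '&~name' or '&name?'."""
    current: Optional[List[bool]] = None
    for name in names:
        operator, name = name[0], name[1:]
        if operator not in ("&", "|"):
            raise ValueError(f"mask name must start with & or |: {operator}{name}")
        invert_mask = name.startswith("~")
        if invert_mask:
            name = name[1:]
        optional = name.endswith("?")
        if optional:
            name = name[:-1]

        if name not in available:
            if optional:
                continue  # silently ignore a missing optional mask
            raise KeyError(name)

        mask = [not value for value in available[name]] if invert_mask else list(available[name])
        if current is None:
            current = mask  # the first mask is used as-is
        elif operator == "&":
            current = [a and b for a, b in zip(current, mask)]
        else:
            current = [a or b for a, b in zip(current, mask)]

    if current is not None and invert:
        current = [not value for value in current]
    return current


gene_masks = {
    "high_total_gene": [True, True, False, True],
    "excluded_gene": [False, True, False, False],
}
selected = combine_named_masks(
    gene_masks, ["&high_total_gene", "&~excluded_gene", "&~lateral_gene?"]
)
print(selected)  # [True, False, False, True]
```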

Properly Sampled¶

metacells.tools.properly_sampled.compute_excluded_gene_umis(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__') → None[source]¶: Given an excluded_gene mask, compute the total excluded_umis of each cell.

Detect cells with a “proper” amount of what (default: __x__) data.

Due to both technical effects and natural variance between cells, the total number of UMIs varies from cell to cell. We often would like to work on cells that contain a sufficient number of UMIs for meaningful analysis; we sometimes also wish to exclude cells which have “too many” UMIs.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Observation (Cell) Annotations

properly_sampled_cell: A boolean mask indicating whether each cell has a “proper” amount of UMIs.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the observation names).

Computation Parameters

Exclude all cells whose total data is less than the min_cell_total (no default), unless it is None.
Exclude all cells whose total data is more than the max_cell_total (no default), unless it is None.
If max_excluded_genes_fraction (no default) is not None, then exclude all cells whose sum of the excluded data (as defined by the excluded_gene mask) divided by the total data is more than the specified threshold.
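
The three exclusion rules above can be sketched over a small list-of-lists UMI matrix (one row per cell). The helper name is hypothetical, and taking the "total data" to include the excluded genes is an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def properly_sampled_cells_mask(
    umis_of_cells: Sequence[Sequence[int]],
    excluded_gene: Sequence[bool],
    min_cell_total: Optional[int],
    max_cell_total: Optional[int],
    max_excluded_genes_fraction: Optional[float],
) -> List[bool]:
    """Return True for each cell that passes all the enabled thresholds."""
    mask = []
    for umis in umis_of_cells:
        total = sum(umis)
        excluded = sum(count for count, is_excluded in zip(umis, excluded_gene) if is_excluded)
        keep = True
        if min_cell_total is not None and total < min_cell_total:
            keep = False
        if max_cell_total is not None and total > max_cell_total:
            keep = False
        if (
            max_excluded_genes_fraction is not None
            and total > 0
            and excluded / total > max_excluded_genes_fraction
        ):
            keep = False
        mask.append(keep)
    return mask


umis_of_cells = [
    [10, 0, 5],   # total 15, no excluded UMIs: kept
    [1, 1, 0],    # total 2, below the minimum: dropped
    [2, 20, 3],   # 20 of 25 UMIs are in the excluded gene: dropped
]
excluded_gene = [False, True, False]
cell_mask = properly_sampled_cells_mask(umis_of_cells, excluded_gene, 5, 100, 0.5)
print(cell_mask)  # [True, False, False]
```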

Detect genes with a “proper” amount of what (default: __x__) data.

Due to both technical effects and natural variance between genes, the expression of genes varies greatly between cells. This is exactly the information we are trying to analyze. We often would like to work on genes that have a sufficient level of expression for meaningful analysis. Specifically, it doesn’t make sense to analyze genes that have zero expression in all the cells.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

properly_sampled_gene: A boolean mask indicating whether each gene has a “proper” number of UMIs.

If inplace (default: True), this is written to the data and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Exclude all genes whose total data is less than the min_gene_total (default: 1).

Named¶

Find genes by their (case-insensitive) name.

This computes a mask of all the genes whose name appears in names or matches any of the patterns. If invert (default: False), invert the resulting mask.

Depending on op, this will set a (compute a brand new) mask, add the result to a mask (which must exist), or remove genes from a mask (which must exist).

If name_property is specified the mask will be based on this per-variable (gene) property.

If to (default: None) is specified, this is stored as a per-variable (gene) annotation with that name, and returns None. This is useful to fill gene masks such as excluded_genes (genes which should be excluded from the rest of the processing), lateral_genes (genes which must not be selected for metacell computation) and noisy_genes (genes which are given more leeway when computing deviant cells).

Otherwise, it returns it as a pandas series (indexed by the variable, that is gene, names).

High¶

Find genes which have high total number of what (default: __x__) data.

This should typically only be applied to downsampled data to ensure that variance in sampling depth does not affect the result.

Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

high_total_gene: A boolean mask indicating whether each gene was found to have a high total.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Use metacells.utilities.computation.sum_per() to get the total UMIs of each gene.
Select the genes whose total is at least min_gene_total.

Find genes which have high total top-Nth value of what (default: __x__) data.

This should typically only be applied to downsampled data to ensure that variance in sampling depth does not affect the result.

Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

high_top<topN>_gene: A boolean mask indicating whether each gene was found to have a high top-Nth value.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Use metacells.utilities.computation.top_per() to get the top-Nth UMIs of each gene.
Select the genes whose top-Nth value is at least min_gene_topN.

Find genes which have high fraction of the total what (default: __x__) data of the cells.

Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

high_fraction_gene: A boolean mask indicating whether each gene was found to have a high fraction.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Use metacells.utilities.computation.fraction_per() to get the fraction of each gene.
Select the genes whose fraction is at least min_gene_fraction (default: 1e-05).

metacells.tools.high.find_high_normalized_variance_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_normalized_variance: float = 5.656854249492381, inplace: bool = True) → PandasSeries | None[source]¶

Find genes which have high normalized variance of what (default: __x__) data.

The normalized variance measures the variance / mean of each gene. See metacells.utilities.computation.normalized_variance_per() for details.

Genes with a high normalized variance are “bursty”, that is, have significantly different expression level in different cells.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

high_normalized_variance_gene: A boolean mask indicating whether each gene was found to have a high normalized variance.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Use metacells.utilities.computation.normalized_variance_per() to get the normalized variance of each gene.
Select the genes whose normalized variance is at least min_gene_normalized_variance (default: 5.656854249492381).
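
A minimal sketch of the selection, assuming the normalized variance is the population variance divided by the mean (the exact definition used by normalized_variance_per() may differ). The helper name and the sample data are hypothetical. The default threshold is approximately 2 ** 2.5.

```python
import math
from typing import List, Sequence

MIN_GENE_NORMALIZED_VARIANCE = 5.656854249492381  # approximately 2 ** 2.5


def normalized_variance(values: Sequence[float]) -> float:
    """Population variance divided by the mean (0 for an all-zero gene)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    return variance / mean


umis_of_genes = [
    [2, 2, 2, 2],    # steady gene: normalized variance 0
    [0, 0, 0, 40],   # bursty gene: variance 300, mean 10
    [1, 2, 3, 2],    # mild variation: variance 0.5, mean 2
]
scores: List[float] = [normalized_variance(gene) for gene in umis_of_genes]
high_mask = [score >= MIN_GENE_NORMALIZED_VARIANCE for score in scores]
print(scores)     # [0.0, 30.0, 0.25]
print(high_mask)  # [False, True, False]
```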

metacells.tools.high.find_high_relative_variance_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_relative_variance: float = 0.1, window_size: int = 100, inplace: bool = True) → PandasSeries | None[source]¶

Find genes which have high relative variance of what (default: __x__) data.

The relative variance measures the variance / mean of each gene relative to the other genes with a similar level of expression. See metacells.utilities.computation.relative_variance_per() for details.

Genes with a high relative variance are good candidates for being selected as “marker genes”, that is, for being used to compute the similarity between cells. Using the relative variance compensates for the bias towards selecting higher-expression genes, whose normalized variance can be larger due to random noise alone.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

high_relative_variance_gene: A boolean mask indicating whether each gene was found to have a high relative variance.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Use metacells.utilities.computation.relative_variance_per() to get the relative variance of each gene.
Select the genes whose relative variance is at least min_gene_relative_variance (default: 0.1).

metacells.tools.high.find_metacells_marker_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_range_fold: float = 2.0, regularization: float = 1e-05, min_max_gene_fraction: float = 0.0001, inplace: bool = True) → PandasSeries | None[source]¶

Find “marker” genes which have a significant signal in metacells data. This computation is too unreliable to be used on cells.

Find genes which have a high maximal expression in at least one metacell, and a wide range of expression across the metacells. Such genes are good candidates for being used as marker genes and/or to compute distances between metacells.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

marker_gene: A boolean mask indicating whether each gene is a “marker”.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

Compute the minimal and maximal expression level of each gene.
Select the genes whose fold factor (log2 of maximal over minimal value, using the regularization (default: 1e-05)) is at least min_gene_range_fold (default: 2.0).
Select the genes whose maximal expression is at least min_max_gene_fraction (default: 0.0001).
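
The two criteria can be sketched for a single gene as follows. The helper name is hypothetical, and adding the regularization to both the maximal and the minimal value before taking the ratio is an assumption about how the regularization is applied.

```python
import math
from typing import Sequence


def is_marker_gene(
    fraction_in_metacells: Sequence[float],
    min_gene_range_fold: float = 2.0,
    regularization: float = 1e-5,
    min_max_gene_fraction: float = 1e-4,
) -> bool:
    """Check the fold range and the maximal expression of one gene."""
    low = min(fraction_in_metacells)
    high = max(fraction_in_metacells)
    fold = math.log2((high + regularization) / (low + regularization))
    return fold >= min_gene_range_fold and high >= min_max_gene_fraction


wide_range = is_marker_gene([1e-5, 2e-4, 8e-4])    # wide range, high maximum
narrow_range = is_marker_gene([3e-4, 4e-4, 5e-4])  # range is below 2 folds
too_weak = is_marker_gene([0.0, 0.0, 5e-5])        # maximum is below the threshold
print(wide_range, narrow_range, too_weak)  # True False False
```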

Bursty Lonely¶

metacells.tools.bursty_lonely.find_bursty_lonely_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, max_sampled_cells: int = 10000, downsample_min_samples: int = 750, downsample_min_cell_quantile: float = 0.5, downsample_max_cell_quantile: float = 0.05, min_gene_total: int = 100, min_gene_normalized_variance: float = 5.656854249492381, max_gene_similarity: float = 0.1, inplace: bool = True, random_seed: int) → PandasSeries | None[source]¶

Detect “bursty lonely” genes based on what (default: __x__) data.

Return the indices of genes which are “bursty” (have high variance compared to their mean) and also “lonely” (have low correlation with all other genes). Such genes should be excluded since they will never meaningfully help us compute groups, and will actively cause profiles to be considered “deviants”.

Bursty genes have high expression and variance. Lonely genes have no (or low) correlations with any other gene. Bursty lonely genes tend to throw off clustering algorithms. In general, such algorithms try to group together cells with the same overall biological state. Since the genes are lonely, they don’t contribute towards this goal. Since they are bursty, they actively hamper this, because they make cells which are otherwise similar appear different (just for this lonely gene).

It is therefore useful to explicitly identify, in a pre-processing step, the (few) such genes, and exclude them from the rest of the analysis.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

bursty_lonely_genes: A boolean mask indicating whether each gene was found to be a “bursty lonely” gene.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).

Computation Parameters

If we have more than max_sampled_cells (default: 10000), pick this number of random cells. Specify a non-zero random seed to make this reproducible.
Invoke metacells.tools.downsample.downsample_cells() to downsample the cells to the same total number of UMIs, using the downsample_min_samples (default: 750), downsample_min_cell_quantile (default: 0.5), downsample_max_cell_quantile (default: 0.05) and the random_seed.
Find “bursty” genes which have a total number of UMIs of at least min_gene_total (default: 100) and a normalized variance of at least min_gene_normalized_variance (default: 5.656854249492381).
Cross-correlate the bursty genes.
Find the bursty “lonely” genes whose maximal correlation is at most max_gene_similarity (default: 0.1) with all other genes.
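
The final "lonely" test can be sketched in plain Python, assuming the similarity is the (signed) Pearson correlation between the per-cell values of two genes. The helper names and the sample data are hypothetical, and the sampling and downsampling steps are omitted.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0 if either is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0 or variance_y == 0:
        return 0.0
    return covariance / math.sqrt(variance_x * variance_y)


def lonely_mask(genes: Sequence[Sequence[float]], max_gene_similarity: float = 0.1) -> List[bool]:
    """A gene is lonely if its best correlation with any other gene is low."""
    mask = []
    for index, gene in enumerate(genes):
        best = max(
            pearson(gene, other)
            for other_index, other in enumerate(genes)
            if other_index != index
        )
        mask.append(best <= max_gene_similarity)
    return mask


bursty_genes = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],   # perfectly correlated with the first gene
    [5, 0, 0, 5],   # uncorrelated with both other genes
]
print(lonely_mask(bursty_genes))  # [False, False, True]
```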

Rare¶

metacells.tools.rare.find_rare_gene_modules(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, max_genes: int = 500, max_gene_cell_fraction: float = 0.001, min_gene_maximum: int = 7, genes_similarity_method: str = 'repeated_pearson', genes_cluster_method: str = 'ward', min_genes_of_modules: int = 4, min_cells_of_modules: int = 12, target_metacell_size: float = 48, target_pile_size: int = 8000, max_cells_factor_of_random_pile: float = 0.5, min_module_correlation: float = 0.1, min_related_gene_fold_factor: float = 7, max_related_gene_increase_factor: float = 4.0, min_cell_module_total: int = 4, inplace: bool = True, reproducible: bool) → Tuple[PandasFrame, PandasFrame] | None[source]¶

Detect rare gene modules based on what (default: __x__) data.

Rare gene modules include genes which are weakly and rarely expressed, yet are highly correlated with each other, allowing for robust detection. Global analysis algorithms (such as metacells) tend to ignore or at least discount such genes.

It is therefore useful to explicitly identify, in a pre-processing step, the few cells which express such rare gene modules. Once identified, these cells can be exempt from the global algorithm, or the global algorithm can be tweaked in some way to pay extra attention to them.

If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Obeys (ignores the genes of) the lateral_gene per-gene (variable) annotation, if any.

Returns

Observation (Cell) Annotations

cells_rare_gene_module: The index of the rare gene module each cell expresses the most, or -1 in the common case it does not express any rare gene module.
rare_cell: A boolean mask for the (few) cells that express a rare gene module.

Variable (Gene) Annotations

rare_gene: A boolean mask for the genes in any of the rare gene modules.
rare_gene_module: The index of the rare gene module a gene belongs to (-1 for non-rare genes).

If inplace, these are written to the data, and the function returns None. Otherwise they are returned as a tuple containing two data frames.

Computation Parameters

Pick as candidates all genes that are expressed in at most max_gene_cell_fraction (default: 0.001) of the cells, and whose maximal value in a cell is at least min_gene_maximum (default: 7). If a lateral_gene mask exists, exclude its genes from the candidates. Out of the candidates, pick at most max_genes (default: 500) which are expressed in the fewest cells.
Compute the similarity between the genes using metacells.tools.similarity.compute_var_var_similarity() using the genes_similarity_method (default: repeated_pearson).
Create a hierarchical clustering of the candidate genes using the genes_cluster_method (default: ward).
Identify gene modules in the hierarchical clustering which contain at least min_genes_of_modules genes (default: 4), with an average gene-gene cross-correlation of at least min_module_correlation (default: 0.1).
Consider cells expressing any of the genes in the gene module. If the expected number of such cells in each random pile of size target_pile_size (default: 8000), whose total number of UMIs of the rare gene module is at least min_cell_module_total (default: 4), is more than the max_cells_factor_of_random_pile (default: 0.5) as a fraction of the target_metacell_size (default: 48), then discard the rare gene module as not that rare after all.
Add to the gene module all genes whose fraction in cells expressing any of the genes in the rare gene module is at least 2^min_related_gene_fold_factor (default: 7) times their fraction in the rest of the population, as long as their maximal value in one of the expressing cells is at least min_gene_maximum, as long as this doesn’t add more than max_related_gene_increase_factor (default: 4.0) times the original number of cells to the rare gene module, and as long as they are not listed in the lateral_gene mask. If a gene is above the threshold for multiple gene modules, associate it with the gene module for which its fold factor is higher.
Associate cells with the rare gene module if they contain at least min_cell_module_total (default: 4) UMIs of the expanded rare gene module. If a cell meets the above threshold for several rare gene modules, it is associated with the one for which it contains more UMIs.
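
Only the first step (picking the candidate genes) is simple enough to sketch briefly. The version below works on a list-of-lists matrix with one row per gene; the helper name is hypothetical, and treating a gene as "expressed" in a cell when its value is positive is an assumption. The clustering and module expansion steps are not shown.

```python
from typing import List, Optional, Sequence


def rare_gene_candidates(
    umis_of_genes: Sequence[Sequence[int]],
    max_gene_cell_fraction: float,
    min_gene_maximum: int,
    max_genes: int,
    lateral_gene: Optional[Sequence[bool]] = None,
) -> List[int]:
    """Return the indices of the candidate genes, rarest first."""
    candidates = []
    for gene, values in enumerate(umis_of_genes):
        if lateral_gene is not None and lateral_gene[gene]:
            continue  # lateral genes are never candidates
        expressing_cells = sum(1 for value in values if value > 0)
        is_rare = expressing_cells / len(values) <= max_gene_cell_fraction
        is_strong = max(values) >= min_gene_maximum
        if is_rare and is_strong:
            candidates.append((expressing_cells, gene))
    candidates.sort()  # fewest expressing cells first
    return [gene for _, gene in candidates[:max_genes]]


umis_of_genes = [
    [0, 0, 0, 0, 0, 0, 0, 0, 9, 0],   # rare and strong: candidate
    [1, 2, 1, 3, 1, 2, 1, 1, 2, 1],   # expressed everywhere: not rare
    [0, 0, 0, 0, 8, 0, 0, 7, 0, 0],   # rare and strong: candidate
    [0, 0, 3, 0, 0, 0, 0, 0, 0, 0],   # rare but too weak
]
candidate_genes = rare_gene_candidates(
    umis_of_genes, max_gene_cell_fraction=0.2, min_gene_maximum=7, max_genes=500
)
print(candidate_genes)  # [0, 2]
```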

Building the Graph¶

Downsample¶

metacells.tools.downsample.downsample_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, downsample_min_cell_quantile: float = 0.05, downsample_min_samples: float = 750, downsample_max_cell_quantile: float = 0.5, inplace: bool = True, random_seed: int) → Tuple[int, PandasFrame] | None[source]¶

Downsample the values of what (default: __x__) data.

Downsampling is an effective way to get the same number of samples in multiple cells (that is, the same number of total UMIs in multiple cells), and serves as an alternative to normalization (e.g., working with UMI fractions instead of raw UMI counts).

Downsampling is especially important when computing correlations between cells. When there is high variance between the total UMI count in different cells, then normalization will return higher correlation values between cells with a higher total UMI count, which will result in an inflated estimation of their similarity to other cells. Downsampling avoids this effect.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Unstructured Annotations

downsample_samples: The target total number of samples in each downsampled cell.

Variable-Observation (Gene-Cell) Annotations

downsampled: The downsampled data where the total number of samples in each cell is at most downsample_samples.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a tuple with the samples and a pandas data frame (indexed by the cell and gene names).

Computation Parameters

Compute the total samples in each cell.
Decide on the value to downsample to. We would like all cells to end up with at least some reasonable number of samples (total UMIs) downsample_min_samples (default: 750). We’d also like all (most) cells to end up with the highest reasonable downsampled total number of samples, so if possible we increase the number of samples, as long as at most downsample_min_cell_quantile (default: 0.05) of the cells will have a lower number of samples. We’d also like all (most) cells to end up with the same downsampled total number of samples, so if we have to we decrease the number of samples to ensure at most downsample_max_cell_quantile (default: 0.5) of the cells will have a lower number of samples.
Downsample each cell so that it has at most the selected number of samples. Specify a non-zero random_seed to make this reproducible.
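
A minimal sketch of these three steps, using only the Python standard library. It reflects one reading of the description above (raise the target to the low quantile, then cap it at the high quantile) and is not the library's code; the helper names `quantile`, `choose_samples` and `downsample_cell` are hypothetical.

```python
import random


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list (0 <= q <= 1)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    fraction = position - low
    return ordered[low] * (1 - fraction) + ordered[high] * fraction


def choose_samples(totals, min_samples=750, min_cell_quantile=0.05, max_cell_quantile=0.5):
    """Pick the number of samples (total UMIs) to downsample every cell to."""
    # Start from the minimum, and raise it if only a few cells would fall below.
    target = max(min_samples, quantile(totals, min_cell_quantile))
    # Lower it again if too many cells would fall below the target.
    target = min(target, quantile(totals, max_cell_quantile))
    return int(round(target))


def downsample_cell(umis, samples, rng):
    """Randomly keep `samples` of the cell's UMIs (cells with fewer are left as-is)."""
    if sum(umis) <= samples:
        return list(umis)
    # One pool entry per UMI, labelled by its gene index; sample without replacement.
    pool = [gene for gene, count in enumerate(umis) for _ in range(count)]
    result = [0] * len(umis)
    for gene in rng.sample(pool, samples):
        result[gene] += 1
    return result
```

Passing a seeded `random.Random` instance as `rng` makes the sketch reproducible, mirroring the role of random_seed.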

Cross-Similarity¶

metacells.tools.similarity.compute_obs_obs_similarity(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, method: str = 'abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, top: int | None = None, bottom: int | None = None, inplace: bool = True, reproducible: bool) → PandasFrame | None[source]¶

Compute a measure of the similarity between the observations (cells) of what (default: __x__).

If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.

The method (default: abs_pearson) can be one of:
- pearson for computing Pearson correlation.
- abs_pearson for computing the absolute Pearson correlation.
- repeated_pearson for computing correlations-of-correlations.
- repeated_abs_pearson for computing absolute correlations-of-correlations.
- logistics for computing the logistics function.
- logistics_pearson for computing correlations-of-logistics.
- logistics_abs_pearson for computing absolute correlations-of-logistics.

If using the logistics function, use the logistics_slope (default: 0.5) and logistics_location (default: 0.8).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Observations-Pair (cells) Annotations

obs_similarity: A square matrix where each entry is the similarity between a pair of cells.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the observation names).

Computation Parameters

If method (default: abs_pearson) is logistics or logistics_pearson, compute the mean value of the logistics function between the variables of each pair of observations (cells). Otherwise, it should be pearson or repeated_pearson, so compute the cross-correlation between all the observations.
If the method is logistics_pearson or repeated_pearson, then compute the cross-correlation of the results of the previous step. That is, two observations (cells) will be similar if they are similar to the rest of the observations (cells) in the same way. This compensates for the extreme sparsity of the data.
If top and/or bottom are specified, keep just this number of most-similar and/or least-similar values in each row (turning the result into a compressed matrix format).
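
To make the Pearson-based methods concrete, here is a small pure-Python sketch. It covers only pearson, abs_pearson and repeated_pearson on a list of per-observation rows, and a `keep_top` helper for the top parameter; the function names are hypothetical and the logistics methods are deliberately left out, since their formula is not given here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def similarity(rows, method="abs_pearson"):
    """Square similarity matrix between rows (one row of values per observation)."""
    result = [[pearson(a, b) for b in rows] for a in rows]
    if method == "repeated_pearson":
        # Correlations-of-correlations: compare how each pair relates to all others.
        result = [[pearson(a, b) for b in result] for a in result]
    elif method == "abs_pearson":
        result = [[abs(value) for value in row] for row in result]
    elif method != "pearson":
        raise ValueError(f"unsupported method in this sketch: {method}")
    return result


def keep_top(row, top):
    """Zero all but the `top` largest entries of one similarity row."""
    keep = set(sorted(range(len(row)), key=lambda j: -row[j])[:top])
    return [value if j in keep else 0.0 for j, value in enumerate(row)]
```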

metacells.tools.similarity.compute_var_var_similarity(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, method: str = 'abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, top: int | None = None, bottom: int | None = None, inplace: bool = True, reproducible: bool) → PandasFrame | None[source]¶

Compute a measure of the similarity between the variables (genes) of what (default: __x__).

If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

The method (default: abs_pearson) can be one of:
- pearson for computing Pearson correlation.
- abs_pearson for computing the absolute Pearson correlation.
- repeated_pearson for computing correlations-of-correlations.
- repeated_abs_pearson for computing absolute correlations-of-correlations.
- logistics for computing the logistics function.
- logistics_pearson for computing correlations-of-logistics.
- logistics_abs_pearson for computing absolute correlations-of-logistics.

If using the logistics function, use the logistics_slope (default: 0.5) and logistics_location (default: 0.8).

Returns

Variable-Pair (genes) Annotations

var_similarity: A square matrix where each entry is the similarity between a pair of genes.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the variable names).

Computation Parameters

If method (default: abs_pearson) is logistics or logistics_pearson, compute the mean value of the logistics function between the variables of each pair of variables (genes). Otherwise, it should be pearson or repeated_pearson, so compute the cross-correlation between all the variables.
If the method is logistics_pearson or repeated_pearson, then compute the cross-correlation of the results of the previous step. That is, two variables (genes) will be similar if they are similar to the rest of the variables (genes) in the same way. This compensates for the extreme sparsity of the data.
If top and/or bottom are specified, keep just this number of most-similar and/or least-similar values in each row (turning the result into a compressed matrix format).

K-Nearest-Neighbors Graph¶

metacells.tools.knn_graph.compute_obs_obs_knn_graph(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'obs_similarity', *, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 2, inplace: bool = True) → PandasFrame | None[source]¶

Compute a directed K-Nearest-Neighbors graph based on what (default: obs_similarity) similarity data for each pair of observations (cells).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-observation-per-observation matrix or the name of a per-observation-per-observation annotation containing such a matrix.

Returns

Observations-Pair Annotations

obs_outgoing_weights: A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the observation names).

Computation Parameters

Use the obs_similarity and convert it to ranks (in descending order). This gives us a dense asymmetric obs_outgoing_ranks matrix.
Convert the asymmetric outgoing ranks matrix into a symmetric obs_balanced_ranks matrix by element-wise multiplying it with its transpose and taking the square root. That is, for each edge to be high-balanced-rank, the geomean of its outgoing rank has to be high in both nodes it connects.

Note

This can drastically reduce the degree of the nodes, since to survive an edge needs to have been in the top ranks for both its nodes (as multiplying with zero drops the edge). This is why the balanced_ranks_factor needs to be large-ish.
Keep only balanced ranks (geomeans) of up to k * balanced_ranks_factor (default: 3.1622776601683795). This does a preliminary pruning of low-quality edges.
Prune the edges, keeping only the k * incoming_degree_factor (default: 3.0) highest-ranked incoming edges for each node, and then only the k * outgoing_degree_factor (default: 1.0) highest-ranked outgoing edges for each node, while ensuring that the highest-balanced-ranked outgoing edge of each node is preserved. This gives us an asymmetric obs_pruned_ranks matrix, which has the structure we want, but not the correct edge weights yet.

Note

Balancing the ranks, and then pruning the incoming edges, ensures that “hub” nodes, that is nodes that many other nodes prefer to connect with, end up connected to a limited number of such “spoke” nodes.
If there is any node which is left with an out degree of less than min_outgoing_degree (default: 2), increase K by 10% and repeat steps 2-4.
Normalize the outgoing edge weights by dividing them with the sum of their balanced ranks, such that the sum of the outgoing edge weights for each node is 1. Note that there is always at least one outgoing edge for each node. This gives us the obs_outgoing_weights for our directed K-Nearest-Neighbors graph.

Note

Ensuring each node has at least one outgoing edge allows us to always have at least one candidate grouping to add it to. This of course doesn’t protect the node from being rejected by its group as deviant.
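
The first three steps (ranking, balancing and the preliminary pruning) can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes rank 1 means "most similar", works on small dense lists, and omits the degree pruning, the retry with a larger K and the final weight normalization. All function names are hypothetical.

```python
import math


def outgoing_ranks(similarity):
    """ranks[i][j] is 1 for the node j most similar to node i, 2 for the next, etc."""
    size = len(similarity)
    ranks = [[0] * size for _ in range(size)]
    for i, row in enumerate(similarity):
        others = sorted((j for j in range(size) if j != i), key=lambda j: -row[j])
        for rank, j in enumerate(others, start=1):
            ranks[i][j] = rank
    return ranks


def balanced_ranks(ranks):
    """Symmetric matrix holding the geometric mean of the two directed ranks."""
    size = len(ranks)
    return [[math.sqrt(ranks[i][j] * ranks[j][i]) for j in range(size)] for i in range(size)]


def prune_balanced(balanced, k, balanced_ranks_factor=math.sqrt(10)):
    """Drop (zero) every edge whose balanced rank is worse than k * balanced_ranks_factor."""
    limit = k * balanced_ranks_factor
    return [[value if 0 < value <= limit else 0.0 for value in row] for row in balanced]
```

Because the geometric mean is symmetric, an edge survives only if both of its nodes rank each other reasonably well, which is the balancing effect described in the notes above.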

metacells.tools.knn_graph.compute_var_var_knn_graph(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'var_similarity', *, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 2, inplace: bool = True) → PandasFrame | None[source]¶

Compute a directed K-Nearest-Neighbors graph based on what (default: var_similarity) similarity data for each pair of variables (genes).

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-variable matrix or the name of a per-variable-per-variable annotation containing such a matrix.

Returns

Variables-Pair Annotations

var_outgoing_weights: A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the variable names).

Computation Parameters

Use the var_similarity and convert it to ranks (in descending order). This gives us a dense asymmetric var_outgoing_ranks matrix.
Convert the asymmetric outgoing ranks matrix into a symmetric var_balanced_ranks matrix by element-wise multiplying it with its transpose and taking the square root. That is, for each edge to be high-balanced-rank, the geomean of its outgoing rank has to be high in both nodes it connects.
Keep only balanced ranks of up to k * k * balanced_ranks_factor (default: 3.1622776601683795). This does a preliminary pruning of low-quality edges.
Prune the edges, keeping only the k * incoming_degree_factor (default: 3.0) highest-ranked incoming edges for each node, and then only the k * outgoing_degree_factor (default: 1.0) highest-ranked outgoing edges for each node, while ensuring that the highest-balanced-ranked outgoing edge of each node is preserved. This gives us an asymmetric var_pruned_ranks matrix, which has the structure we want, but not the correct edge weights yet.

Note

Balancing the ranks, and then pruning the incoming edges, ensures that “hub” nodes, that is nodes that many other nodes prefer to connect with, end up connected to a limited number of such “spoke” nodes.
If there is any node which is left with an out degree of less than min_outgoing_degree (default: 2), increase K by 10% and repeat steps 2-4.
Normalize the outgoing edge weights by dividing them with the sum of their balanced ranks, such that the sum of the outgoing edge weights for each node is 1. Note that there is always at least one outgoing edge for each node. This gives us the var_outgoing_weights for our directed K-Nearest-Neighbors graph.

Note

Ensuring each node has at least one outgoing edge allows us to always have at least one candidate grouping to add it to. This of course doesn’t protect the node from being rejected by its group as deviant.

Computing the Metacells¶

Candidates¶

metacells.tools.candidates.compute_candidate_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'obs_outgoing_weights', *, target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, cooldown_pass: float = 0.02, cooldown_node: float = 0.25, cooldown_phase: float = 0.75, increase_phase: float = 1.01, min_split_size_factor: float = 2.0, max_merge_size_factor: float = 0.5, max_split_min_cut_strength: float = 0.1, min_cut_seed_cells: int = 7, must_complete_cover: bool = False, random_seed: int, inplace: bool = True) → PandasSeries | None[source]¶

Assign observations (cells) to (raw, candidate) metacells based on what data (a weighted directed graph).

These candidate metacells typically go through additional vetting (e.g. deviant detection and dissolving too-small metacells) to obtain the final metacells.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-observation-per-observation matrix where each row is the outgoing weights from each observation to the rest, or just the name of a per-observation-per-observation annotation containing such a matrix. Typically this matrix will be sparse for efficient processing.

Returns

Observation (Cell) Annotations

candidate: The integer index of the (raw, candidate) metacell each cell belongs to. The metacells are in no particular order.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the observation names).

Computation Parameters

If cell_umis is not specified, use the sum of the what data for each cell.
We are trying to create metacells of size target_metacell_size cells and target_metacell_umis UMIs each. Compute the UMIs of the metacells by summing the cell_umis.
We start with an assignment of cells to seeds using choose_seeds() using min_seed_size_quantile (default: 0.85) and max_seed_size_quantile (default: 0.95) to compute them, picking a number of seeds such that the average metacell size would match the target.
We optimize the seeds using optimize_partitions() to obtain initial communities by maximizing the “stability” of the solution (probability of starting at a random node and moving either forward or backward in the graph and staying within the same metacell, divided by the probability of staying in the metacell if the edges connected random nodes). We pass it the cooldown_pass (default: 0.02) and cooldown_node (default: 0.25).
If min_split_size_factor (default: 2.0) is specified, split into two each community whose size is at least target_metacell_size * min_split_size_factor or whose UMIs are at least target_metacell_umis * min_split_size_factor, as long as half of the community is at least the min_metacell_size (default: 12). Then, re-optimize the solution (resulting in additional metacells). Every time we re-optimize, we multiply 1 - cooldown_pass by 1 - cooldown_phase (default: 0.75).
Using max_split_min_cut_strength (default: 0.1), if the minimal cut of a candidate is lower, split it into two. If one of the partitions is smaller than min_cut_seed_cells, then mark the cells in it as outliers, or if must_complete_cover is True, skip the cut altogether.
Using max_merge_size_factor (default: 0.5) and min_metacell_size (default: 12), make outliers of cells of a community whose size is at most target_metacell_size * max_merge_size_factor and whose UMIs are at most target_metacell_umis * max_merge_size_factor, or that contains fewer cells than min_metacell_size. Again, re-optimize, which will assign these cells to other metacells (resulting in one fewer metacell). We again apply the cooldown_phase every time we re-optimize.
Repeat the above steps until all metacell candidates are in the acceptable size range.
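
The size tests in the split and merge steps can be summarized with a small decision function. This is a sketch of the thresholds as worded above, using the defaults from the signature; the function name `size_action` and its string results are hypothetical, and the min-cut test is not included.

```python
def size_action(
    cells,
    umis,
    *,
    target_metacell_size=48,
    target_metacell_umis=160000,
    min_metacell_size=12,
    min_split_size_factor=2.0,
    max_merge_size_factor=0.5,
):
    """Classify one candidate community by its number of cells and total UMIs."""
    too_large = (
        cells >= target_metacell_size * min_split_size_factor
        or umis >= target_metacell_umis * min_split_size_factor
    )
    # Only split if each half would still be a viable metacell.
    if too_large and cells // 2 >= min_metacell_size:
        return "split"
    too_small = (
        cells <= target_metacell_size * max_merge_size_factor
        and umis <= target_metacell_umis * max_merge_size_factor
    )
    if too_small or cells < min_metacell_size:
        return "dissolve"
    return "keep"
```

Note the asymmetry: a community is split if either its size or its UMIs are too large, but it is dissolved for smallness only if both are too small (or if it has fewer cells than min_metacell_size).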

metacells.tools.candidates.choose_seeds(*, edge_weights: CompressedMatrix, seed_of_cells: ndarray | None = None, max_seeds_count: int, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, random_seed: int) → ndarray[source]¶

Choose initial assignment of cells to seeds based on the edge_weights.

Returns a vector assigning each node (cell) to a seed (initial community).

If seed_of_cells is specified, it is expected to contain a vector of partial seeds. Only cells which have a negative seed will be assigned a new seed. New seeds will be created so that the total number of seeds will not exceed max_seeds_count. The seed_of_cells will be modified in-place and returned.

Otherwise, a new vector is created, initialized with -1 (that is, no seed) for all nodes, filled as above, and returned.

Computation Parameters

We compute for each candidate node the number of nodes it is connected to (by an outgoing edge).
We pick as a seed a random node whose number of connected nodes (“seed size”) quantile is at least min_seed_size_quantile and at most max_seed_size_quantile. This ensures we pick seeds that aren’t too small or too large to get a good coverage of the population with a low number of seeds.
We assign each of the connected nodes to their seed, and discount them from the number of connected nodes of the remaining unassigned nodes.
We repeat this until we reach the target number of seeds.
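
The seeding loop above can be sketched in a few lines. This is an illustrative simplification rather than the library's algorithm: the graph is given as hypothetical adjacency lists instead of a CompressedMatrix, edge weights are ignored, and the quantile window is taken by position in the sorted list of seed sizes.

```python
import random


def choose_seeds_sketch(neighbors, max_seeds_count, min_q=0.85, max_q=0.95, random_seed=1):
    """Assign nodes to seeds; neighbors[i] lists the nodes that node i points to.

    Returns a list with the seed index of each node, or -1 if it was left unassigned.
    """
    rng = random.Random(random_seed)
    seed_of = [-1] * len(neighbors)
    for seed in range(max_seeds_count):
        free = [node for node, assigned in enumerate(seed_of) if assigned < 0]
        if not free:
            break
        # "Seed size": how many still-unassigned nodes each free node connects to.
        size = {n: sum(1 for m in neighbors[n] if seed_of[m] < 0) for n in free}
        ordered = sorted(free, key=lambda n: size[n])
        low = int(min_q * (len(ordered) - 1))
        high = int(max_q * (len(ordered) - 1))
        # Pick a random node whose seed size falls inside the quantile window.
        center = rng.choice(ordered[low : high + 1])
        seed_of[center] = seed
        for m in neighbors[center]:
            if seed_of[m] < 0:
                seed_of[m] = seed
    return seed_of
```

Recomputing the seed sizes on every iteration is what "discounts" already-assigned nodes from the remaining candidates.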

metacells.tools.candidates.optimize_partitions(*, edge_weights: CompressedMatrix, community_of_nodes: ndarray, node_umis: ndarray, low_partition_umis: int, target_partition_umis: int, high_partition_umis: int, low_partition_size: int, target_partition_size: int, high_partition_size: int, cooldown_pass: float = 0.02, cooldown_node: float = 0.25, random_seed: int) → float[source]¶

Optimize partition to candidate metacells (communities) using the edge_weights.

Returns the score of the optimized partition.

This modifies the community_of_nodes in-place.

The goal is to maximize the “stability” goal function which is defined to be the ratio between (1) the probability that, selecting a random node and either a random outgoing edge or a random incoming edge (biased by their weights), the node connected to by that edge is in the same community (metacell) and (2) the probability that a random edge would lead to this same community (the fraction of its number of nodes out of the total).

To maximize this, we repeatedly pass on a randomized permutation of the nodes, and for each node, move it to a random “better” community. When deciding if a community is better, we consider both (1) just the “local” product of the sum of the weights of incoming and outgoing edges between the node and the current and candidate communities and (2) the effect on the “global” goal function (considering the impact on this product for all other nodes connected to the current node).

We define a notion of temperature (initially 1 - cooldown_pass, where cooldown_pass defaults to 0.02) and we give a weight of temperature to the local score and (1 - temperature) to the global score. When we move to the next node, we multiply the temperature by 1 - cooldown_pass. If we did not move the node, we multiply its temperature by cooldown_node (default: 0.25). We skip looking at nodes which are colder than the global temperature to accelerate the algorithm. If we don’t move any node, we reduce the global temperature below that of any cold node; if there are no such nodes, we reduce it to zero to perform a final hill-climbing phase.

This simulated-annealing-like behavior helps the algorithm to escape local maximums, although of course no claim is made of achieving the global maximum of the goal function.
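
As a rough illustration of the temperature mechanism only, the sketch below shows how a local and a global score might be blended, and how the global temperature decays geometrically. The function names are hypothetical, and the per-node temperatures and the actual scoring of moves are not modeled.

```python
def blended_score(local_score, global_score, temperature):
    """Weight the local score by the temperature and the global score by the rest."""
    return temperature * local_score + (1 - temperature) * global_score


def temperature_schedule(steps, cooldown_pass=0.02):
    """Global temperature at each of the first `steps` cooling steps."""
    temperature = 1 - cooldown_pass
    schedule = []
    for _ in range(steps):
        schedule.append(temperature)
        temperature *= 1 - cooldown_pass
    return schedule
```

Early on (temperature close to 1) the cheap local score dominates, which allows moves that do not improve the global goal function; as the temperature approaches zero the decision is driven by the global score alone, which corresponds to the final hill-climbing phase.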

metacells.tools.candidates.score_partitions(*, node_umis: ndarray, low_partition_umis: float, target_partition_umis: float, high_partition_umis: float, low_partition_size: int, target_partition_size: int, high_partition_size: int, edge_weights: CompressedMatrix, partition_of_nodes: ndarray, temperature: float, with_orphans: bool = True) → None[source]¶

Compute the “stability” goal function which is defined to be the ratio between (1) the probability that, selecting a random node and either a random outgoing edge or a random incoming edge (biased by their weights), the node connected to by that edge is in the same community (metacell) and (2) the probability that a random edge would lead to this same community (the fraction of its number of nodes out of the total).

If with_orphans is True (the default), outlier nodes are included in the computation. In general we add 1e-6 to the product of the incoming and outgoing weights so we can safely log it for efficient computation; thus orphans are given a very small (non-zero) weight so the overall score is not zeroed even when including them.

Deviants¶

metacells.tools.deviants.find_deviant_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, candidates: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'candidate', min_gene_fold_factor: float = 3.0, min_compare_umis: int = 8, gap_skip_cells: int = 1, min_noisy_gene_fold_factor: float = 2.0, max_gene_fraction: float = 0.03, max_cell_fraction: float | None = 0.25, max_gap_cells_count: int = 3, max_gap_cells_fraction: float = 0.1, cells_regularization_quantile: float = 0.25, policy: str = 'gaps') → ndarray | Collection[int] | Collection[float] | PandasSeries[source]¶

Find cells which have significantly different gene expression from the metacells they belong to, based on what (default: __x__) data.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Obeys (ignores the genes of) the noisy_gene per-gene (variable) annotation, if any.

The exact method depends on the policy (one of gaps or max). By default we use the gaps policy as it gives a much lower fraction of deviants at a minor cost in the variance inside each metacell. The max policy provides the inverse trade-off, giving slightly more consistent metacells at the cost of a much higher fraction of deviants.

Returns

A boolean mask of all the cells which should be considered “deviant”.

Gaps Computation Parameters

Intuitively, for each gene for each metacell we can look at the sorted expression level of the gene in all the metacell’s cells. We look for a large gap between a few low-expressing or high-expressing cells and the rest of the cells. If we find such a gap, the few cells below or above it are considered to be deviants.

For each gene in each cell of each metacell, compute the log (base 2) of the fraction of the gene’s UMIs out of the total UMIs of the metacell, with a 1-UMI regularization factor.
Sort the expression level of each gene in each metacell.
Look for a gap of at least min_gene_fold_factor (default: 3.0), or for noisy_gene, an additional min_noisy_gene_fold_factor (default: 2.0) between the sorted gene expressions. If gap_skip_cells (default: 1) is 0, look for a gap between consecutive sorted cell expression levels. If it is 1 or 2, skip this number of entries. Ignore gaps if the total number of UMIs of the gene in the two compared cells is less than min_compare_umis (default: 8).
Ignore gaps that cause more than max_gap_cells_fraction (default: 0.1) and also more than max_gap_cells_count (default: 3) to be separated. That is, a single gene can only mark as deviants “a few” cells of the metacell.
If any cells were marked as deviants, re-run the above, ignoring any cells previously marked as deviants.
If the total number of deviant cells is more than max_cell_fraction (default: 0.25) of the cells, increase min_gene_fold_factor by 0.15 (~x1.1) and try again from the top.
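
The core of the gaps policy, for a single gene in a single metacell, can be sketched as below. This is a simplified illustration and not the library's code: it takes precomputed log2 expression levels, only compares consecutive sorted cells (that is, it behaves as if gap_skip_cells were 0), and leaves out noisy genes, min_compare_umis and the re-run and max_cell_fraction logic. The function name is hypothetical.

```python
def find_gap_deviants(
    log_levels,
    min_gene_fold_factor=3.0,
    max_gap_cells_count=3,
    max_gap_cells_fraction=0.1,
):
    """Indices of cells separated from the rest of the metacell by a large gap.

    log_levels: log2 expression level of one gene in each cell of one metacell.
    """
    order = sorted(range(len(log_levels)), key=lambda cell: log_levels[cell])
    # A gap is ignored only if it separates more cells than both limits allow.
    max_cells = max(max_gap_cells_count, max_gap_cells_fraction * len(order))
    deviants = set()
    for position in range(len(order) - 1):
        gap = log_levels[order[position + 1]] - log_levels[order[position]]
        if gap < min_gene_fold_factor:
            continue
        low_side = order[: position + 1]
        high_side = order[position + 1 :]
        # The few cells on the smaller side of the gap are the deviant candidates.
        small_side = low_side if len(low_side) <= len(high_side) else high_side
        if len(small_side) <= max_cells:
            deviants.update(small_side)
    return sorted(deviants)
```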

Max Computation Parameters

Compute for each candidate metacell the median fraction of the UMIs expressed by each gene. Scale this by each cell’s total UMIs to compute the expected number of UMIs for each cell. Compute the fold factor log2((actual UMIs + 1) / (expected UMIs + 1)) for each gene for each cell.
Compute the excess fold factor for each gene in each cell by subtracting min_gene_fold_factor (default: 3.0) from the above. For noisy_gene genes, also subtract min_noisy_gene_fold_factor (default: 2.0).

For each cell, consider the maximal gene excess fold factor. Consider all cells with a positive maximal excess fold factor as deviants. If more than max_cell_fraction (default: 0.25) of the cells have a positive maximal excess fold factor, increase the threshold from 0 so that only this fraction are marked as deviants.
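
A minimal sketch of the max policy for one candidate metacell, following the formula quoted above. It is illustrative only: it uses plain lists instead of matrices, ignores noisy genes, and does not apply the max_cell_fraction cap. The function name is hypothetical.

```python
import math
from statistics import median


def max_policy_deviants(cells_umis, min_gene_fold_factor=3.0):
    """Boolean deviant flag for each cell of a single candidate metacell.

    cells_umis: one list of per-gene UMI counts for each cell (non-empty cells).
    """
    totals = [sum(cell) for cell in cells_umis]
    genes = range(len(cells_umis[0]))
    # Median, across the metacell's cells, of the fraction of each gene.
    median_fraction = [
        median(cell[gene] / total for cell, total in zip(cells_umis, totals))
        for gene in genes
    ]
    deviants = []
    for cell, total in zip(cells_umis, totals):
        # Excess fold: log2((actual + 1) / (expected + 1)) minus the threshold.
        max_excess = max(
            math.log2((cell[gene] + 1) / (median_fraction[gene] * total + 1))
            - min_gene_fold_factor
            for gene in genes
        )
        deviants.append(max_excess > 0)
    return deviants
```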

Dissolve¶

Dissolve too-small metacells based on what (default: __x__) data.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Sets the following in adata:

Observation (Cell) Annotations

metacell: The integer index of the metacell each cell belongs to. The metacells are in no particular order. Cells with no metacell assignment are given a metacell index of -1.
dissolved: A boolean mask of the cells which were in a dissolved metacell.

Computation Parameters

If cell_umis is not specified, use the sum of the what data for each cell.
Mark all deviant cells as “outliers”. This can be the name of a per-observation (cell) annotation, or an explicit boolean mask of cells, or None if there are no deviant cells to mark.
Any metacell which has fewer cells than the min_metacell_size is dissolved into outlier cells.
If min_convincing_gene_fold_factor is None, preserve everything else. Otherwise:
We are trying to create metacells of size target_metacell_size cells and target_metacell_umis UMIs each. Compute the UMIs of the metacells by summing the cell_umis.
Using min_robust_size_factor (default: 0.5), any metacell whose total size is at least target_metacell_size * min_robust_size_factor or whose total UMIs are at least target_metacell_umis * min_robust_size_factor is preserved.
Using min_convincing_gene_fold_factor, preserve any remaining metacells which have at least one gene whose fold factor (log2((actual + 1) / (expected_by_overall_population + 1))) is at least this high.

Dissolve the remaining metacells into outlier cells.
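
The decision for a single metacell can be sketched as follows. This assumes that the robustness and convincing-gene tests are only applied when min_convincing_gene_fold_factor is given, which is how the steps above read; the function name `keep_metacell` and its arguments are hypothetical, and computing the best gene fold factor itself is not shown.

```python
def keep_metacell(
    cells,
    umis,
    best_gene_fold,
    *,
    target_metacell_size,
    target_metacell_umis,
    min_metacell_size,
    min_robust_size_factor=0.5,
    min_convincing_gene_fold_factor=None,
):
    """Return True to preserve the metacell, False to dissolve it into outliers."""
    if cells < min_metacell_size:
        return False
    if min_convincing_gene_fold_factor is None:
        return True
    # Robust metacells are preserved on size or on total UMIs alone.
    if cells >= target_metacell_size * min_robust_size_factor:
        return True
    if umis >= target_metacell_umis * min_robust_size_factor:
        return True
    # Small metacells survive only if some gene is convincingly different.
    return best_gene_fold >= min_convincing_gene_fold_factor
```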

Evaluating the Metacells¶

Group¶

Compute new data which has the sum of the what data of the observations (cells) for each group.

For example, having computed a metacell index for each cell, compute the per-metacell data for further analysis.

If groups is a string, it is expected to be the name of a per-observation vector annotation. Otherwise it should be a vector. The group indices should be integers, where negative values indicate “no group” and non-negative values indicate the index of the group to which each observation (cell) belongs.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

An annotated data where each observation is the sum of the group of original observations (cells). Observations with a negative group index are discarded. If all observations are discarded, return None.

The new data will contain only:

A single observation for each group. The name of each observation will be the optional prefix, followed by the group’s index, followed by . and a 2-digit checksum of the grouped members.
An X member holding the summed-per-group data.
A new grouped per-observation data which counts, for each group, the number of grouped observations summed into it.

If name is not specified, the data will be unnamed. Otherwise, if it starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.
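
The grouping itself amounts to a per-group sum. The following sketch shows it on plain lists; it is not the library's implementation, the function name `group_sums` is hypothetical, and observation naming (prefix and checksum) is not reproduced.

```python
def group_sums(rows, groups):
    """Sum the rows (cells) of each non-negative group.

    Returns (sums, grouped), where grouped counts the rows summed into each group,
    or None if every row has a negative group index.
    """
    groups_count = max(groups, default=-1) + 1
    if groups_count <= 0:
        return None
    sums = [[0] * len(rows[0]) for _ in range(groups_count)]
    grouped = [0] * groups_count
    for row, group in zip(rows, groups):
        if group < 0:
            continue  # Rows with "no group" are discarded.
        grouped[group] += 1
        for gene, value in enumerate(row):
            sums[group][gene] += value
    return sums, grouped
```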

metacells.tools.group.group_obs_annotation(adata: AnnData, gdata: AnnData, *, groups: str | ndarray | Collection[int] | Collection[float] | PandasSeries, name: str, formatter: Callable[[Any], Any] | None = None, method: str = 'majority', min_value_fraction: float = 0.5, conflict: Any | None = None, inplace: bool = True) → PandasSeries | None[source]¶

Transfer per-observation data from the per-observation (cell) adata to the per-group-of-observations (metacells) gdata.

Input

Annotated adata, where the observations are cells and the variables are genes, and the gdata containing the per-metacells summed data.

Returns

Observations (Cell) Annotations

<name>: The per-group-observation annotation computed based on the per-observation annotation.

If inplace (default: True), this is written to the gdata, and the function returns None. Otherwise this is returned as a pandas series (indexed by the group observation names).

Computation Parameters

Iterate on all the observations (groups, metacells) in gdata.
Consider all the cells whose groups annotation maps them into this group.
Consider all the name annotation values of these cells.
Compute an annotation value for the whole group of cells using the method. Supported methods are:

unique
All the values of all the cells in the group are expected to be the same; use this unique value for the whole group.

majority
Use the most common value across all cells in the group as the value for the whole group. If this value doesn’t have at least min_value_fraction (default: 0.5) of the cells, use the conflict (default: None) value instead.
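
Both methods can be sketched for a single group as follows. The function name `group_value` is hypothetical, and raising an error when unique values disagree is an assumption of this example rather than documented behavior.

```python
from collections import Counter


def group_value(values, method="majority", min_value_fraction=0.5, conflict=None):
    """Combine the annotation values of one group's cells into a single value."""
    if method == "unique":
        if len(set(values)) != 1:
            # Assumption: treat disagreement as an error in this sketch.
            raise ValueError("expected a single unique value in the group")
        return values[0]
    if method == "majority":
        value, count = Counter(values).most_common(1)[0]
        if count >= min_value_fraction * len(values):
            return value
        return conflict
    raise ValueError(f"unknown method: {method}")
```

For example, a group whose cells are annotated T, T and B gets the value T, while a group with four different values falls back to the conflict value.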

Quality¶

Compute the standard deviation of the log (base 2) of the fraction of each gene in the cells of the metacell.

Ideally, the standard deviation should be ~1/3rd of the deviants_min_gene_fold_factor (which is 3 by default), indicating that (all)most cells are within that maximal fold factor. In practice we may see higher values.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation (UMIs) matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one (fraction) observation for each metacell, a total_umis per metacell, and use the same genes as adata.

Returns

Sets the following in gdata:

Per-Variable Per-Observation (Gene-Cell) Annotations

inner_stdev_log
For each gene and metacell, the standard deviation of the log (base 2) of the fraction of the gene in the cells of the metacell, if it has a sufficient number of UMIs to make this meaningful (otherwise, is 0).

Computation Parameters

For each metacell:

Compute the log (base 2) of the fractions of the UMIs of each gene in each cell, regularized by 1 UMI.
Compute the standard deviation of these logs for each gene across all cells of each metacell.
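
A minimal sketch of this computation for one metacell. The exact form of the 1-UMI regularization is not spelled out above, so this example assumes the regularized fraction is (UMIs + 1) / total; it also uses the population standard deviation and skips the "sufficient number of UMIs" check. The function name is hypothetical.

```python
import math
from statistics import pstdev


def inner_stdev_log(cells_umis):
    """Per-gene standard deviation of log2 gene fractions across one metacell's cells.

    cells_umis: one list of per-gene UMI counts for each cell of the metacell.
    """
    totals = [sum(cell) for cell in cells_umis]
    genes = range(len(cells_umis[0]))
    result = []
    for gene in genes:
        # Assumed regularization: add 1 UMI to the gene's count before dividing.
        logs = [math.log2((cell[gene] + 1) / total) for cell, total in zip(cells_umis, totals)]
        result.append(pstdev(logs))
    return result
```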

metacells.tools.quality.compute_projected_folds(qdata: AnnData, from_query_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', fold_regularization: float = 1e-05, min_significant_gene_umis: float = 40) → None[source]¶

Compute the projected fold factors of genes for each query metacell.

This computes, for each metacell of the query, the fold factors between its corrected gene fractions and the gene fractions of its projection onto the atlas (see metacells.tools.project.compute_projection_weights()).

Input

Annotated query qdata, where the observations are query metacells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, qdata should contain the projected UMIs of each query metacell onto the atlas.

Returns

Sets the following in qdata:

Per-Variable Per-Observation (Gene-Cell) Annotations

projected_fold: For each gene and query metacell, the fold factor of this gene between the query and its projection.

Computation Parameters

For each group (metacell), for each gene, compute the gene’s fold factor log2((from_query_layer (default: corrected_fraction) + fold_regularization) / (to_query_layer (default: projected_fraction) fractions + fold_regularization)), similarly to metacells.tools.project.compute_projection_weights() (the default fold_regularization is 1e-05).
Set the fold factor to zero for every case where the total UMIs of the gene in the query metacell are not at least min_significant_gene_umis (default: 40).
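The two steps above amount to the following computation, shown here on plain dense arrays. The function and argument names are hypothetical; in particular gene_umis stands for the total UMIs of each gene in each query metacell, however those are obtained.

```python
import numpy as np


def projected_fold_sketch(
    corrected_fraction,
    projected_fraction,
    gene_umis,
    fold_regularization=1e-5,
    min_significant_gene_umis=40,
):
    """All inputs are metacells x genes arrays; returns the fold factors."""
    folds = np.log2(
        (corrected_fraction + fold_regularization)
        / (projected_fraction + fold_regularization)
    )
    # Folds of genes with too few UMIs are not significant, so zero them.
    folds[gene_umis < min_significant_gene_umis] = 0.0
    return folds


corrected = np.array([[4e-3, 2e-3, 1e-3]])
projected = np.array([[1e-3, 1e-3, 1e-3]])
gene_umis = np.array([[100, 10, 100]])
folds = projected_fold_sketch(corrected, projected, gene_umis)
```

The first gene is roughly four times higher in the query than in its projection (a fold of almost 2), the second gene has too few UMIs so its fold is zeroed, and the third gene is identical in both (a fold of 0).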

metacells.tools.quality.compute_similar_query_metacells(qdata: AnnData, max_projection_fold_factor: float = 3.0, max_projection_noisy_fold_factor: float = 2.0, min_fitted_query_marker_genes: float = 0, max_misfit_genes: int = 3, essential_genes_property: None | str | Collection[str] = None, min_essential_genes: int | None = None, fitted_genes_mask: ndarray | None = None) → None[source]¶

Mark query metacells that are “similar” to their projection on the atlas.

This does not guarantee the query metacell is “the same as” its projection on the atlas; rather, it means the two are “sufficiently similar” that one can be reasonably confident in applying atlas metadata to the query metacell based on the projection.

Input

Annotated query qdata, where the observations are metacells and the variables are genes.

The data should contain a per-observation-per-variable annotation projected_fold with the significant projection fold factors, as computed by compute_projected_folds(). If min_essential_genes and essential_genes_property are specified, then the data may contain additional per-variable (gene) mask(s) denoting the essential genes.

If a projected_noisy_gene mask exists, then the genes in it allow for a higher fold factor than normal genes.

Returns

Sets the following in qdata:

Per-Observation (Cell) Annotations

similar
A boolean mask indicating the query metacell is similar to its projection in the atlas.

Per-Variable Per-Observation (Gene-Cell) Annotations

misfit: Whether the gene has a too-high fold factor between the query and its projection in the atlas.

Computation Parameters

If fitted_genes_mask is not None, restrict the analysis to the genes listed in it.
Mark as dissimilar any query metacells which have more than max_misfit_genes (default: 3) genes whose projection fold is above max_projection_fold_factor (default: 3.0), or, for genes in projected_noisy_gene, above an additional max_projection_noisy_fold_factor (default: 2.0).
Mark as dissimilar any query metacells which did not fit at least min_fitted_query_marker_genes of the query marker genes.
If essential_genes_property and min_essential_genes are specified, the former should be the name(s) of boolean per-gene property/ies, and we will mark as dissimilar any query metacells which have at least this number of essential genes with a low projection fold factor.
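The misfit counting step can be sketched as below. This covers only the max_misfit_genes rule, not the marker or essential genes rules. The function name is hypothetical, and two details are assumptions rather than documented behavior: the fold is compared by absolute value, and noisy genes are allowed the sum of the two thresholds.

```python
import numpy as np


def mark_similar_sketch(
    projected_fold,
    noisy_genes_mask,
    max_projection_fold_factor=3.0,
    max_projection_noisy_fold_factor=2.0,
    max_misfit_genes=3,
):
    """projected_fold is metacells x genes; noisy_genes_mask is per gene."""
    # Assumption: noisy genes get the base threshold plus the additional one.
    thresholds = np.where(
        noisy_genes_mask,
        max_projection_fold_factor + max_projection_noisy_fold_factor,
        max_projection_fold_factor,
    )
    # Assumption: both over- and under-expression count as a misfit.
    misfit = np.abs(projected_fold) > thresholds
    similar = misfit.sum(axis=1) <= max_misfit_genes
    return similar, misfit


fold_values = np.array(
    [
        [0.5, -0.2, 4.0, 0.1, 0.0],
        [3.5, -3.5, 4.0, 3.2, 4.5],
    ]
)
noisy = np.array([False, False, False, False, True])
similar, misfit = mark_similar_sketch(fold_values, noisy)
```

The first metacell has a single misfit gene and stays similar; the second has four misfit genes (its noisy gene is within the relaxed threshold) and is marked as dissimilar.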

Given an assignment of observations (cells) to groups (metacells), compute for each outlier the “most similar” group.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. Note that there’s no requirement that the gdata will contain the groups defined in adata. That is, it is possible to give query cells data in adata and atlas metacells in gdata to find the most similar atlas metacell for each outlier query cell.

Returns

Sets the following in adata:

Per-Observation (Cell) Annotations

most_similar (default: most_similar)
For each observation (cell), the index of the “most similar” group.

Computation Parameters

Compute the log2 of the fraction of each gene in each of the outlier cells and the group metacells using the value_regularization (default: 1e-05).
Cross-correlate each of the outlier cells with each of the group metacells, in a reproducible manner.
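A minimal numpy sketch of these two steps, assuming dense inputs. The function name is hypothetical, and using Pearson correlation via np.corrcoef and picking the best group with argmax are assumptions about details the text leaves open.

```python
import numpy as np


def most_similar_groups_sketch(outlier_umis, group_fractions, value_regularization=1e-5):
    """outlier_umis is outliers x genes; group_fractions is groups x genes."""
    outlier_fractions = outlier_umis / outlier_umis.sum(axis=1, keepdims=True)
    log_outliers = np.log2(outlier_fractions + value_regularization)
    log_groups = np.log2(group_fractions + value_regularization)
    outliers_count = log_outliers.shape[0]
    # np.corrcoef stacks the rows of both inputs; keep only the
    # outliers-versus-groups block of the full correlation matrix.
    correlations = np.corrcoef(log_outliers, log_groups)[:outliers_count, outliers_count:]
    return correlations.argmax(axis=1)


group_fractions = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
outlier_umis = np.array([[70.0, 25.0, 5.0], [2.0, 20.0, 78.0]])
most_similar = most_similar_groups_sketch(outlier_umis, group_fractions)
```

Each outlier here resembles exactly one of the two groups, so the result is the index of that group.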

Given an assignment of observations (cells) to groups (metacells) or, if an outlier, to the most similar groups, compute for each observation and gene the fold factor relative to its group for the purpose of detecting deviant cells.

Ideally, all grouped cells would have no genes with high enough fold factors to be considered deviants, and all outlier cells would. In practice grouped cells might have a (few) such genes due to the restriction on the fraction of deviants.

It is important not to read too much into the results for a single cell, but looking at which genes appear for cell populations (e.g., cells with specific metadata such as batch identification) might be instructive.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one observation for each group, and use the same genes as adata.

Returns

Sets the following in adata:

Per-Variable Per-Observation (Gene-Cell) Annotations

deviant_fold
The fold factor between the cell’s UMIs and the expected number of UMIs for the purpose of computing deviant cells.

Computation Parameters

For each cell, compute the expected UMIs for each gene given the fraction of the gene in the metacell associated with the cell (the one it belongs to, or the most similar one for outliers).
If the number of UMIs in the metacell (for grouped cells), or sum of the UMIs of the gene in an outlier cell and the metacell, is less than min_gene_total (default: 40), set the fold factor to 0 as we do not have sufficient data to robustly estimate it.

metacells.tools.quality.compute_inner_folds(*, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell') → None[source]¶: Given adata with a computed deviant_fold for each gene for each cell, set inner_fold in gdata to hold, for each gene for each metacell, the deviant_fold with the maximal absolute value.
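As an illustration of "the deviant_fold with the maximal absolute value", the sketch below picks, per metacell and gene, the signed fold whose magnitude is largest. The function name is hypothetical and the inputs are plain dense arrays; keeping the sign of the selected fold is an assumption.

```python
import numpy as np


def inner_folds_sketch(deviant_fold, metacell_of_cells, metacells_count):
    """deviant_fold is cells x genes; the result is metacells x genes."""
    genes_count = deviant_fold.shape[1]
    result = np.zeros((metacells_count, genes_count))
    for metacell in range(metacells_count):
        folds = deviant_fold[metacell_of_cells == metacell]
        if folds.shape[0] == 0:
            continue
        # Row index of the largest magnitude fold for each gene.
        rows = np.abs(folds).argmax(axis=0)
        result[metacell] = folds[rows, np.arange(genes_count)]
    return result


deviant_fold = np.array([[1.0, -4.0], [-2.0, 3.0], [0.5, 0.0]])
metacell_of_cells = np.array([0, 0, 1])
inner_fold = inner_folds_sketch(deviant_fold, metacell_of_cells, 2)
```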

metacells.tools.quality.compute_type_genes_normalized_variances(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group_property: str = 'metacell', type_property: str = 'type', type_gene_normalized_variance_quantile: float = 0.95) → None[source]¶

Given metacells annotated data with type annotations, compute for each gene for each type how variable it is in the cells of the metacells of that type.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. This should have a type annotation.

Returns

Sets the following in gdata:

Per-Variable (gene) Annotations:

normalized_variance_in_<type>
For each type, the normalized variance (variance over mean) of the gene in the cells of the metacells of this type.

Computation Parameters

For each type_property (default: type) of metacell in gdata, for each metacell of this type, consider all the cells in adata whose group_property (default: metacell) is that metacell, compute the normalized variance (variance over mean) of each gene’s expression level, when normalizing each cell’s total UMIs to the median in its metacell.
Take the type_gene_normalized_variance_quantile (default: 0.95) of the normalized variance of each gene across all metacells of each type.
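The computation described above can be sketched on dense arrays as follows. The function and argument names are hypothetical, and giving a gene with a zero mean a normalized variance of 0 is an assumption.

```python
import numpy as np


def type_gene_normalized_variances_sketch(
    umis, metacell_of_cells, type_of_metacells, quantile=0.95
):
    """umis is cells x genes; returns a per-gene vector for each type."""
    result = {}
    for type_name in sorted(set(type_of_metacells)):
        per_metacell = []
        for metacell, metacell_type in enumerate(type_of_metacells):
            if metacell_type != type_name:
                continue
            cells = umis[metacell_of_cells == metacell].astype(float)
            totals = cells.sum(axis=1, keepdims=True)
            # Normalize each cell's total UMIs to the median in its metacell.
            scaled = cells * (np.median(totals) / totals)
            means = scaled.mean(axis=0)
            variances = scaled.var(axis=0)
            # Assumption: genes with a zero mean get a normalized variance of 0.
            per_metacell.append(
                np.divide(variances, means, out=np.zeros_like(means), where=means > 0)
            )
        result[type_name] = np.quantile(np.array(per_metacell), quantile, axis=0)
    return result


umis = np.array([[10, 10], [10, 10], [5, 15], [15, 5]])
metacell_of_cells = np.array([0, 0, 1, 1])
type_of_metacells = ["A", "B"]
variances_by_type = type_gene_normalized_variances_sketch(
    umis, metacell_of_cells, type_of_metacells
)
```

With a single metacell per type, the quantile is simply that metacell's normalized variance: 0 for the uniform type A cells, and 25 / 10 = 2.5 for the type B cells.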

Given annotated data which is a slice containing just the outliers, where each has a “most similar” group, compute for each observation and gene the fold factor relative to its group.

All outliers should have at least one (typically several) genes with high fold factors, which are the reason they couldn’t be merged into their most similar group.

Input

Annotated adata, where the observations are outlier cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. It should have a marker_gene mask.

Returns

Sets the following in adata:

Per-Variable Per-Observation (Gene-Cell) Annotations

<most_similar>_fold (default: most_similar_fold)
The fold factor between the gene expression in the outlier and its expression in the most similar group (unless the value is too low to be of interest, in which case it will be zero).

Computation Parameters

For each outlier, compute the expected UMIs for each gene given the fraction of the gene in the metacell associated with the outlier by the most_similar (default: most_similar).
If the sum of the UMIs of the gene in the cell and the metacell is less than min_gene_total (default: 40), set the fold factor to 0 as we do not have sufficient data to robustly estimate it.

metacells.tools.quality.count_significant_inner_folds(adata: AnnData, *, min_gene_fold_factor: float = 3.0) → None[source]¶

Given grouped (metacells) data, count for each gene in how many metacells there is at least one cell with a fold factor above some threshold.

Input

Annotated adata, where the observations are metacells and the variables are genes, with an inner_fold layer (as computed by compute_inner_folds).

Returns

Sets the significant_inner_folds_count annotation, counting for each gene the number of metacells where the inner_fold is at least min_gene_fold_factor (default: 3.0), that is, where at least one cell in the metacell has a high fold factor for the gene’s expression compared to the estimated overall gene expression in the metacell.
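The count itself is a one-line reduction. The sketch below follows the text literally and compares the signed inner_fold against the threshold; whether strongly negative folds should also count is not stated here, so treat that as an open assumption. The function name is hypothetical.

```python
import numpy as np


def count_significant_inner_folds_sketch(inner_fold, min_gene_fold_factor=3.0):
    """inner_fold is metacells x genes; returns one count per gene."""
    return (inner_fold >= min_gene_fold_factor).sum(axis=0)


inner_fold = np.array([[3.0, 1.0, -5.0], [4.0, 2.9, 0.0]])
counts = count_significant_inner_folds_sketch(inner_fold)
```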

Distincts¶

Compute for each observation (cell) and each variable (gene) how much is the what (default: __x__) value different from the overall population.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Per-Observation-Per-Variable (Cell-Gene) Annotations:

distinct_ratio: For each gene in each cell, the log (base 2) of the ratio between the fraction of the gene in the cell and the fraction of the gene in the overall population (sum of cells).

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas frame (indexed by the observation and distinct gene rank).

Computation Parameters

Compute, for each gene, the fraction of the gene’s values out of the total sum of the values (that is, the mean fraction of the gene’s expression in the population).
Compute, for each cell, for each gene, the fraction of the gene’s value out of the sum of the values in the cell (that is, the fraction of the gene’s expression in the cell).
Divide the two to get the distinct ratio (that is, how much the gene’s expression in the cell is different from the overall population), first adding the normalization (default: 0) to both.
Compute the log (base 2) of the result and use it as the fold factor.
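These four steps translate directly into a few numpy operations. The function name is hypothetical and the input is a plain dense cells x genes array. Note that with the default normalization of 0, any zero value yields an infinite log, so a small positive normalization is usually more practical.

```python
import numpy as np


def distinct_folds_sketch(values, normalization=0.0):
    """values is cells x genes; returns the log2 distinct ratio of each entry."""
    # Fraction of each gene out of the total sum of the values.
    population_fractions = values.sum(axis=0) / values.sum()
    # Fraction of each gene out of the sum of the values in each cell.
    cell_fractions = values / values.sum(axis=1, keepdims=True)
    ratios = (cell_fractions + normalization) / (population_fractions + normalization)
    return np.log2(ratios)


values = np.array([[30.0, 10.0], [10.0, 30.0]])
distinct_fold = distinct_folds_sketch(values)
```

Each gene makes up half of the overall population here, so a cell where it makes up 75% has a fold of log2(1.5), and a cell where it makes up 25% has a fold of -1.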

metacells.tools.distinct.find_distinct_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'distinct_fold', *, distinct_genes_count: int = 20, inplace: bool = True) → Tuple[PandasFrame, PandasFrame] | None[source]¶

Find for each observation (cell) the genes in which its what (default: distinct_fold) value is most distinct from the general population. This is typically applied to the metacells data rather than to the cells data.

Input

Annotated adata, where the observations are (meta)cells and the variables are genes, including per-observation-per-variable annotated folds data (distinct_fold), e.g. as computed by compute_distinct_folds().

Returns

Observation-Any (Cell) Annotations

cell_distinct_gene_indices: For each cell, the indices of its top distinct_genes_count genes.
cell_distinct_gene_folds: For each cell, the fold factors of its top distinct_genes_count genes.

If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as two pandas frames (indexed by the observation and distinct gene rank).

Computation Parameters

Fetch the previously computed per-observation-per-variable what data.
Keep the distinct_genes_count (default: 20) top absolute fold factors.
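Selecting the top absolute fold factors per observation can be sketched with a sort along the genes axis. The function name is hypothetical, and ordering the kept genes from the largest to the smallest magnitude is an assumption about the output layout.

```python
import numpy as np


def find_distinct_genes_sketch(folds, distinct_genes_count=20):
    """folds is observations x genes; returns (gene indices, signed folds)."""
    count = min(distinct_genes_count, folds.shape[1])
    # Sort genes by decreasing absolute fold and keep the first few.
    order = np.argsort(-np.abs(folds), axis=1, kind="stable")[:, :count]
    top_folds = np.take_along_axis(folds, order, axis=1)
    return order, top_folds


folds = np.array([[0.5, -3.0, 2.0], [1.0, 0.2, -0.1]])
gene_indices, gene_folds = find_distinct_genes_sketch(folds, distinct_genes_count=2)
```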

Given a subset of the observations (cells), compute for each gene how distinct its what (default: __x__) value is in the subset compared to the overall population.

This is the area-under-curve of the receiver operating characteristic (AUROC) for the gene, that is, the probability that a randomly selected observation (cell) in the subset will have a higher value than a randomly selected observation (cell) outside the subset.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Variable (Gene) Annotations

<prefix>_fold: Store the ratio of the expression of the gene in the subset as opposed to the rest of the population.
<prefix>_auroc: Store the distinctiveness of the gene in the subset as opposed to the rest of the population.

If prefix (default: None) is specified, this is written to the data. Otherwise this is returned as two pandas series (indexed by the gene names).

Computation Parameters

Use the subset to assign a boolean label to each observation (cell). The subset can be a vector of integer observation names, or a boolean mask, or the string name of a per-observation annotation containing the boolean mask.
If scale is False, use the data as-is. If it is True, divide the data by the sum of each observation (cell). If it is a string, it should be the name of a per-observation annotation to use. Otherwise, it should be a vector of the scale factor for each observation (cell).
Compute the fold ratios using the normalization (no default!) and the AUROC for each gene, for the scaled data based on this mask.
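The AUROC of a single gene can be computed straight from its definition as a probability over pairs of observations. The sketch below is a brute-force illustration, quadratic in the number of observations, and not how a library would do it at scale; the function name is hypothetical, and counting ties as half a win is a common convention assumed here.

```python
import numpy as np


def subset_auroc_sketch(values, subset_mask):
    """values is one gene's per-observation vector; subset_mask is boolean."""
    inside = values[subset_mask]
    outside = values[~subset_mask]
    # Compare every observation in the subset with every one outside it.
    greater = (inside[:, None] > outside[None, :]).sum()
    ties = (inside[:, None] == outside[None, :]).sum()
    # Assumption: a tie counts as half a win.
    return (greater + 0.5 * ties) / (inside.size * outside.size)


values = np.array([5.0, 3.0, 1.0, 4.0])
subset_mask = np.array([True, True, False, False])
auroc = subset_auroc_sketch(values, subset_mask)
```

Of the four inside-outside pairs, three have the higher value inside the subset, so the AUROC is 0.75.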

Visualizing the Metacells¶

Layout¶

metacells.tools.layout.umap_by_distances(adata: AnnData, distances: str | ndarray | CompressedMatrix = 'umap_distances', *, prefix: str = '', k: int = 15, dimensions: int = 2, min_dist: float = 0.5, spread: float = 1.0, random_seed: int) → None[source]¶

Compute layout for the observations using UMAP, based on a distances matrix.

Input

The input annotated adata is expected to contain a per-observation-per-observation property distances (default: umap_distances), which describes the distance between each two observations (cells). The distances must be non-negative, symmetrical, and zero for self-distances (on the diagonal).
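The three requirements on the distances can be checked before invoking the layout. The helper below is a hypothetical convenience for dense matrices and is not part of metacells; the tolerance is an arbitrary choice.

```python
import numpy as np


def distance_problems(distances, tolerance=1e-9):
    """List the ways a dense distances matrix violates the requirements."""
    if distances.ndim != 2 or distances.shape[0] != distances.shape[1]:
        return ["not a square matrix"]
    problems = []
    if (distances < 0).any():
        problems.append("negative distances")
    if not np.allclose(distances, distances.T, atol=tolerance):
        problems.append("not symmetrical")
    if not np.allclose(np.diag(distances), 0.0, atol=tolerance):
        problems.append("non-zero self-distances")
    return problems


good = np.array([[0.0, 1.0], [1.0, 0.0]])
bad = np.array([[0.0, 1.0], [2.0, 0.5]])
good_problems = distance_problems(good)
bad_problems = distance_problems(bad)
```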

Returns

Sets the following annotations in adata:

Observation (Cell) Annotations

<prefix>x, <prefix>y: Coordinates for UMAP 2D projection of the observations (if dimensions is 2).
<prefix>u, <prefix>v, <prefix>w: Coordinates for UMAP 3D projection of the observations (if dimensions is 3).

Computation Parameters

Invoke UMAP to compute a layout of some dimensions (default: 2D) using min_dist (default: 0.5), spread (default: 1.0) and k (default: 15). If the spread is lower than the minimal distance, it is raised to the minimal distance. If random_seed is not zero, then it is passed to UMAP to force the computation to be reproducible. However, this means UMAP will use a single-threaded implementation that will be slower.

metacells.tools.layout.spread_coordinates(adata: AnnData, *, prefix: str = '', suffix: str = '_spread', cover_fraction: float = 0.3333333333333333, noise_fraction: float = 0.1, random_seed: int) → None[source]¶

Move UMAP points so they cover some fraction of the plot area without overlapping.

Input

The input annotated adata is expected to contain the per-observation properties <prefix>x and <prefix>y (default prefix: the empty string) which contain the UMAP coordinates.

Returns

Sets the following annotations in adata:

Observation (Cell) Annotations

<prefix>x<suffix>, <prefix>y<suffix> (default suffix: _spread): The new coordinates which will be spread out so the points do not overlap and cover some fraction of the total plot area.

Computation Parameters

Move the points so they cover cover_fraction (default: 0.3333333333333333) of the total plot area. Also add a noise of the noise_fraction (default: 0.1) of the minimal distance between the points. A non-zero random_seed will make this reproducible.

Projecting onto Metacells¶

Project¶

metacells.tools.project.compute_projection_weights(*, adata: AnnData, qdata: AnnData, from_atlas_layer: str = 'corrected_fraction', from_query_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', log_data: bool = True, fold_regularization: float = 1e-05, min_significant_gene_umis: float = 40, max_consistency_fold_factor: float = 2.0, candidates_count: int = 50, min_candidates_fraction: float = 0.3333333333333333, min_usage_weight: float = 1e-05, second_anchor_indices: List[int] | None = None, reproducible: bool) → CompressedMatrix[source]¶

Compute the weights and results of projecting a query onto an atlas.

Input

Annotated query qdata and atlas adata, where the observations are cells and the variables are genes. The atlas should contain from_atlas_layer (default: corrected_fraction) containing gene fractions, and the query should similarly contain from_query_layer (default: corrected_fraction) containing gene fractions.

Returns

A CSR matrix whose rows are query metacells and columns are atlas metacells, where each entry is the weight of the atlas metacell in the projection of the query metacells. The sum of weights in each row (that is, for a single query metacell) is 1. The weighted sum of the atlas metacells using these weights is the “projected” image of the query metacell onto the atlas.

In addition, sets the following annotations in qdata:

Observation (Cell) Annotations

similar: A boolean mask indicating whether the query metacell is similar to its projection onto the atlas. If False, the metacell is said to be “dissimilar”, which may indicate the query contains cell states that do not appear in the atlas.

Observation-Variable (Cell-Gene) Annotations

to_query_layer (default: projected_fraction): A matrix of gene fractions describing the “projected” image of the query metacell onto the atlas. This projection is a weighted average of some atlas metacells (using the computed weights returned by this function).

Computation Parameters

All fold computations (log2 of the ratio between gene fractions) use the fold_regularization (default: 1e-05).

For each query metacell:

Correlate the metacell with all the atlas metacells, and pick the highest-correlated one as the “anchor”. If second_anchor_indices is not None, then the qdata must contain only a single query metacell, and is expected to contain a projected per-observation-per-variable matrix containing the projected image of this query metacell on the atlas using a single anchor. The code will compute the residual of the query and the atlas relative to this projection and pick a second atlas anchor whose residuals are the most correlated to the query metacell’s residuals. If reproducible, a slower (still parallel) but reproducible algorithm will be used.
Consider (for each anchor) the candidates_count (default: 50) candidate metacells with the highest correlation with the query metacell.
Keep as candidates only atlas metacells whose maximal gene fold factor compared to the anchor(s) is at most max_consistency_fold_factor (default: 2.0). Keep at least min_candidates_fraction (default: 0.3333333333333333) of the original candidates even if they are less consistent. For this computation, ignore the fold factors of genes whose sum of UMIs in the anchor(s) and the candidate metacells is less than min_significant_gene_umis (default: 40).
Compute the non-negative weights (with a sum of 1) of the selected candidates that give the best projection of the query metacells onto the atlas. If log_data (default: True), try to fit the log (base 2) of the fractions, otherwise, try to fit the fractions themselves. Since the algorithm for computing these weights rarely produces an exact 0 weight, reduce all weights less than the min_usage_weight (default: 1e-05) to zero. If second_anchor_indices is not None, it is set to the list of indices of the used atlas metacells candidates correlated with the second anchor.

metacells.tools.project.compute_projected_fractions(*, adata: AnnData, qdata: AnnData, from_atlas_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', log_data: bool = True, fold_regularization: float = 1e-05, weights: ndarray | CompressedMatrix) → None[source]¶

Compute the projected image of a query on an atlas.

Input

Annotated query qdata and atlas adata, where the observations are cells and the variables are genes. The atlas should contain from_atlas_layer (default: corrected_fraction) containing gene fractions.

Returns

Sets to_query_layer (default: projected_fraction) in the query containing the gene fractions of the projection of the atlas fractions using the weights matrix.

Note

It is important to use the same log_data value as that given to compute_projection_weights to compute the weights (default: True).
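To see why the log_data setting must match, consider this sketch of applying a weights matrix to atlas fractions. The function name is hypothetical, and the exact handling of the regularization in the log case (add it before the log, subtract it after exponentiating) is an assumption about the implementation.

```python
import numpy as np


def projected_fractions_sketch(weights, atlas_fractions, log_data=True, fold_regularization=1e-5):
    """weights is query x atlas metacells; atlas_fractions is atlas x genes."""
    if not log_data:
        # Weighted arithmetic mean of the atlas fractions.
        return weights @ atlas_fractions
    # Weighted mean in log2 space, that is, a weighted geometric mean.
    log_atlas = np.log2(atlas_fractions + fold_regularization)
    # Assumption: the regularization is removed after exponentiating.
    return 2.0 ** (weights @ log_atlas) - fold_regularization


weights = np.array([[0.5, 0.5]])
atlas_fractions = np.array([[0.1, 0.9], [0.3, 0.7]])
linear_projection = projected_fractions_sketch(weights, atlas_fractions, log_data=False)
log_projection = projected_fractions_sketch(weights, atlas_fractions, log_data=True)
```

The same weights give 0.2 for the first gene when averaging the fractions, but about 0.173 when averaging their logs, so weights fitted in one space do not reproduce the intended projection in the other.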

metacells.tools.project.convey_atlas_to_query(*, adata: AnnData, qdata: AnnData, weights: ndarray | CompressedMatrix, property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, method: Callable[[ndarray | Collection[int] | Collection[float] | PandasSeries, ndarray | Collection[int] | Collection[float] | PandasSeries], Any] = <function highest_weight>) → None[source]¶

Convey the value of a property from per-observation atlas data to per-observation query data.

The input annotated adata is expected to contain a per-observation (cell) annotation named property_name. Given the weights matrix, where each row specifies the weights of the atlas metacells used to project a single query metacell, this will generate a new per-observation (group) annotation in qdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.

The aggregation method (by default, metacells.utilities.computation.highest_weight()) is any function taking two arrays, weights and values, and returning a single value.
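A custom aggregation method only needs to follow this two-argument contract. The sketch below defines one such function and applies it to each row of a dense weights matrix. Both names (heaviest_value, convey_sketch) are hypothetical; this is not the implementation of highest_weight(), and passing only the atlas metacells with non-zero weight to the method is an assumption.

```python
import numpy as np


def heaviest_value(weights, values):
    """Return the value whose total weight is the largest."""
    totals = {}
    for weight, value in zip(weights, values):
        totals[value] = totals.get(value, 0.0) + float(weight)
    return max(totals, key=totals.get)


def convey_sketch(weights_matrix, atlas_values, method=heaviest_value):
    """Aggregate the atlas values for each query metacell (row of weights)."""
    conveyed = []
    for row in weights_matrix:
        # Assumption: only atlas metacells with a non-zero weight are used.
        used = np.nonzero(row)[0]
        conveyed.append(method(row[used], atlas_values[used]))
    return conveyed


weights_matrix = np.array([[0.4, 0.35, 0.25], [0.0, 0.2, 0.8]])
atlas_types = np.array(["T", "B", "B"])
query_types = convey_sketch(weights_matrix, atlas_types)
```

For the first query metacell the single heaviest atlas metacell is of type T, but the two type B metacells together carry more weight, which shows how the choice of method can change the conveyed value.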