Tools¶
Functions for analysis tools.
Tools take as input some annotated data and either return some computed results, or write the
results as new annotations within the same data, or return a new annotated data object containing
the results. Tools are meant to be composable into a complete processing metacells.pipeline,
typically by having one tool create annotation(s) with “agreed upon” name(s) and another tool
process them further. While this is flexible and convenient, there is no static typing or analysis
to ensure that the input of the next tool was actually created by the previous tool(s).
All the functions included here are exported under metacells.tl.
General Tools¶
Apply¶
- class metacells.tools.apply.DefaultValues(slice: Any, full: Any)[source]¶
Default values to use in apply_obs_annotations() and apply_var_annotations().
- slice: Any¶
The default value to use for the slice data.
- full: Any¶
The default value to use for the full data.
- class metacells.tools.apply.Skip[source]¶
A special value indicating to skip the annotation if it does not exist.
- class metacells.tools.apply.Raise[source]¶
A special value indicating to raise a KeyError if an annotation does not exist.
- metacells.tools.apply.apply_obs_annotations(adata: AnnData, sdata: AnnData, annotations: Dict[str, DefaultValues], *, indices: str | ndarray | Collection[int] | Collection[float] | PandasSeries) None [source]¶
Apply per-observation (cell) annotations of a slice sdata to the full adata.
Input
Annotated adata, and a slice of it, sdata, where indices is either the vector of full indices of the slice observations, or the name of a per-observation annotation of sdata that contains this vector.
Computation Parameters
Loop on each of the named annotations, where the value associated with the name is used as the default value (see below).
If the slice data does not contain a per-observation (cell) annotation of this name, consider the DefaultValues.slice:
If it is Raise, raise a KeyError.
If it is Skip, do not apply the annotation to the full data.
Otherwise, behave as if the annotation’s value was a vector containing the DefaultValues.slice value.
If the full data does not contain a per-observation (cell) annotation of this name, consider the DefaultValues.full:
If it is Raise, raise a KeyError.
If it is Skip, do not apply the annotation to the full data.
Otherwise, initialize the annotation to a vector containing the DefaultValues.full value.
Apply the slice data values to the entries of the full data identified by the indices.
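The default-value rules above can be sketched in plain Python. This is only an illustration of the described logic, not the library's implementation: the Raise and Skip classes, the apply_annotation helper and the dictionary-of-lists data layout are all stand-ins invented for this example.

```python
from typing import Any, Dict, List, Sequence


class Raise:
    """Hypothetical stand-in marker: a missing annotation is an error."""


class Skip:
    """Hypothetical stand-in marker: a missing annotation is silently skipped."""


def apply_annotation(
    full: Dict[str, List[Any]],
    sliced: Dict[str, List[Any]],
    name: str,
    indices: Sequence[int],
    full_size: int,
    slice_default: Any,
    full_default: Any,
) -> None:
    # Resolve the slice values, falling back to the slice default.
    if name in sliced:
        values = sliced[name]
    elif slice_default is Raise:
        raise KeyError(name)
    elif slice_default is Skip:
        return
    else:
        values = [slice_default] * len(indices)

    # Make sure the full data has a vector to write into.
    if name not in full:
        if full_default is Raise:
            raise KeyError(name)
        if full_default is Skip:
            return
        full[name] = [full_default] * full_size

    # Scatter the slice values into the entries identified by the indices.
    for position, value in zip(indices, values):
        full[name][position] = value


full_obs: Dict[str, List[Any]] = {}
slice_obs = {"type": ["T", "B"]}
apply_annotation(full_obs, slice_obs, "type", [1, 3], 5, Raise, "unknown")
print(full_obs["type"])
```

Here the slice covers entries 1 and 3 of a five-entry full data, so the remaining entries keep the full default.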
- metacells.tools.apply.apply_var_annotations(adata: AnnData, sdata: AnnData, annotations: Dict[str, DefaultValues], *, indices: str | ndarray | Collection[int] | Collection[float] | PandasSeries) None [source]¶
Apply per-variable (gene) annotations of a slice sdata to the full adata.
Input
Annotated adata, and a slice of it, sdata, where indices is either the vector of full indices of the slice variables, or the name of a per-variable annotation of sdata that contains this vector.
Computation Parameters
Loop on each of the named annotations, where the value associated with the name is used as the default value (see below).
If the slice data does not contain a per-variable (gene) annotation of this name, consider the DefaultValues.slice:
If it is Raise, raise a KeyError.
If it is Skip, do not apply the annotation to the full data.
Otherwise, behave as if the annotation’s value was a vector containing the DefaultValues.slice value.
If the full data does not contain a per-variable (gene) annotation of this name, consider the DefaultValues.full:
If it is Raise, raise a KeyError.
If it is Skip, do not apply the annotation to the full data.
Otherwise, initialize the annotation to a vector containing the DefaultValues.full value.
Apply the slice data values to the entries of the full data identified by the indices.
Convey¶
- metacells.tools.convey.convey_group_to_obs(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, default: Any = None) None [source]¶
Project the value of a property from per-group data to per-observation data.
The input annotated gdata is expected to contain a per-observation (group) annotation named property_name. The input annotated adata is expected to contain a per-observation annotation named group which identifies the group each observation (cell) belongs to.
This will generate a new per-observation (cell) annotation in adata, named to_property_name (by default, the same as property_name), containing the value of the property for the group it belongs to. If the group annotation contains a negative number instead of a valid group index, the default value is used.
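As a rough illustration of this projection, the sketch below uses plain lists instead of annotated data. The helper name project_group_to_obs and the example values are hypothetical; only the rule (index the group values by each cell's group, use the default for negative indices) comes from the description above.

```python
from typing import Any, List, Sequence


def project_group_to_obs(
    group_of_cells: Sequence[int],
    group_values: Sequence[Any],
    default: Any = None,
) -> List[Any]:
    # A negative group index means the cell belongs to no group.
    return [
        group_values[group] if group >= 0 else default
        for group in group_of_cells
    ]


cell_metacell = [0, 2, -1, 1, 2]      # per-cell group index
metacell_type = ["T", "B", "NK"]      # per-group property
cell_type = project_group_to_obs(cell_metacell, metacell_type, default="Outliers")
print(cell_type)
```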
- metacells.tools.convey.convey_obs_to_obs(*, adata: AnnData, bdata: AnnData, property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, default: Any = None) None [source]¶
Project the value of a property from one annotated data to another.
The observation names are expected to be compatible between adata and bdata. The annotated adata is expected to contain a per-observation (cell) annotation named property_name.
This will generate a new per-observation (cell) annotation in bdata, named to_property_name (by default, the same as property_name), containing the value of the observation with the same name in adata. If no such observation exists, the default value is used.
- metacells.tools.convey.convey_obs_to_group(*, adata: ~anndata._core.anndata.AnnData, gdata: ~anndata._core.anndata.AnnData, group: str = 'metacell', property_name: str, formatter: ~typing.Callable[[~typing.Any], ~typing.Any] | None = None, to_property_name: str | None = None, method: ~typing.Callable[[~numpy.ndarray | ~typing.Collection[int] | ~typing.Collection[float] | ~metacells.utilities.typing.PandasSeries], ~typing.Any] = <function most_frequent>) None [source]¶
Project the value of a property from per-observation data to per-group data.
The input annotated adata is expected to contain a per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.
This will generate a new per-observation (group) annotation in gdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.
The aggregation method (by default, metacells.utilities.computation.most_frequent()) is any function taking an array of values and returning a single value.
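The aggregation can be pictured with the following plain-Python sketch. The helpers most_common_value and aggregate_obs_to_group are hypothetical stand-ins, not the library functions, and skipping cells with a negative group index is an assumption made for this example.

```python
from collections import Counter, defaultdict
from typing import Any, Callable, Dict, List, Sequence


def most_common_value(values: Sequence[Any]) -> Any:
    # Stand-in for a "most frequent value" aggregation method.
    return Counter(values).most_common(1)[0][0]


def aggregate_obs_to_group(
    group_of_cells: Sequence[int],
    cell_values: Sequence[Any],
    groups_count: int,
    method: Callable[[Sequence[Any]], Any] = most_common_value,
) -> List[Any]:
    # Collect the values of the cells of each group.
    members: Dict[int, List[Any]] = defaultdict(list)
    for group, value in zip(group_of_cells, cell_values):
        if group >= 0:  # assumption: negative means "no group"
            members[group].append(value)
    # Reduce each group's values to a single value.
    return [
        method(members[group]) if group in members else None
        for group in range(groups_count)
    ]


cell_group = [0, 0, 0, 1, 1, -1]
cell_type = ["T", "T", "B", "NK", "NK", "B"]
metacell_type = aggregate_obs_to_group(cell_group, cell_type, groups_count=2)
print(metacell_type)
```

Any other reduction, such as max or a mean, can be passed as the method in the same way.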
- metacells.tools.convey.convey_obs_fractions_to_group(*, adata: AnnData, gdata: AnnData, group: str = 'metacell', property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None) None [source]¶
Similar to convey_obs_to_group, but create a per-metacell property for each value of the per-cell property, storing the fraction of cells of the metacell that had that value.
The input annotated adata is expected to contain a per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.
This will generate multiple new per-observation (group) annotations in gdata, named <to_property_name>_fraction_of_<value> (by default, the to_property_name is the same as property_name), each containing the fraction of the cells of the metacell that have the specific property value.
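A small sketch of the resulting annotations, using plain lists. The helper fractions_by_group is hypothetical; the annotation naming follows the <to_property_name>_fraction_of_<value> pattern described above, while ignoring cells with a negative group index is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def fractions_by_group(
    group_of_cells: Sequence[int],
    cell_values: Sequence[Any],
    groups_count: int,
    to_property_name: str,
) -> Dict[str, List[float]]:
    # Count how many cells of each group carry each value.
    counts = [Counter() for _ in range(groups_count)]
    for group, value in zip(group_of_cells, cell_values):
        if group >= 0:  # assumption: negative means "no group"
            counts[group][value] += 1

    result: Dict[str, List[float]] = {}
    for value in sorted(set(cell_values)):
        fractions = []
        for group_counts in counts:
            total = sum(group_counts.values())
            fractions.append(group_counts[value] / total if total > 0 else 0.0)
        result[f"{to_property_name}_fraction_of_{value}"] = fractions
    return result


cell_group = [0, 0, 0, 0, 1, 1]
cell_type = ["T", "T", "T", "B", "B", "B"]
type_fractions = fractions_by_group(cell_group, cell_type, 2, "type")
print(type_fractions)
```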
- metacells.tools.convey.convey_obs_obs_to_group_group(*, adata: ~anndata._core.anndata.AnnData, gdata: ~anndata._core.anndata.AnnData, group: str = 'metacell', property_name: str, formatter: ~typing.Callable[[~typing.Any], ~typing.Any] | None = None, to_property_name: str | None = None, method: ~typing.Callable[[~numpy.ndarray | ~metacells.utilities.typing.CompressedMatrix | ~metacells.utilities.typing.PandasFrame | ~metacells.utilities.typing.SparseMatrix], ~typing.Any] = <function nanmean_matrix>) None [source]¶
Project the value of a property from per-observation-per-observation data to per-group-per-group data.
The input annotated adata is expected to contain a per-observation-per-observation (cell) annotation named property_name and also a per-observation annotation named group which identifies the group each observation (cell) belongs to, which must be an integer.
This will generate a new per-observation-per-observation (group) annotation in gdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.
The aggregation method (by default, metacells.utilities.computation.nanmean_matrix()) is any function taking a matrix of values and returning a single value.
Filtering the Data¶
Filter¶
- metacells.tools.filter.filter_data(adata: AnnData, obs_masks: List[str] = [], var_masks: List[str] = [], *, mask_obs: str | None = None, mask_var: str | None = None, invert_obs: bool = False, invert_var: bool = False, track_obs: str | None = None, track_var: str | None = None, name: str | None = None, top_level: bool = True) Tuple[AnnData, PandasSeries, PandasSeries] | None [source]¶
Filter (slice) the data based on previously-computed masks.
For example, it is useful to discard cell-cycle genes, cells which have too few UMIs for meaningful analysis, etc. In general, the “best” filter depends on the data set.
This function makes it easy to combine different pre-computed per-observation (cell) and per-variable (gene) boolean mask annotations into a final overall inclusion mask, and slice the data accordingly, while tracking the base index of the cells and genes in the filtered data.
Input
Annotated adata, where the observations are cells and the variables are genes.
Returns
An annotated data containing a subset of the observations (cells) and variables (genes).
If no observations and/or no variables were selected by the filter, returns None.
If name is not specified, the returned data will be unnamed. Otherwise, if the name starts with a “.”, it will be appended to the current name (if any). Otherwise, name is the new name.
If mask_obs and/or mask_var are specified, store the mask of the selected data as a per-observation and/or per-variable annotation of the full adata.
If track_obs and/or track_var are specified, store the original indices of the selected data as a per-observation and/or per-variable annotation of the result data.
Computation Parameters
Combine the masks in obs_masks and/or var_masks using metacells.tools.mask.combine_masks(), passing it invert_obs and invert_var, and mask_obs and mask_var as the to parameter. If either list of masks is empty, use the full mask.
If the obtained mask for either the observations or the variables is empty, return None. Otherwise, return a slice of the full data containing just the observations and variables specified by the final masks.
Mask¶
- metacells.tools.mask.combine_masks(adata: AnnData, masks: Collection[str], *, invert: bool = False, to: str | None = None) PandasSeries | None [source]¶
Combine different pre-computed masks into a final overall mask.
Input
Annotated adata, where the observations are cells and the variables are genes.
Returns
If to (default: None) is None, returns the computed mask. Otherwise, sets the mask as an annotation (per-variable or per-observation depending on the type of the combined masks).
Computation Parameters
For each of the masks in masks, in order (left to right), fetch it. Silently ignore missing masks if the name has a ? suffix. If the first character of the mask name is &, restrict the current mask; otherwise the first character must be |, and we’ll expand the mask (for the 1st mask, the mask becomes the current mask regardless of the 1st character). If the following character is ~, first invert the mask before applying it.
If invert (default: False), invert the final result mask.
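The mask-name mini-syntax can be illustrated with the following sketch, which works on plain lists of booleans held in a dictionary. The helper combine_named_masks and the example mask names are hypothetical; the sketch only follows the rules stated above (&, |, ~ and the ? suffix) and is not the library's implementation.

```python
from typing import Dict, List, Optional, Sequence


def combine_named_masks(
    available: Dict[str, List[bool]],
    names: Sequence[str],
    invert: bool = False,
) -> Optional[List[bool]]:
    current: Optional[List[bool]] = None
    for spec in names:
        # A "?" suffix marks the mask as optional.
        optional = spec.endswith("?")
        if optional:
            spec = spec[:-1]
        # The first character selects between restricting and expanding.
        operator, name = spec[0], spec[1:]
        if operator not in ("&", "|"):
            raise ValueError(f"mask name must start with & or |: {spec}")
        # A following "~" inverts the mask before it is applied.
        negate = name.startswith("~")
        if negate:
            name = name[1:]
        if name not in available:
            if optional:
                continue
            raise KeyError(name)
        mask = [not flag for flag in available[name]] if negate else list(available[name])
        if current is None:
            current = mask  # the first mask is taken as is
        elif operator == "&":
            current = [a and b for a, b in zip(current, mask)]
        else:
            current = [a or b for a, b in zip(current, mask)]
    if current is None:
        return None
    return [not flag for flag in current] if invert else current


gene_masks = {
    "properly_sampled_gene": [True, True, False, True],
    "excluded_gene": [False, True, False, False],
}
selected = combine_named_masks(
    gene_masks,
    ["&properly_sampled_gene", "&~excluded_gene", "&lateral_gene?"],
)
print(selected)
```

The third name refers to a mask that does not exist in this example, and is ignored because of its ? suffix.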
Properly Sampled¶
- metacells.tools.properly_sampled.compute_excluded_gene_umis(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__') None [source]¶
Given an excluded_gene mask, compute the total excluded_umis of each cell.
- metacells.tools.properly_sampled.find_properly_sampled_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_cell_total: int | None, max_cell_total: int | None, max_excluded_genes_fraction: float | None, inplace: bool = True) PandasSeries | None [source]¶
Detect cells with a “proper” amount of what (default: __x__) data.
Due to both technical effects and natural variance between cells, the total number of UMIs varies from cell to cell. We often would like to work on cells that contain a sufficient number of UMIs for meaningful analysis; we sometimes also wish to exclude cells which have “too many” UMIs.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Observation (Cell) Annotations
properly_sampled_cell
A boolean mask indicating whether each cell has a “proper” amount of UMIs.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the observation names).
Computation Parameters
Exclude all cells whose total data is less than the min_cell_total (no default), unless it is None.
Exclude all cells whose total data is more than the max_cell_total (no default), unless it is None.
If max_excluded_genes_fraction (no default) is not None, then exclude all cells whose sum of the excluded data (as defined by the excluded_gene mask) divided by the total data is more than the specified threshold.
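The three thresholds can be sketched on a small cells-by-genes list of UMI counts. The helper properly_sampled_mask is hypothetical, and the sketch assumes the total of a cell includes the UMIs of the excluded genes; the description above does not say whether the library does the same.

```python
from typing import List, Optional, Sequence


def properly_sampled_mask(
    umis: Sequence[Sequence[int]],
    excluded_gene: Sequence[bool],
    min_cell_total: Optional[int],
    max_cell_total: Optional[int],
    max_excluded_genes_fraction: Optional[float],
) -> List[bool]:
    mask = []
    for cell in umis:
        total = sum(cell)  # assumption: total over all genes, excluded or not
        excluded = sum(count for count, flag in zip(cell, excluded_gene) if flag)
        proper = True
        if min_cell_total is not None and total < min_cell_total:
            proper = False
        if max_cell_total is not None and total > max_cell_total:
            proper = False
        if (
            max_excluded_genes_fraction is not None
            and total > 0
            and excluded / total > max_excluded_genes_fraction
        ):
            proper = False
        mask.append(proper)
    return mask


umis = [
    [5, 0, 1],        # too few UMIs
    [40, 10, 50],     # fine
    [300, 100, 200],  # too many UMIs
    [10, 80, 10],     # dominated by the excluded gene
]
proper_cells = properly_sampled_mask(
    umis,
    excluded_gene=[False, True, False],
    min_cell_total=10,
    max_cell_total=500,
    max_excluded_genes_fraction=0.25,
)
print(proper_cells)
```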
- metacells.tools.properly_sampled.find_properly_sampled_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_total: int = 1, inplace: bool = True) PandasSeries | None [source]¶
Detect genes with a “proper” amount of what (default: __x__) data.
Due to both technical effects and natural variance between genes, the expression of genes varies greatly between cells. This is exactly the information we are trying to analyze. We often would like to work on genes that have a sufficient level of expression for meaningful analysis. Specifically, it doesn’t make sense to analyze genes that have zero expression in all the cells.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
properly_sampled_gene
A boolean mask indicating whether each gene has a “proper” number of UMIs.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Exclude all genes whose total data is less than the min_gene_total (default: 1).
Named¶
- metacells.tools.named.find_named_genes(adata: AnnData, *, name_property: str | None = None, names: Collection[str] | None = None, patterns: Collection[str | Pattern] | None = None, to: str | None = None, invert: bool = False, op: str = 'set') PandasSeries | None [source]¶
Find genes by their (case-insensitive) name.
This computes a mask of all the genes whose name appears in names or matches any of the patterns. If invert (default: False), invert the resulting mask.
Depending on op, this will set a mask (compute a brand new one), add the result to a mask (which must exist), or remove genes from a mask (which must exist).
If name_property is specified, the mask will be based on this per-variable (gene) property.
If to (default: None) is specified, this is stored as a per-variable (gene) annotation with that name, and the function returns None. This is useful to fill gene masks such as excluded_genes (genes which should be excluded from the rest of the processing), lateral_genes (genes which must not be selected for metacell computation) and noisy_genes (genes which are given more leeway when computing deviant cells).
Otherwise, the mask is returned as a pandas series (indexed by the variable, that is gene, names).
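A minimal sketch of case-insensitive name matching, using only the standard re module. The helper named_genes_mask is hypothetical, and requiring a pattern to match the whole gene name (fullmatch) is an assumption of this example; the description above does not specify how the library anchors its patterns.

```python
import re
from typing import List, Pattern, Sequence, Union


def named_genes_mask(
    gene_names: Sequence[str],
    names: Sequence[str] = (),
    patterns: Sequence[Union[str, Pattern]] = (),
    invert: bool = False,
) -> List[bool]:
    wanted = {name.upper() for name in names}
    compiled = [
        re.compile(pattern, re.IGNORECASE) if isinstance(pattern, str) else pattern
        for pattern in patterns
    ]
    mask = []
    for gene in gene_names:
        # Assumption: a pattern must match the whole gene name.
        hit = gene.upper() in wanted or any(p.fullmatch(gene) for p in compiled)
        mask.append(hit != invert)
    return mask


genes = ["MT-CO1", "Gapdh", "RPL13", "XIST", "rps6"]
excluded = named_genes_mask(genes, names=["xist"], patterns=["MT-.*", "RP[LS].*"])
print(excluded)
```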
High¶
- metacells.tools.high.find_high_total_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_total: int, inplace: bool = True) PandasSeries | None [source]¶
Find genes which have a high total number of what (default: __x__) data.
This should typically only be applied to downsampled data to ensure that variance in sampling depth does not affect the result.
Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
high_total_gene
A boolean mask indicating whether each gene was found to have a high total.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Use metacells.utilities.computation.sum_per() to get the total UMIs of each gene.
Select the genes whose total is at least min_gene_total.
- metacells.tools.high.find_high_topN_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, topN: int, min_gene_topN: int, inplace: bool = True) PandasSeries | None [source]¶
Find genes which have a high top-Nth value of what (default: __x__) data.
This should typically only be applied to downsampled data to ensure that variance in sampling depth does not affect the result.
Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
high_top<topN>_gene
A boolean mask indicating whether each gene was found to have a high top-Nth value.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Use metacells.utilities.computation.top_per() to get the top-Nth UMIs of each gene.
Select the genes whose top-Nth value is at least min_gene_topN.
- metacells.tools.high.find_high_fraction_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_fraction: float = 1e-05, inplace: bool = True) PandasSeries | None [source]¶
Find genes which have a high fraction of the total what (default: __x__) data of the cells.
Genes with too-low expression are typically excluded from computations. In particular, genes may have all-zero expression, in which case including them just slows the computations (and triggers numeric edge cases).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
high_fraction_gene
A boolean mask indicating whether each gene was found to have a high fraction.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Use metacells.utilities.computation.fraction_per() to get the fraction of each gene.
Select the genes whose fraction is at least min_gene_fraction (default: 1e-05).
- metacells.tools.high.find_high_normalized_variance_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_normalized_variance: float = 5.656854249492381, inplace: bool = True) PandasSeries | None [source]¶
Find genes which have a high normalized variance of what (default: __x__) data.
The normalized variance measures the variance / mean of each gene. See metacells.utilities.computation.normalized_variance_per() for details.
Genes with a high normalized variance are “bursty”, that is, have significantly different expression levels in different cells.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
high_normalized_variance_gene
A boolean mask indicating whether each gene was found to have a high normalized variance.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Use metacells.utilities.computation.normalized_variance_per() to get the normalized variance of each gene.
Select the genes whose normalized variance is at least min_gene_normalized_variance (default: 5.656854249492381).
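To make the variance / mean ratio concrete, here is a small sketch using the standard statistics module. The helper normalized_variance is hypothetical; using the population variance and returning zero for all-zero genes are assumptions of this example, and the library's normalized_variance_per() may differ in such details. The default threshold happens to equal 2 ** 2.5.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def normalized_variance(values: Sequence[float]) -> float:
    mean = fmean(values)
    if mean <= 0:
        return 0.0  # assumption: all-zero genes get a zero ratio
    # Assumption: population variance, divided by the mean.
    return pvariance(values, mu=mean) / mean


threshold = 2 ** 2.5  # numerically the same as the 5.656854249492381 default

steady_gene = [2, 2, 2, 2]    # same count in every cell
bursty_gene = [0, 0, 0, 16]   # all the UMIs in a single cell

ratios: List[float] = [normalized_variance(steady_gene), normalized_variance(bursty_gene)]
high_mask = [ratio >= threshold for ratio in ratios]
print(ratios, high_mask)
```

The bursty gene has a mean of 4 and a variance of 48, giving a ratio of 12, which is above the threshold, while the steady gene has a ratio of zero.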
- metacells.tools.high.find_high_relative_variance_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_relative_variance: float = 0.1, window_size: int = 100, inplace: bool = True) PandasSeries | None [source]¶
Find genes which have a high relative variance of what (default: __x__) data.
The relative variance measures the variance / mean of each gene relative to the other genes with a similar level of expression. See metacells.utilities.computation.relative_variance_per() for details.
Genes with a high relative variance are good candidates for being selected as “marker genes”, that is, to be used to compute the similarity between cells. Using the relative variance compensates for the bias for selecting higher-expression genes, whose normalized variance can be larger due to random noise alone.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
high_relative_variance_gene
A boolean mask indicating whether each gene was found to have a high relative variance.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Use metacells.utilities.computation.relative_variance_per() to get the relative variance of each gene.
Select the genes whose relative variance is at least min_gene_relative_variance (default: 0.1).
- metacells.tools.high.find_metacells_marker_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_range_fold: float = 2.0, regularization: float = 1e-05, min_max_gene_fraction: float = 0.0001, inplace: bool = True) PandasSeries | None [source]¶
Find “marker” genes which have a significant signal in metacells data. This computation is too unreliable to be used on cells.
Find genes which have a high maximal expression in at least one metacell, and a wide range of expression across the metacells. Such genes are good candidates for being used as marker genes and/or to compute distances between metacells.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
marker_gene
A boolean mask indicating whether each gene is a “marker”.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
Compute the minimal and maximal expression level of each gene.
Select the genes whose fold factor (log2 of maximal over minimal value, using the regularization (default: 1e-05)) is at least min_gene_range_fold (default: 2.0).
Select the genes whose maximal expression is at least min_max_gene_fraction (default: 0.0001).
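The two selection criteria can be sketched on a small metacells-by-genes list of expression fractions. The helper marker_genes_mask is hypothetical, and adding the regularization to both the maximal and the minimal value before taking the ratio is an assumption about how the regularization is applied.

```python
from math import log2
from typing import List, Sequence


def marker_genes_mask(
    fractions: Sequence[Sequence[float]],
    min_gene_range_fold: float = 2.0,
    regularization: float = 1e-5,
    min_max_gene_fraction: float = 1e-4,
) -> List[bool]:
    mask = []
    for gene in range(len(fractions[0])):
        column = [row[gene] for row in fractions]
        low, high = min(column), max(column)
        # Assumption: regularize both values before computing the fold factor.
        fold = log2((high + regularization) / (low + regularization))
        mask.append(fold >= min_gene_range_fold and high >= min_max_gene_fraction)
    return mask


metacell_fractions = [
    [1e-3, 2.0e-4, 1e-6],
    [1e-4, 2.5e-4, 1e-7],
    [5e-4, 3.0e-4, 5e-5],
]
markers = marker_genes_mask(metacell_fractions)
print(markers)
```

In this example the first gene passes both tests, the second has too narrow a range, and the third has a wide range but too low a maximal expression.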
Noisy Lonely¶
- metacells.tools.bursty_lonely.find_bursty_lonely_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, max_sampled_cells: int = 10000, downsample_min_samples: int = 750, downsample_min_cell_quantile: float = 0.5, downsample_max_cell_quantile: float = 0.05, min_gene_total: int = 100, min_gene_normalized_variance: float = 5.656854249492381, max_gene_similarity: float = 0.1, inplace: bool = True, random_seed: int) PandasSeries | None [source]¶
Detect “bursty lonely” genes based on what (default: __x__) data.
Find genes which are “bursty” (have high variance compared to their mean) and also “lonely” (have low correlation with all other genes). Such genes should be excluded since they will never meaningfully help us compute groups, and will actively cause profiles to be considered “deviants”.
Bursty genes have high expression and variance. Lonely genes have no (or low) correlations with any other gene. Bursty lonely genes tend to throw off clustering algorithms. In general, such algorithms try to group together cells with the same overall biological state. Since the genes are lonely, they don’t contribute towards this goal. Since they are bursty, they actively hamper this, because they make cells which are otherwise similar appear different (just for this lonely gene).
It is therefore useful to explicitly identify, in a pre-processing step, the (few) such genes, and exclude them from the rest of the analysis.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
bursty_lonely_genes
A boolean mask indicating whether each gene was found to be a “bursty lonely” gene.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the variable names).
Computation Parameters
If we have more than max_sampled_cells (default: 10000), pick this number of random cells. Specify a non-zero random seed to make this reproducible.
Invoke metacells.tools.downsample.downsample_cells() to downsample the cells to the same total number of UMIs, using the downsample_min_samples (default: 750), downsample_min_cell_quantile (default: 0.5), downsample_max_cell_quantile (default: 0.05) and the random_seed.
Find “bursty” genes which have a total number of UMIs of at least min_gene_total (default: 100) and a normalized variance of at least min_gene_normalized_variance (default: 5.656854249492381).
Cross-correlate the bursty genes.
Find the bursty “lonely” genes whose maximal correlation is at most max_gene_similarity (default: 0.1) with all other genes.
Rare¶
- metacells.tools.rare.find_rare_gene_modules(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, max_genes: int = 500, max_gene_cell_fraction: float = 0.001, min_gene_maximum: int = 7, genes_similarity_method: str = 'repeated_pearson', genes_cluster_method: str = 'ward', min_genes_of_modules: int = 4, min_cells_of_modules: int = 12, target_metacell_size: float = 48, target_pile_size: int = 8000, max_cells_factor_of_random_pile: float = 0.5, min_module_correlation: float = 0.1, min_related_gene_fold_factor: float = 7, max_related_gene_increase_factor: float = 4.0, min_cell_module_total: int = 4, inplace: bool = True, reproducible: bool) Tuple[PandasFrame, PandasFrame] | None [source]¶
Detect rare genes modules based on
what
(default: __x__) data.Rare gene modules include genes which are weakly and rarely expressed, yet are highly correlated with each other, allowing for robust detection. Global analysis algorithms (such as metacells) tend to ignore or at least discount such genes.
It is therefore useful to explicitly identify, in a pre-processing step, the few cells which express such rare gene modules. Once identified, these cells can be exempt from the global algorithm, or the global algorithm can be tweaked in some way to pay extra attention to them.
If
reproducible
isTrue
, a slower (still parallel) but reproducible algorithm will be used to compute pearson correlations.Input
Annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Obeys (ignores the genes of) the
lateral_gene
per-gene (variable) annotation, if any.Returns
- Observation (Cell) Annotations
cells_rare_gene_module
The index of the rare gene module each cell expresses the most, or
-1
in the common case it does not express any rare genes module.rare_cell
A boolean mask for the (few) cells that express a rare gene module.
- Variable (Gene) Annotations
rare_gene
A boolean mask for the genes in any of the rare gene modules.
rare_gene_module
The index of the rare gene module a gene belongs to (-1 for non-rare genes).
If
inplace
, these are written to to the data, and the function returnsNone
. Otherwise they are returned as a tuple containing two data frames.
Computation Parameters
1. Pick as candidates all genes that are expressed in at most max_gene_cell_fraction (default: 0.001) of the cells, and whose maximal value in a cell is at least min_gene_maximum (default: 7). If lateral_gene masks exist, exclude them from the candidates. Out of the candidates, pick at most max_genes (default: 500) which are expressed in the fewest cells.
2. Compute the similarity between the genes using metacells.tools.similarity.compute_var_var_similarity() using the genes_similarity_method (default: repeated_pearson).
3. Create a hierarchical clustering of the candidate genes using the genes_cluster_method (default: ward).
4. Identify gene modules in the hierarchical clustering which contain at least min_genes_of_modules genes (default: 4), with an average gene-gene cross-correlation of at least min_module_correlation (default: 0.1).
5. Consider cells expressing any of the genes in the gene module. If the expected number of such cells in each random pile of size target_pile_size (default: 8000), whose total number of UMIs of the rare gene module is at least min_cell_module_total (default: 4), is more than the max_cells_factor_of_random_pile (default: 0.5) as a fraction of the target metacells size, then discard the rare gene module as not that rare after all.
6. Add to the gene module all genes whose fraction in cells expressing any of the genes in the rare gene module is at least 2^min_related_gene_fold_factor (default: 7) times their fraction in the rest of the population, as long as their maximal value in one of the expressing cells is at least min_gene_maximum, as long as this doesn’t add more than max_related_gene_increase_factor times the original number of cells to the rare gene module, and as long as they are not listed in the lateral_gene masks. If a gene is above the threshold for multiple gene modules, associate it with the gene module for which its fold factor is highest.
7. Associate cells with the rare gene module if they contain at least min_cell_module_total (default: 4) UMIs of the expanded rare gene module. If a cell meets the above threshold for several rare gene modules, it is associated with the one for which it contains more UMIs.
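The first step above (picking the candidate genes) can be sketched with plain NumPy. This is only an illustration of the rule as described, not the library's implementation; the function name pick_rare_gene_candidates and the dense cells-by-genes umis matrix are assumptions made for the example, and lateral gene masks are not handled.

```python
import numpy as np

def pick_rare_gene_candidates(umis, max_gene_cell_fraction=0.001,
                              min_gene_maximum=7, max_genes=500):
    """Return indices of candidate rare genes from a cells x genes UMIs matrix.

    Hypothetical helper: it only mirrors the textual description of step 1.
    """
    cells_count = umis.shape[0]
    # Fraction of the cells in which each gene has at least one UMI.
    expressing_fraction = (umis > 0).sum(axis=0) / cells_count
    max_per_gene = umis.max(axis=0)
    is_candidate = ((expressing_fraction <= max_gene_cell_fraction)
                    & (max_per_gene >= min_gene_maximum))
    candidates = np.flatnonzero(is_candidate)
    # Prefer the genes that are expressed in the fewest cells.
    order = np.argsort(expressing_fraction[candidates], kind="stable")
    return candidates[order[:max_genes]]

umis = np.zeros((2000, 3), dtype=int)
umis[0, 0] = 9        # gene 0: a single cell with a high value
umis[:500, 1] = 9     # gene 1: expressed in too many cells
umis[0, 2] = 2        # gene 2: rare, but never reaches min_gene_maximum
rare = pick_rare_gene_candidates(umis)
```

Only gene 0 satisfies both conditions here, so it is the sole candidate.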
Building the Graph¶
Downsample¶
- metacells.tools.downsample.downsample_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, downsample_min_cell_quantile: float = 0.05, downsample_min_samples: float = 750, downsample_max_cell_quantile: float = 0.5, inplace: bool = True, random_seed: int) Tuple[int, PandasFrame] | None [source]¶
Downsample the values of what (default: __x__) data.
Downsampling is an effective way to get the same number of samples in multiple cells (that is, the same number of total UMIs in multiple cells), and serves as an alternative to normalization (e.g., working with UMI fractions instead of raw UMI counts).
Downsampling is especially important when computing correlations between cells. When there is high variance between the total UMI count in different cells, then normalization will return higher correlation values between cells with a higher total UMI count, which will result in an inflated estimation of their similarity to other cells. Downsampling avoids this effect.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Unstructured Annotations
downsample_samples
The target total number of samples in each downsampled cell.
- Variable-Observation (Gene-Cell) Annotations
downsampled
The downsampled data where the total number of samples in each cell is at most downsample_samples.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a tuple with the samples and a pandas data frame (indexed by the cell and gene names).
Computation Parameters
1. Compute the total samples in each cell.
2. Decide on the value to downsample to. We would like all cells to end up with at least some reasonable number of samples (total UMIs), downsample_min_samples (default: 750). We’d also like all (most) cells to end up with the highest reasonable downsampled total number of samples, so if possible we increase the number of samples, as long as at most downsample_min_cell_quantile (default: 0.05) of the cells will have a lower number of samples. We’d also like all (most) cells to end up with the same downsampled total number of samples, so if we have to we decrease the number of samples to ensure at most downsample_max_cell_quantile (default: 0.5) of the cells will have a lower number of samples.
3. Downsample each cell so that it has at most the selected number of samples. Specify a non-zero random_seed to make this reproducible.
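The choice of the target and the per-cell downsampling can be sketched as follows. This is a minimal reading of the steps above using NumPy only; the helper names are hypothetical, and drawing UMIs without replacement through a multivariate hypergeometric sample is an assumption about what "downsample" means here, not a statement about the library's internals.

```python
import numpy as np

def choose_downsample_samples(total_per_cell, min_samples=750,
                              min_cell_quantile=0.05, max_cell_quantile=0.5):
    """Pick the target number of samples following the description above."""
    low = np.quantile(total_per_cell, min_cell_quantile)
    high = np.quantile(total_per_cell, max_cell_quantile)
    # Raise the minimum if few cells would fall below it, but never go above
    # the total of the max_cell_quantile cell.
    return int(round(min(max(min_samples, low), high)))

def downsample_cell(cell_umis, samples, rng):
    """Draw `samples` UMIs without replacement; small cells are kept as-is."""
    if cell_umis.sum() <= samples:
        return cell_umis.copy()
    return rng.multivariate_hypergeometric(cell_umis, samples)

rng = np.random.default_rng(123456)
umis = rng.poisson(2.0, size=(20, 1000))   # toy cells x genes UMIs matrix
samples = choose_downsample_samples(umis.sum(axis=1))
downsampled = np.array([downsample_cell(cell, samples, rng) for cell in umis])
```

Every cell whose total is at least the target ends up with exactly that many UMIs, and no gene ever gains UMIs.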
Cross-Similarity¶
- metacells.tools.similarity.compute_obs_obs_similarity(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, method: str = 'abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, top: int | None = None, bottom: int | None = None, inplace: bool = True, reproducible: bool) PandasFrame | None [source]¶
Compute a measure of the similarity between the observations (cells) of what (default: __x__).
If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.
The method (default: abs_pearson) can be one of:
* pearson for computing Pearson correlation.
* abs_pearson for computing the absolute Pearson correlation.
* repeated_pearson for computing correlations-of-correlations.
* repeated_abs_pearson for computing absolute correlations-of-correlations.
* logistics for computing the logistics function.
* logistics_pearson for computing correlations-of-logistics.
* logistics_abs_pearson for computing absolute correlations-of-logistics.
If using the logistics function, use the logistics_slope (default: 0.5) and logistics_location (default: 0.8).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Observations-Pair (cells) Annotations
obs_similarity
A square matrix where each entry is the similarity between a pair of cells.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the observation names).
Computation Parameters
1. If method (default: abs_pearson) is logistics or logistics_pearson, compute the mean value of the logistics function between the variables of each pair of observations (cells). Otherwise, it should be pearson or repeated_pearson, so compute the cross-correlation between all the observations.
2. If the method is logistics_pearson or repeated_pearson, then compute the cross-correlation of the results of the previous step. That is, two observations (cells) will be similar if they are similar to the rest of the observations (cells) in the same way. This compensates for the extreme sparsity of the data.
3. If top and/or bottom are specified, keep just this number of most-similar and/or least-similar values in each row (turning the result into a compressed matrix format).
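As a rough illustration of the repeated_pearson method and of the top pruning, the sketch below uses numpy.corrcoef on a small dense matrix. The helper names are hypothetical, the logistics variants are not shown, and a dense array stands in for the compressed matrix the tool would return.

```python
import numpy as np

def repeated_pearson(data):
    """Correlations-of-correlations between the rows of `data`."""
    first = np.corrcoef(data)     # Pearson correlation between every pair of rows
    return np.corrcoef(first)     # correlate the rows' correlation profiles

def keep_top_per_row(similarity, top):
    """Zero everything except the `top` most-similar entries of each row."""
    result = np.zeros_like(similarity)
    columns = np.argsort(-similarity, axis=1)[:, :top]
    rows = np.arange(similarity.shape[0])[:, None]
    result[rows, columns] = similarity[rows, columns]
    return result

rng = np.random.default_rng(0)
cells = rng.poisson(3.0, size=(6, 50)).astype(float)   # toy cells x genes data
similarity = repeated_pearson(cells)
pruned = keep_top_per_row(similarity, top=3)
```

The same two helpers apply to genes by passing the transposed matrix.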
- metacells.tools.similarity.compute_var_var_similarity(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, method: str = 'abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, top: int | None = None, bottom: int | None = None, inplace: bool = True, reproducible: bool) PandasFrame | None [source]¶
Compute a measure of the similarity between the variables (genes) of what (default: __x__).
If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
The method (default: abs_pearson) can be one of:
* pearson for computing Pearson correlation.
* abs_pearson for computing the absolute Pearson correlation.
* repeated_pearson for computing correlations-of-correlations.
* repeated_abs_pearson for computing absolute correlations-of-correlations.
* logistics for computing the logistics function.
* logistics_pearson for computing correlations-of-logistics.
* logistics_abs_pearson for computing absolute correlations-of-logistics.
If using the logistics function, use the logistics_slope (default: 0.5) and logistics_location (default: 0.8).
Returns
- Variable-Pair (genes) Annotations
var_similarity
A square matrix where each entry is the similarity between a pair of genes.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the variable names).
Computation Parameters
1. If method (default: abs_pearson) is logistics or logistics_pearson, compute the mean value of the logistics function between the observations of each pair of variables (genes). Otherwise, it should be pearson or repeated_pearson, so compute the cross-correlation between all the variables.
2. If the method is logistics_pearson or repeated_pearson, then compute the cross-correlation of the results of the previous step. That is, two variables (genes) will be similar if they are similar to the rest of the variables (genes) in the same way. This compensates for the extreme sparsity of the data.
3. If top and/or bottom are specified, keep just this number of most-similar and/or least-similar values in each row (turning the result into a compressed matrix format).
K-Nearest-Neighbors Graph¶
- metacells.tools.knn_graph.compute_obs_obs_knn_graph(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'obs_similarity', *, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 2, inplace: bool = True) PandasFrame | None [source]¶
Compute a directed K-Nearest-Neighbors graph based on what (default: obs_similarity) similarity data for each pair of observations (cells).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-observation-per-observation matrix or the name of a per-observation-per-observation annotation containing such a matrix.
Returns
- Observations-Pair Annotations
obs_outgoing_weights
A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the observation names).
Computation Parameters
1. Use the obs_similarity and convert it to ranks (in descending order). This gives us a dense asymmetric obs_outgoing_ranks matrix.
2. Convert the asymmetric outgoing ranks matrix into a symmetric obs_balanced_ranks matrix by element-wise multiplying it with its transpose and taking the square root. That is, for an edge to have a high balanced rank, the geomean of its outgoing ranks has to be high in both nodes it connects.
Note
This can drastically reduce the degree of the nodes, since to survive an edge needs to have been in the top ranks for both its nodes (as multiplying with zero drops the edge). This is why the balanced_ranks_factor needs to be large-ish.
3. Keep only balanced ranks of geomean of up to k * balanced_ranks_factor (default: k * 3.1622776601683795). This does a preliminary pruning of low-quality edges.
4. Prune the edges, keeping only the k * incoming_degree_factor (default: k * 3.0) highest-ranked incoming edges for each node, and then only the k * outgoing_degree_factor (default: k * 1.0) highest-ranked outgoing edges for each node, while ensuring that the highest-balanced-ranked outgoing edge of each node is preserved. This gives us an asymmetric obs_pruned_ranks matrix, which has the structure we want, but not the correct edge weights yet.
Note
Balancing the ranks, and then pruning the incoming edges, ensures that “hub” nodes, that is nodes that many other nodes prefer to connect with, end up connected to a limited number of such “spoke” nodes.
5. If there is any node which is left with an out degree of less than min_outgoing_degree (default: 2), increase K by 10% and repeat steps 2-4.
6. Normalize the outgoing edge weights by dividing them with the sum of their balanced ranks, such that the sum of the outgoing edge weights for each node is 1. Note that there is always at least one outgoing edge for each node. This gives us the obs_outgoing_weights for our directed K-Nearest-Neighbors graph.
Note
Ensuring each node has at least one outgoing edge allows us to always have at least one candidate grouping to add it to. This of course doesn’t protect the node from being rejected by its group as deviant.
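The rank balancing and the final normalization can be sketched on a small dense similarity matrix. This is a simplified illustration, not the library's algorithm: it assumes rank 1 is the most similar neighbor, converts balanced ranks to weights with a simple reciprocal, and skips both the incoming/outgoing degree pruning and the "increase K" retry. The function name knn_outgoing_weights is hypothetical.

```python
import numpy as np

def knn_outgoing_weights(similarity, k, balanced_ranks_factor=np.sqrt(10.0)):
    """Simplified balanced-ranks graph; every row of the result sums to 1."""
    size = similarity.shape[0]
    masked = similarity.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)            # a node is not its own neighbor
    # Assumption: rank 1 is the most similar neighbor in each row.
    order = np.argsort(-masked, axis=1)
    outgoing_ranks = np.empty((size, size))
    rows = np.arange(size)[:, None]
    outgoing_ranks[rows, order] = np.arange(1, size + 1)
    # Symmetric geometric mean of the ranks in both directions.
    balanced_ranks = np.sqrt(outgoing_ranks * outgoing_ranks.T)
    keep = balanced_ranks <= k * balanced_ranks_factor
    np.fill_diagonal(keep, False)
    # Always preserve the best-balanced outgoing edge of each node.
    off_diagonal = balanced_ranks.copy()
    np.fill_diagonal(off_diagonal, np.inf)
    keep[np.arange(size), off_diagonal.argmin(axis=1)] = True
    # Assumption: a better (lower) balanced rank gets a larger weight.
    weights = np.where(keep, 1.0 / balanced_ranks, 0.0)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
points = rng.normal(size=(12, 4))
similarity = np.corrcoef(points)
weights = knn_outgoing_weights(similarity, k=3)
```

Because the best-balanced edge of every node is always kept, no row is empty and the normalization is well defined.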
- metacells.tools.knn_graph.compute_var_var_knn_graph(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'var_similarity', *, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 2, inplace: bool = True) PandasFrame | None [source]¶
Compute a directed K-Nearest-Neighbors graph based on what (default: var_similarity) similarity data for each pair of variables (genes).
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-variable matrix or the name of a per-variable-per-variable annotation containing such a matrix.
Returns
- Variables-Pair Annotations
var_outgoing_weights
A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas data frame (indexed by the variable names).
Computation Parameters
1. Use the var_similarity and convert it to ranks (in descending order). This gives us a dense asymmetric var_outgoing_ranks matrix.
2. Convert the asymmetric outgoing ranks matrix into a symmetric var_balanced_ranks matrix by element-wise multiplying it with its transpose and taking the square root. That is, for an edge to have a high balanced rank, the geomean of its outgoing ranks has to be high in both nodes it connects.
3. Keep only balanced ranks of up to k * k * balanced_ranks_factor (default factor: 3.1622776601683795). This does a preliminary pruning of low-quality edges.
4. Prune the edges, keeping only the k * incoming_degree_factor (default: k * 3.0) highest-ranked incoming edges for each node, and then only the k * outgoing_degree_factor (default: k * 1.0) highest-ranked outgoing edges for each node, while ensuring that the highest-balanced-ranked outgoing edge of each node is preserved. This gives us an asymmetric var_pruned_ranks matrix, which has the structure we want, but not the correct edge weights yet.
Note
Balancing the ranks, and then pruning the incoming edges, ensures that “hub” nodes, that is nodes that many other nodes prefer to connect with, end up connected to a limited number of such “spoke” nodes.
5. If there is any node which is left with an out degree of less than min_outgoing_degree (default: 2), increase K by 10% and repeat steps 2-4.
6. Normalize the outgoing edge weights by dividing them with the sum of their balanced ranks, such that the sum of the outgoing edge weights for each node is 1. Note that there is always at least one outgoing edge for each node. This gives us the var_outgoing_weights for our directed K-Nearest-Neighbors graph.
Note
Ensuring each node has at least one outgoing edge allows us to always have at least one candidate grouping to add it to. This of course doesn’t protect the node from being rejected by its group as deviant.
Computing the Metacells¶
Candidates¶
- metacells.tools.candidates.compute_candidate_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'obs_outgoing_weights', *, target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, cooldown_pass: float = 0.02, cooldown_node: float = 0.25, cooldown_phase: float = 0.75, increase_phase: float = 1.01, min_split_size_factor: float = 2.0, max_merge_size_factor: float = 0.5, max_split_min_cut_strength: float = 0.1, min_cut_seed_cells: int = 7, must_complete_cover: bool = False, random_seed: int, inplace: bool = True) PandasSeries | None [source]¶
Assign observations (cells) to (raw, candidate) metacells based on what data (a weighted directed graph).
These candidate metacells typically go through additional vetting (e.g. deviant detection and dissolving too-small metacells) to obtain the final metacells.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-observation-per-observation matrix where each row is the outgoing weights from each observation to the rest, or just the name of a per-observation-per-observation annotation containing such a matrix. Typically this matrix will be sparse for efficient processing.
Returns
- Observation (Cell) Annotations
candidate
The integer index of the (raw, candidate) metacell each cell belongs to. The metacells are in no particular order.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas series (indexed by the observation names).
Computation Parameters
1. If cell_umis is not specified, use the sum of the what data for each cell.
2. We are trying to create metacells of size target_metacell_size cells and target_metacell_umis UMIs each. Compute the UMIs of the metacells by summing the cell_umis.
3. We start with an assignment of cells to seeds using choose_seeds() using min_seed_size_quantile (default: 0.85) and max_seed_size_quantile (default: 0.95) to compute them, picking a number of seeds such that the average metacell size would match the target.
4. We optimize the seeds using optimize_partitions() to obtain initial communities by maximizing the “stability” of the solution (probability of starting at a random node and moving either forward or backward in the graph and staying within the same metacell, divided by the probability of staying in the metacell if the edges connected random nodes). We pass it the cooldown_pass (default: 0.02) and cooldown_node (default: 0.25).
5. If min_split_size_factor (default: 2.0) is specified, split in two each community whose size is at least target_metacell_size * min_split_size_factor or whose UMIs are at least target_metacell_umis * min_split_size_factor, as long as half of the community is at least the min_metacell_size (default: 12). Then, re-optimize the solution (resulting in additional metacells). Every time we re-optimize, we multiply 1 - cooldown_pass by 1 - cooldown_phase (default: 0.75).
6. Using max_split_min_cut_strength (default: 0.1), if the minimal cut of a candidate is lower, split it into two. If one of the partitions is smaller than min_cut_seed_cells (default: 7), then mark the cells in it as outliers, or if must_complete_cover is True, skip the cut altogether.
7. Using max_merge_size_factor (default: 0.5) and min_metacell_size (default: 12), make outliers of cells of a community whose size is at most target_metacell_size * max_merge_size_factor and whose UMIs are at most target_metacell_umis * max_merge_size_factor, or that contains fewer cells than min_metacell_size. Again, re-optimize, which will assign these cells to other metacells (resulting in one fewer metacell). We again apply the cooldown_phase every time we re-optimize.
8. Repeat the above steps until all candidate metacells are in the acceptable size range.
- metacells.tools.candidates.choose_seeds(*, edge_weights: CompressedMatrix, seed_of_cells: ndarray | None = None, max_seeds_count: int, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, random_seed: int) ndarray [source]¶
Choose an initial assignment of cells to seeds based on the edge_weights.
Returns a vector assigning each node (cell) to a seed (initial community).
If seed_of_cells is specified, it is expected to contain a vector of partial seeds. Only cells which have a negative seed will be assigned a new seed. New seeds will be created so that the total number of seeds will not exceed max_seeds_count. The seed_of_cells will be modified in-place and returned.
Otherwise, a new vector is created, initialized with -1 (that is, no seed) for all nodes, filled as above, and returned.
Computation Parameters
1. We compute for each candidate node the number of nodes it is connected to (by an outgoing edge).
2. We pick as a seed a random node whose number of connected nodes (“seed size”) quantile is at least min_seed_size_quantile and at most max_seed_size_quantile. This ensures we pick seeds that aren’t too small or too large to get a good coverage of the population with a low number of seeds.
3. We assign each of the connected nodes to their seed, and discount them from the number of connected nodes of the remaining unassigned nodes.
4. We repeat this until we reach the target number of seeds.
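The seeding loop above can be sketched as a small greedy procedure. This is an illustration under stated assumptions: the graph is a dense boolean matrix rather than a compressed matrix, the quantile window is implemented by rank position among the unassigned nodes, and the name choose_seeds_sketch is hypothetical (it is not the library function).

```python
import numpy as np

def choose_seeds_sketch(edges, max_seeds_count, min_quantile=0.85,
                        max_quantile=0.95, random_seed=1):
    """edges[i, j] is True if node i has an outgoing edge to node j."""
    rng = np.random.default_rng(random_seed)
    seed_of_nodes = np.full(edges.shape[0], -1)
    for seed in range(max_seeds_count):
        free = seed_of_nodes < 0
        unassigned = np.flatnonzero(free)
        if unassigned.size == 0:
            break
        # "Seed size": connections to nodes that are still unassigned.
        seed_sizes = (edges[unassigned] & free).sum(axis=1)
        # Nodes whose seed size rank falls inside the quantile window.
        order = np.argsort(seed_sizes, kind="stable")
        first = int(np.floor(min_quantile * (order.size - 1)))
        last = int(np.ceil(max_quantile * (order.size - 1)))
        eligible = unassigned[order[first:last + 1]]
        chosen = rng.choice(eligible)
        seed_of_nodes[chosen] = seed
        # The unassigned neighbors of the chosen node join its seed.
        seed_of_nodes[edges[chosen] & free] = seed
    return seed_of_nodes

rng = np.random.default_rng(0)
edges = rng.random((40, 40)) < 0.1
np.fill_diagonal(edges, False)
seeds = choose_seeds_sketch(edges, max_seeds_count=5)
```

Nodes that no seed reached keep the value -1, matching the "no seed" convention described above.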
- metacells.tools.candidates.optimize_partitions(*, edge_weights: CompressedMatrix, community_of_nodes: ndarray, node_umis: ndarray, low_partition_umis: int, target_partition_umis: int, high_partition_umis: int, low_partition_size: int, target_partition_size: int, high_partition_size: int, cooldown_pass: float = 0.02, cooldown_node: float = 0.25, random_seed: int) float [source]¶
Optimize the partition to candidate metacells (communities) using the edge_weights.
Returns the score of the optimized partition.
This modifies the community_of_nodes in-place.
The goal is to maximize the “stability” goal function which is defined to be the ratio between (1) the probability that, selecting a random node and either a random outgoing edge or a random incoming edge (biased by their weights), the node connected to by that edge is in the same community (metacell) and (2) the probability that a random edge would lead to this same community (the fraction of its number of nodes out of the total).
To maximize this, we repeatedly pass over a randomized permutation of the nodes, and for each node, move it to a random “better” community. When deciding if a community is better, we consider both (1) just the “local” product of the sum of the weights of incoming and outgoing edges between the node and the current and candidate communities and (2) the effect on the “global” goal function (considering the impact on this product for all other nodes connected to the current node).
We define a notion of temperature (initially, 1 - cooldown_pass, where cooldown_pass defaults to 0.02) and we give a weight of temperature to the local score and (1 - temperature) to the global score. When we move to the next node, we multiply the temperature by 1 - cooldown_pass. If we did not move the node, we multiply its temperature by cooldown_node (default: 0.25). We skip looking at nodes which are colder than the global temperature to accelerate the algorithm. If we don’t move any node, we reduce the global temperature below that of any cold node; if there are no such nodes, we reduce it to zero to perform a final hill-climbing phase.
This simulated-annealing-like behavior helps the algorithm to escape local maxima, although of course no claim is made of achieving the global maximum of the goal function.
- metacells.tools.candidates.score_partitions(*, node_umis: ndarray, low_partition_umis: float, target_partition_umis: float, high_partition_umis: float, low_partition_size: int, target_partition_size: int, high_partition_size: int, edge_weights: CompressedMatrix, partition_of_nodes: ndarray, temperature: float, with_orphans: bool = True) None [source]¶
Compute the “stability” goal function which is defined to be the ratio between (1) the probability that, selecting a random node and either a random outgoing edge or a random incoming edge (biased by their weights), the node connected to by that edge is in the same community (metacell) and (2) the probability that a random edge would lead to this same community (the fraction of its number of nodes out of the total).
If with_orphans is True (the default), outlier nodes are included in the computation. In general we add 1e-6 to the product of the incoming and outgoing weights so we can safely log it for efficient computation; thus orphans are given a very small (non-zero) weight so the overall score is not zeroed even when including them.
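To make the ratio in the definition concrete, here is a deliberately simplified sketch on a dense weights matrix. It averages the outgoing and incoming "stay" probabilities per node and divides by the community's share of the nodes; it does not reproduce the log-of-products formulation, the 1e-6 regularization, the temperature, or the orphan handling mentioned above. The function name stability_sketch is hypothetical.

```python
import numpy as np

def stability_sketch(edge_weights, community_of_nodes):
    """Mean over nodes of P(stay in community) / P(random node is in it)."""
    size = edge_weights.shape[0]
    same = community_of_nodes[:, None] == community_of_nodes[None, :]
    inside = edge_weights * same
    out_total = edge_weights.sum(axis=1)
    in_total = edge_weights.sum(axis=0)
    # Probability of staying in the community along outgoing / incoming edges.
    stay_out = np.divide(inside.sum(axis=1), out_total,
                         out=np.zeros(size), where=out_total > 0)
    stay_in = np.divide(inside.sum(axis=0), in_total,
                        out=np.zeros(size), where=in_total > 0)
    stay = (stay_out + stay_in) / 2
    # Probability that a uniformly random node belongs to the same community.
    community_sizes = np.bincount(community_of_nodes)
    random_stay = community_sizes[community_of_nodes] / size
    return float(np.mean(stay / random_stay))

# Two fully connected triplets with no edges between them.
block = np.ones((3, 3)) - np.eye(3)
edge_weights = np.block([[block, np.zeros((3, 3))],
                         [np.zeros((3, 3)), block]])
good = stability_sketch(edge_weights, np.array([0, 0, 0, 1, 1, 1]))
bad = stability_sketch(edge_weights, np.array([0, 1, 0, 1, 0, 1]))
```

The partition that follows the two triplets scores higher than the interleaved one, which is the behavior the goal function is meant to reward.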
Deviants¶
- metacells.tools.deviants.find_deviant_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, candidates: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'candidate', min_gene_fold_factor: float = 3.0, min_compare_umis: int = 8, gap_skip_cells: int = 1, min_noisy_gene_fold_factor: float = 2.0, max_gene_fraction: float = 0.03, max_cell_fraction: float | None = 0.25, max_gap_cells_count: int = 3, max_gap_cells_fraction: float = 0.1, cells_regularization_quantile: float = 0.25, policy: str = 'gaps') ndarray | Collection[int] | Collection[float] | PandasSeries [source]¶
Find cells which have significantly different gene expression from the metacells they belong to, based on what (default: __x__) data.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Obeys (ignores the genes of) the noisy_gene per-gene (variable) annotation, if any.
The exact method depends on the policy (one of gaps or max). By default we use the gaps policy as it gives a much lower fraction of deviants at a minor cost in the variance inside each metacell. The max policy provides the inverse trade-off, giving slightly more consistent metacells at the cost of a much higher fraction of deviants.
Returns
A boolean mask of all the cells which should be considered “deviant”.
Gaps Computation Parameters
Intuitively, for each gene for each metacell we can look at the sorted expression level of the gene in all the metacell’s cells. We look for a large gap between a few low-expressing or high-expressing cells and the rest of the cells. If we find such a gap, the few cells below or above it are considered to be deviants.
1. For each gene in each cell of each metacell, compute the log (base 2) of the fraction of the gene’s UMIs out of the total UMIs of the metacell, with a 1-UMI regularization factor.
2. Sort the expression level of each gene in each metacell.
3. Look for a gap of at least min_gene_fold_factor (default: 3.0), or for noisy_gene, an additional min_noisy_gene_fold_factor (default: 2.0) between the sorted gene expressions. If gap_skip_cells (default: 1) is 0, look for a gap between consecutive sorted cell expression levels. If it is 1 or 2, skip this number of entries. Ignore gaps if the total number of UMIs of the gene in the two compared cells is less than min_compare_umis (default: 8).
4. Ignore gaps that cause more than max_gap_cells_fraction (default: 0.1) and also more than max_gap_cells_count (default: 3) to be separated. That is, a single gene can only mark as deviants “a few” cells of the metacell.
5. If any cells were marked as deviants, re-run the above, ignoring any cells previously marked as deviants.
6. If the total number of deviant cells is more than max_cell_fraction (default: 0.25) of the cells, increase min_gene_fold_factor by 0.15 (~x1.1) and try again from the top.
Max Computation Parameters
1. Compute for each candidate metacell the median fraction of the UMIs expressed by each gene. Scale this by each cell’s total UMIs to compute the expected number of UMIs for each cell. Compute the fold factor log2((actual UMIs + 1) / (expected UMIs + 1)) for each gene for each cell.
2. Compute the excess fold factor for each gene in each cell by subtracting min_gene_fold_factor (default: 3.0) from the above. For noisy_gene, also subtract min_noisy_gene_fold_factor.
3. For each cell, consider the maximal gene excess fold factor. Consider all cells with a positive maximal excess fold factor as deviants. If more than max_cell_fraction (default: 0.25) of the cells have a positive maximal excess fold factor, increase the threshold from 0 so that only this fraction are marked as deviants.
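The max policy steps translate fairly directly into array operations. The sketch below handles the cells of a single candidate metacell and ignores noisy genes; applying max_cell_fraction within one metacell (rather than over all cells) and the function name max_policy_deviants are simplifying assumptions made for the example.

```python
import numpy as np

def max_policy_deviants(umis, min_gene_fold_factor=3.0, max_cell_fraction=0.25):
    """umis: cells x genes UMIs of the cells of one candidate metacell."""
    totals = umis.sum(axis=1, keepdims=True)
    fractions = umis / totals
    # Median fraction of each gene across the metacell's cells.
    median_fraction = np.median(fractions, axis=0)
    expected = median_fraction[None, :] * totals
    fold = np.log2((umis + 1) / (expected + 1))
    excess = fold - min_gene_fold_factor
    max_excess = excess.max(axis=1)
    threshold = 0.0
    if (max_excess > 0).mean() > max_cell_fraction:
        # Raise the threshold so only the allowed fraction remains deviant.
        threshold = np.quantile(max_excess, 1 - max_cell_fraction)
    return max_excess > threshold

umis = np.tile(np.array([100, 100, 100, 100, 0]), (20, 1))
umis[0, 4] = 100      # only the first cell expresses the last gene
deviants = max_policy_deviants(umis)
```

In this toy metacell the first cell has about 100 times more UMIs of the last gene than expected, far above the 2^3 threshold, so it is the only deviant.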
Dissolve¶
- metacells.tools.dissolve.dissolve_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, candidates: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'candidate', deviants: ndarray | Collection[int] | Collection[float] | PandasSeries, target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, min_robust_size_factor: float = 0.5, min_convincing_gene_fold_factor: float | None = 3.0) None [source]¶
Dissolve too-small metacells based on what (default: __x__) data.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
Sets the following in adata:
- Observation (Cell) Annotations
metacell
The integer index of the metacell each cell belongs to. The metacells are in no particular order. Cells with no metacell assignment are given a metacell index of -1.
dissolved
A boolean mask of the cells which were in a dissolved metacell.
Computation Parameters
1. If cell_umis is not specified, use the sum of the what data for each cell.
2. Mark all deviants cells “outliers”. This can be the name of a per-observation (cell) annotation, or an explicit boolean mask of cells, or None if there are no deviant cells to mark.
3. Any metacell which has fewer cells than the min_metacell_size is dissolved into outlier cells.
4. If min_convincing_gene_fold_factor is None, preserve everything else. Otherwise:
5. We are trying to create metacells of size target_metacell_size cells and target_metacell_umis UMIs each. Compute the UMIs of the metacells by summing the cell_umis.
6. Using min_robust_size_factor (default: 0.5), any metacell whose total size is at least target_metacell_size * min_robust_size_factor or whose total UMIs are at least target_metacell_umis * min_robust_size_factor is preserved.
7. Using min_convincing_gene_fold_factor, preserve any remaining metacells which have at least one gene whose fold factor (log2((actual + 1) / (expected_by_overall_population + 1))) is at least this high.
8. Dissolve the remaining metacells into outlier cells.
Evaluating the Metacells¶
Group¶
- metacells.tools.group.group_obs_data(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, groups: str | ndarray | Collection[int] | Collection[float] | PandasSeries, name: str | None = None, prefix: str | None = None) AnnData | None [source]¶
Compute new data which has the what (default: __x__) sum of the observations (cells) for each group.
For example, having computed a metacell index for each cell, compute the per-metacell data for further analysis.
If groups is a string, it is expected to be the name of a per-observation vector annotation. Otherwise it should be a vector. The group indices should be integers, where negative values indicate “no group” and non-negative values indicate the index of the group to which each observation (cell) belongs.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
An annotated data where each observation is the sum of the group of original observations (cells). Observations with a negative group index are discarded. If all observations are discarded, return None.
The new data will contain only:
* A single observation for each group. The name of each observation will be the optional prefix (default: None), followed by the group’s index, followed by "." and a 2-digit checksum of the grouped members.
* An X member holding the summed-per-group data.
* A new grouped per-observation data which counts, for each group, the number of grouped observations summed into it.
If name is not specified, the data will be unnamed. Otherwise, if it starts with a ".", it will be appended to the current name (if any). Otherwise, name is the new name.
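The core of the grouping (summing the rows of each group and counting its members) can be sketched with NumPy. This only illustrates the arithmetic on a dense array; the function name group_sums is hypothetical, and naming, checksums, and the AnnData wrapping described above are not shown.

```python
import numpy as np

def group_sums(data, groups):
    """Sum the rows of `data` per non-negative group index.

    Returns (sums, counts), or None if every observation is ungrouped.
    """
    groups = np.asarray(groups)
    grouped = groups >= 0
    if not grouped.any():
        return None
    groups_count = groups.max() + 1
    sums = np.zeros((groups_count, data.shape[1]), dtype=data.dtype)
    # Unbuffered add, so repeated group indices accumulate correctly.
    np.add.at(sums, groups[grouped], data[grouped])
    counts = np.bincount(groups[grouped], minlength=groups_count)
    return sums, counts

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
sums, counts = group_sums(data, [0, 1, 0, -1])   # the last cell has no group
```

The counts vector plays the role of the grouped per-observation data, and the row with a negative group index is discarded.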
- metacells.tools.group.group_obs_annotation(adata: AnnData, gdata: AnnData, *, groups: str | ndarray | Collection[int] | Collection[float] | PandasSeries, name: str, formatter: Callable[[Any], Any] | None = None, method: str = 'majority', min_value_fraction: float = 0.5, conflict: Any | None = None, inplace: bool = True) PandasSeries | None [source]¶
Transfer per-observation data from the per-observation (cell) adata to the per-group-of-observations (metacells) gdata.
Input
Annotated adata, where the observations are cells and the variables are genes, and the gdata containing the per-metacells summed data.
Returns
- Observations (Cell) Annotations
<name>
The per-group-observation annotation computed based on the per-observation annotation.
If
inplace
(default: True), this is written to thegdata
, and the function returnsNone
. Otherwise this is returned as a pandas series (indexed by the group observation names).Computation Parameters
Iterate on all the observations (groups, metacells) in gdata.
Consider all the cells whose groups annotation maps them into this group.
Consider all the name annotation values of these cells.
Compute an annotation value for the whole group of cells using the method. Supported methods are:
unique
All the values of all the cells in the group are expected to be the same; use this unique value for the whole group.
majority
Use the most common value across all cells in the group as the value for the whole group. If this value doesn’t have at least min_value_fraction (default: 0.5) of the cells, use the conflict (default: None) value instead.
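The two aggregation methods can be illustrated with a minimal sketch. The helper name aggregate_group_value is hypothetical and is not part of metacells; it only mirrors the unique and majority rules described above for the values of a single group, and the handling of an empty group is an assumption.

```python
from collections import Counter
from typing import Any, Hashable, Optional, Sequence


def aggregate_group_value(
    values: Sequence[Hashable],
    method: str = "majority",
    min_value_fraction: float = 0.5,
    conflict: Optional[Any] = None,
) -> Any:
    """Collapse the annotation values of one group's cells into a single value."""
    if len(values) == 0:
        # Assumption: a group without cells has no meaningful value.
        return conflict
    if method == "unique":
        distinct = set(values)
        if len(distinct) != 1:
            raise ValueError(f"expected a unique value, found {len(distinct)}")
        return values[0]
    if method == "majority":
        value, count = Counter(values).most_common(1)[0]
        # Fall back to the conflict value when the winner is not common enough.
        return value if count >= min_value_fraction * len(values) else conflict
    raise ValueError(f"unknown method: {method}")
```

For example, a group whose cells are labeled T, T and B would receive T, while a group split evenly three ways would receive the conflict value.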
Quality¶
- metacells.tools.quality.compute_stdev_logs(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, min_gene_total: int = 40, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell') None [source]¶
Compute the standard deviation of the log (base 2) of the fraction of each gene in the cells of the metacell.
Ideally, the standard deviation should be ~1/3rd of the deviants_min_gene_fold_factor (which is 3 by default), indicating that (almost) all cells are within that maximal fold factor. In practice we may see higher values.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation (UMIs) matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, gdata is assumed to have one (fraction) observation for each metacell, a total_umis per metacell, and use the same genes as adata.
Returns
Sets the following in gdata:
Per-Variable Per-Observation (Gene-Cell) Annotations
inner_stdev_log
For each gene and metacell, the standard deviation of the log (base 2) of the fraction of the gene across the cells of the metacell, if it has a sufficient number of UMIs to make this meaningful (otherwise, is 0).
Computation Parameters
For each metacell:
Compute the log (base 2) of the fractions of the UMIs of each gene in each cell, regularized by 1 UMI.
Compute the standard deviation of these logs for each gene across all cells of each metacell.
- metacells.tools.quality.compute_projected_folds(qdata: AnnData, from_query_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', fold_regularization: float = 1e-05, min_significant_gene_umis: float = 40) None [source]¶
Compute the projected fold factors of genes for each query metacell.
This computes, for each metacell of the query, the fold factors between its corrected gene fractions and the gene fractions of its projection onto the atlas (see metacells.tools.project.compute_projection_weights()).
Input
Annotated query qdata, where the observations are query metacells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, the data should contain the projected UMIs of each query metacell onto the atlas.
Returns
Sets the following in qdata:
- Per-Variable Per-Observation (Gene-Cell) Annotations
projected_fold
For each gene and query metacell, the fold factor of this gene between the query and its projection.
Computation Parameters
For each group (metacell), for each gene, compute the gene’s fold factor log2((from_query_layer + fold_regularization) / (to_query_layer + fold_regularization)), where from_query_layer defaults to corrected_fraction, to_query_layer defaults to projected_fraction, and fold_regularization defaults to 1e-05, similarly to metacells.tools.project.compute_projection_weights().
Set the fold factor to zero for every case where the total UMIs of the gene in the query metacell are not at least min_significant_gene_umis (default: 40).
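As a rough illustration of the two steps above, the following sketch works on plain NumPy arrays instead of AnnData layers. The function name projected_folds and the total_umis argument (a per-metacell-per-gene matrix of UMI totals) are hypothetical stand-ins for data the real function reads from qdata.

```python
import numpy as np


def projected_folds(
    from_fractions,
    to_fractions,
    total_umis,
    fold_regularization=1e-5,
    min_significant_gene_umis=40,
):
    """Regularized log2 fold between query and projected gene fractions."""
    from_fractions = np.asarray(from_fractions, dtype=float)
    to_fractions = np.asarray(to_fractions, dtype=float)
    folds = np.log2(
        (from_fractions + fold_regularization)
        / (to_fractions + fold_regularization)
    )
    # Entries backed by too few UMIs are not considered significant.
    folds[np.asarray(total_umis) < min_significant_gene_umis] = 0.0
    return folds
```

The regularization keeps the ratio finite when a fraction is zero, and it damps the folds of very weakly expressed genes.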
- metacells.tools.quality.compute_similar_query_metacells(qdata: AnnData, max_projection_fold_factor: float = 3.0, max_projection_noisy_fold_factor: float = 2.0, min_fitted_query_marker_genes: float = 0, max_misfit_genes: int = 3, essential_genes_property: None | str | Collection[str] = None, min_essential_genes: int | None = None, fitted_genes_mask: ndarray | None = None) None [source]¶
Mark query metacells that are “similar” to their projection on the atlas.
This does not guarantee the query metacell is “the same as” its projection on the atlas; rather, it means the two are “sufficiently similar” that one can be reasonably confident in applying atlas metadata to the query metacell based on the projection.
Input
Annotated query qdata, where the observations are metacells and the variables are genes.
The data should contain per-observation-per-variable annotations projected_fold with the significant projection fold factors, as computed by compute_projected_folds(). If min_essential_genes and essential_genes_property are specified, then the data may contain additional per-variable (gene) mask(s) denoting the essential genes.
If a projected_noisy_gene mask exists, then the genes in it allow for a higher fold factor than normal genes.
Returns
Sets the following in qdata:
Per-Observation (Cell) Annotations
similar
A boolean mask indicating the query metacell is similar to its projection in the atlas.
- Per-Variable Per-Observation (Gene-Cell) Annotations
misfit
Whether the gene has a too-high fold factor between the query and its projection in the atlas.
Computation Parameters
If fitted_genes_mask is not None, restrict the analysis to the genes listed in it.
Mark as dissimilar any query metacells which have more than max_misfit_genes (default: 3) genes whose projection fold is above max_projection_fold_factor (default: 3.0), or, for genes in projected_noisy_gene, above an additional max_projection_noisy_fold_factor (default: 2.0).
Mark as dissimilar any query metacells which did not fit at least min_fitted_query_marker_genes (default: 0) of the query marker genes.
If essential_genes_property and min_essential_genes are specified, the former should be the name(s) of boolean per-gene property/ies, and we will mark as dissimilar any query metacells which have at least this number of essential genes with a low projection fold factor.
- metacells.tools.quality.compute_outliers_matches(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', most_similar: str = 'most_similar', value_regularization: float = 1e-05, reproducible: bool) None [source]¶
Given an assignment of observations (cells) to groups (metacells), compute for each outlier the “most similar” group.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. Note that there’s no requirement that the gdata will contain the groups defined in adata. That is, it is possible to give query cells data in adata and atlas metacells in gdata to find the most similar atlas metacell for each outlier query cell.
Returns
Sets the following in adata:
Per-Observation (Cell) Annotations
most_similar (default: most_similar)
For each observation (cell), the index of the “most similar” group.
Computation Parameters
Compute the log2 of the fraction of each gene in each of the outlier cells and the group metacells using the value_regularization (default: 1e-05).
Cross-correlate each of the outlier cells with each of the group metacells, in a reproducible manner.
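The two steps can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name most_similar_groups is hypothetical, the inputs are dense UMI matrices (rows are cells or metacells, columns are genes) with non-zero row totals, and np.corrcoef stands in for whatever reproducible correlation routine the library uses.

```python
import numpy as np


def most_similar_groups(outlier_umis, group_umis, value_regularization=1e-5):
    """Return, for each outlier row, the index of the best-correlated group row."""
    outlier_umis = np.asarray(outlier_umis, dtype=float)
    group_umis = np.asarray(group_umis, dtype=float)

    def log_fractions(umis):
        fractions = umis / umis.sum(axis=1, keepdims=True)
        return np.log2(fractions + value_regularization)

    outlier_logs = log_fractions(outlier_umis)
    group_logs = log_fractions(group_umis)
    # np.corrcoef stacks both sets of rows; keep only the outlier-vs-group block.
    count = outlier_logs.shape[0]
    correlations = np.corrcoef(outlier_logs, group_logs)[:count, count:]
    return correlations.argmax(axis=1)
```

Working in log space keeps a handful of very highly expressed genes from dominating the correlation.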
- metacells.tools.quality.compute_deviant_folds(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', most_similar: str | ndarray | Collection[int] | Collection[float] | PandasSeries | None = 'most_similar', min_gene_total: int = 40) None [source]¶
Given an assignment of observations (cells) to groups (metacells) or, if an outlier, to the most similar groups, compute for each observation and gene the fold factor relative to its group for the purpose of detecting deviant cells.
Ideally, all grouped cells would have no genes with high enough fold factors to be considered deviants, and all outlier cells would. In practice grouped cells might have a (few) such genes due to the restriction on the fraction of deviants.
It is important not to read too much into the results for a single cell, but looking at which genes appear for cell populations (e.g., cells with specific metadata such as batch identification) might be instructive.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, gdata is assumed to have one observation for each group, and use the same genes as adata.
Returns
Sets the following in adata:
Per-Variable Per-Observation (Gene-Cell) Annotations
deviant_fold
The fold factor between the cell’s UMIs and the expected number of UMIs for the purpose of computing deviant cells.
Computation Parameters
For each cell, compute the expected UMIs for each gene given the fraction of the gene in the metacell associated with the cell (the one it belongs to, or the most similar one for outliers).
If the number of UMIs in the metacell (for grouped cells), or the sum of the UMIs of the gene in an outlier cell and the metacell, is less than min_gene_total (default: 40), set the fold factor to 0 as we do not have sufficient data to robustly estimate it.
- metacells.tools.quality.compute_inner_folds(*, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell') None [source]¶
Given adata with computed deviant_fold for each gene for each cell, set inner_fold in gdata to, for each gene for each metacell, the deviant_fold with the maximal absolute value.
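A minimal sketch of this reduction, assuming dense arrays rather than AnnData layers. The name inner_folds and its arguments are hypothetical; metacell_of_cell holds the group index of each cell, with negative values for outliers, which are simply never selected.

```python
import numpy as np


def inner_folds(deviant_folds, metacell_of_cell, metacells_count):
    """Per metacell and gene, keep the signed fold with the largest magnitude."""
    deviant_folds = np.asarray(deviant_folds, dtype=float)
    metacell_of_cell = np.asarray(metacell_of_cell)
    genes_count = deviant_folds.shape[1]
    result = np.zeros((metacells_count, genes_count))
    for metacell in range(metacells_count):
        folds = deviant_folds[metacell_of_cell == metacell]
        if folds.shape[0] == 0:
            continue  # No cells in this metacell; leave zeros.
        # Row index of the strongest fold for each gene (column).
        strongest = np.abs(folds).argmax(axis=0)
        result[metacell] = folds[strongest, np.arange(genes_count)]
    return result
```

Note that the sign is preserved, so a strongly under-expressed gene yields a negative inner fold.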
- metacells.tools.quality.compute_type_genes_normalized_variances(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group_property: str = 'metacell', type_property: str = 'type', type_gene_normalized_variance_quantile: float = 0.95) None [source]¶
Given metacells annotated data with type annotations, compute for each gene for each type how variable it is in the cells of the metacells of that type.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. This should have a type annotation.
Returns
Sets the following in gdata:
Per-Variable (gene) Annotations:
normalized_variance_in_<type>
For each type, the normalized variance (variance over mean) of the gene in the cells of the metacells of this type.
Computation Parameters
For each type_property (default: type) of metacell in gdata, for each metacell of this type, consider all the cells in adata whose group_property (default: metacell) is that metacell, and compute the normalized variance (variance over mean) of each gene’s expression level, when normalizing each cell’s total UMIs to the median in its metacell.
Take the type_gene_normalized_variance_quantile (default: 0.95) of the normalized variance of each gene across all metacells of each type.
- metacells.tools.quality.compute_outliers_fold_factors(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, most_similar: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'most_similar', min_gene_total: int = 40) None [source]¶
Given annotated data which is a slice containing just the outliers, where each has a “most similar” group, compute for each observation and gene the fold factor relative to its group.
All outliers should have at least one (typically several) genes with high fold factors, which are the reason they couldn’t be merged into their most similar group.
Input
Annotated adata, where the observations are outlier cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
In addition, gdata is assumed to have one observation for each group, and use the same genes as adata. It should have a marker_gene mask.
Returns
Sets the following in adata:
Per-Variable Per-Observation (Gene-Cell) Annotations
<most_similar>_fold (default: most_similar_fold)
The fold factor between the outlier’s gene expression and the gene’s expression in the most similar group (unless the value is too low to be of interest, in which case it will be zero).
Computation Parameters
For each outlier, compute the expected UMIs for each gene given the fraction of the gene in the metacell associated with the outlier by the most_similar (default: most_similar).
If the sum of the UMIs of the gene in the cell and the metacell is less than min_gene_total (default: 40), set the fold factor to 0 as we do not have sufficient data to robustly estimate it.
- metacells.tools.quality.count_significant_inner_folds(adata: AnnData, *, min_gene_fold_factor: float = 3.0) None [source]¶
Given grouped (metacells) data, count for each gene in how many metacells there is at least one cell with a fold factor above some threshold.
Input
Annotated adata, where the observations are metacells and the variables are genes, with an inner_fold layer (as computed by compute_inner_folds).
Returns
Sets the significant_inner_folds_count annotation, counting for each gene the number of metacells where the inner_fold is at least min_gene_fold_factor (default: 3.0), that is, where at least one cell in the metacell has a high fold factor for the gene’s expression compared to the estimated overall gene expression in the metacell.
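The count amounts to a thresholded column sum. The sketch below assumes a dense metacells-by-genes inner_fold matrix and follows the wording literally (no absolute value is taken); the function name is hypothetical.

```python
import numpy as np


def significant_inner_folds_count(inner_fold, min_gene_fold_factor=3.0):
    """Count, per gene (column), the metacells (rows) reaching the threshold."""
    inner_fold = np.asarray(inner_fold, dtype=float)
    return (inner_fold >= min_gene_fold_factor).sum(axis=0)
```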
Distincts¶
- metacells.tools.distinct.compute_distinct_folds(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, normalization: float = 0, inplace: bool = True) PandasFrame | None [source]¶
Compute for each observation (cell) and each variable (gene) how much the what (default: __x__) value differs from the overall population.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Per-Observation-Per-Variable (Cell-Gene) Annotations:
distinct_ratio
For each gene in each cell, the log (base 2) of the ratio between the fraction of the gene in the cell and the fraction of the gene in the overall population (sum of cells).
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as a pandas frame (indexed by the observation and distinct gene rank).
Computation Parameters
Compute, for each gene, the fraction of the gene’s values out of the total sum of the values (that is, the mean fraction of the gene’s expression in the population).
Compute, for each cell, for each gene, the fraction of the gene’s value out of the sum of the values in the cell (that is, the fraction of the gene’s expression in the cell).
Divide the two to obtain the distinct ratio (that is, how much the gene’s expression in the cell is different from the overall population), first adding the normalization (default: 0) to both.
Compute the log (base 2) of the result and use it as the fold factor.
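The four steps above can be sketched on a dense cells-by-genes matrix. The function name distinct_folds is hypothetical, and the sketch assumes every cell has a non-zero total.

```python
import numpy as np


def distinct_folds(values, normalization=0.0):
    """log2 of each cell's gene fraction relative to the population fraction."""
    values = np.asarray(values, dtype=float)
    # Fraction of each gene out of the total of all values in the population.
    population_fractions = values.sum(axis=0) / values.sum()
    # Fraction of each gene out of the total of its own cell.
    cell_fractions = values / values.sum(axis=1, keepdims=True)
    ratios = (cell_fractions + normalization) / (population_fractions + normalization)
    return np.log2(ratios)
```

With the default normalization of 0, a gene that is absent from a cell gets a fold of minus infinity, so a small positive normalization is usually the more practical choice.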
- metacells.tools.distinct.find_distinct_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = 'distinct_fold', *, distinct_genes_count: int = 20, inplace: bool = True) Tuple[PandasFrame, PandasFrame] | None [source]¶
Find for each observation (cell) the genes in which its what (default: distinct_fold) value is most distinct from the general population. This is typically applied to the metacells data rather than to the cells data.
Input
Annotated adata, where the observations are (meta)cells and the variables are genes, including per-observation-per-variable annotated fold data (what, default: distinct_fold), e.g. as computed by compute_distinct_folds().
Returns
- Observation-Any (Cell) Annotations
cell_distinct_gene_indices
For each cell, the indices of its top distinct_genes_count genes.
cell_distinct_gene_folds
For each cell, the fold factor of its top distinct_genes_count genes.
If inplace (default: True), this is written to the data, and the function returns None. Otherwise this is returned as two pandas frames (indexed by the observation and distinct gene rank).
Computation Parameters
Fetch the previously computed per-observation-per-variable what data.
Keep the distinct_genes_count (default: 20) top absolute fold factors.
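Selecting the top absolute folds per row can be sketched as below. The function name top_distinct_genes is hypothetical and the input is assumed to be a dense cells-by-genes fold matrix.

```python
import numpy as np


def top_distinct_genes(folds, distinct_genes_count=20):
    """Return (gene indices, signed folds) of the largest-magnitude folds per row."""
    folds = np.asarray(folds, dtype=float)
    # Sort each row by decreasing absolute fold; a stable sort keeps ties in gene order.
    order = np.argsort(-np.abs(folds), axis=1, kind="stable")
    indices = order[:, :distinct_genes_count]
    return indices, np.take_along_axis(folds, indices, axis=1)
```

Both outputs have one row per cell and one column per rank, which matches the "indexed by the observation and distinct gene rank" description.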
- metacells.tools.distinct.compute_subset_distinct_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, prefix: str | None = None, scale: bool | str | ndarray | None, subset: str | ndarray, normalization: float) Tuple[PandasSeries, PandasSeries] | None [source]¶
Given a subset of the observations (cells), compute for each gene how distinct its what (default: __x__) value is in the subset compared to the overall population.
This is the area-under-curve of the receiver operating characteristic (AUROC) for the gene, that is, the probability that a randomly selected observation (cell) in the subset will have a higher value than a randomly selected observation (cell) outside the subset.
Input
Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.
Returns
- Variable (Gene) Annotations
<prefix>_fold
Store the ratio of the expression of the gene in the subset as opposed to the rest of the population.
<prefix>_auroc
Store the distinctiveness of the gene in the subset as opposed to the rest of the population.
If prefix (default: None) is specified, this is written to the data. Otherwise this is returned as two pandas series (indexed by the gene names).
Computation Parameters
Use the subset to assign a boolean label to each observation (cell). The subset can be a vector of integer observation names, or a boolean mask, or the string name of a per-observation annotation containing the boolean mask.
If scale is False, use the data as-is. If it is True, divide the data by the sum of each observation (cell). If it is a string, it should be the name of a per-observation annotation to use. Otherwise, it should be a vector of the scale factor for each observation (cell).
Compute the fold ratios using the normalization (no default!) and the AUROC for each gene, for the scaled data based on this mask.
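The probabilistic definition of the AUROC given above can be computed directly for a single gene by comparing every in-subset value with every out-of-subset value. This brute-force sketch uses a hypothetical function name and counts ties as half a win, which is the usual convention but is an assumption here; it is quadratic in the number of cells and is meant only to make the definition concrete.

```python
import numpy as np


def subset_auroc(values, subset_mask):
    """Probability that a subset value exceeds a non-subset value (ties count half)."""
    values = np.asarray(values, dtype=float)
    subset_mask = np.asarray(subset_mask, dtype=bool)
    inside = values[subset_mask]
    outside = values[~subset_mask]
    # Compare all (inside, outside) pairs via broadcasting.
    wins = (inside[:, None] > outside[None, :]).sum()
    ties = (inside[:, None] == outside[None, :]).sum()
    return (wins + 0.5 * ties) / (inside.size * outside.size)
```

A value of 1.0 means the gene perfectly separates the subset from the rest, and 0.5 means it carries no information about subset membership.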
Visualizing the Metacells¶
Layout¶
- metacells.tools.layout.umap_by_distances(adata: AnnData, distances: str | ndarray | CompressedMatrix = 'umap_distances', *, prefix: str = '', k: int = 15, dimensions: int = 2, min_dist: float = 0.5, spread: float = 1.0, random_seed: int) None [source]¶
Compute layout for the observations using UMAP, based on a distances matrix.
Input
The input annotated adata is expected to contain a per-observation-per-observation property distances (default: umap_distances), which describes the distance between each two observations (cells). The distances must be non-negative, symmetrical, and zero for self-distances (on the diagonal).
Returns
Sets the following annotations in adata:
- Observation (Cell) Annotations
<prefix>x, <prefix>y
Coordinates for UMAP 2D projection of the observations (if dimensions is 2).
<prefix>u, <prefix>v, <prefix>w
Coordinates for UMAP 3D projection of the observations (if dimensions is 3).
Computation Parameters
Invoke UMAP to compute a layout of some dimensions (default: 2D) using min_dist (default: 0.5), spread (default: 1.0) and k (default: 15). If the spread is lower than the minimal distance, it is raised. If random_seed is not zero, then it is passed to UMAP to force the computation to be reproducible. However, this means UMAP will use a single-threaded implementation that will be slower.
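Before calling the function it can be useful to check that a distances matrix satisfies the stated requirements. The helper below is a hypothetical pre-flight check, not part of metacells; it also returns the spread that would be used, assuming that "raised" means raised up to min_dist.

```python
import numpy as np


def check_umap_inputs(distances, min_dist=0.5, spread=1.0):
    """Validate a dense distances matrix and return the effective spread."""
    distances = np.asarray(distances, dtype=float)
    if distances.ndim != 2 or distances.shape[0] != distances.shape[1]:
        raise ValueError("distances must be a square matrix")
    if (distances < 0).any():
        raise ValueError("distances must be non-negative")
    if not np.allclose(distances, distances.T):
        raise ValueError("distances must be symmetrical")
    if not np.allclose(np.diag(distances), 0.0):
        raise ValueError("self-distances must be zero")
    # Assumption: a spread below min_dist is raised to min_dist.
    return max(spread, min_dist)
```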
- metacells.tools.layout.spread_coordinates(adata: AnnData, *, prefix: str = '', suffix: str = '_spread', cover_fraction: float = 0.3333333333333333, noise_fraction: float = 0.1, random_seed: int) None [source]¶
Move UMAP points so they cover some fraction of the plot area without overlapping.
Input
The input annotated adata is expected to contain the per-observation properties <prefix>x and <prefix>y (default prefix: the empty string) which contain the UMAP coordinates.
Returns
Sets the following annotations in adata:
- Observation (Cell) Annotations
<prefix>x<suffix>, <prefix>y<suffix> (default suffix: _spread)
The new coordinates which will be spread out so the points do not overlap and cover some fraction of the total plot area.
Computation Parameters
Move the points so they cover cover_fraction (default: 0.3333333333333333) of the total plot area. Also add a noise of the noise_fraction (default: 0.1) of the minimal distance between the points. A non-zero random_seed will make this reproducible.
Projecting onto Metacells¶
Project¶
- metacells.tools.project.compute_projection_weights(*, adata: AnnData, qdata: AnnData, from_atlas_layer: str = 'corrected_fraction', from_query_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', log_data: bool = True, fold_regularization: float = 1e-05, min_significant_gene_umis: float = 40, max_consistency_fold_factor: float = 2.0, candidates_count: int = 50, min_candidates_fraction: float = 0.3333333333333333, min_usage_weight: float = 1e-05, second_anchor_indices: List[int] | None = None, reproducible: bool) CompressedMatrix [source]¶
Compute the weights and results of projecting a query onto an atlas.
Input
Annotated query qdata and atlas adata, where the observations are cells and the variables are genes. The atlas should contain from_atlas_layer (default: corrected_fraction) containing gene fractions, and the query should similarly contain from_query_layer (default: corrected_fraction) containing gene fractions.
Returns
A CSR matrix whose rows are query metacells and columns are atlas metacells, where each entry is the weight of the atlas metacell in the projection of the query metacells. The sum of weights in each row (that is, for a single query metacell) is 1. The weighted sum of the atlas metacells using these weights is the “projected” image of the query metacell onto the atlas.
In addition, sets the following annotations in qdata:
- Observation (Cell) Annotations
similar
A boolean mask indicating whether the query metacell is similar to its projection onto the atlas. If False, the metacell is said to be “dissimilar”, which may indicate the query contains cell states that do not appear in the atlas.
- Observation-Variable (Cell-Gene) Annotations
to_query_layer (default: projected_fraction)
A matrix of gene fractions describing the “projected” image of the query metacell onto the atlas. This projection is a weighted average of some atlas metacells (using the computed weights returned by this function).
Computation Parameters
All fold computations (log2 of the ratio between gene fractions) use the fold_regularization (default: 1e-05).
For each query metacell:
Correlate the metacell with all the atlas metacells, and pick the highest-correlated one as the “anchor”. If second_anchor_indices is not None, then the qdata must contain only a single query metacell, and is expected to contain a projected per-observation-per-variable matrix containing the projected image of this query metacell on the atlas using a single anchor. The code will compute the residual of the query and the atlas relative to this projection and pick a second atlas anchor whose residuals are the most correlated to the query metacell’s residuals. If reproducible, a slower (still parallel) but reproducible algorithm will be used.
Consider (for each anchor) the candidates_count (default: 50) candidate metacells with the highest correlation with the query metacell.
Keep as candidates only atlas metacells whose maximal gene fold factor compared to the anchor(s) is at most max_consistency_fold_factor (default: 2.0). Keep at least min_candidates_fraction (default: 0.3333333333333333) of the original candidates even if they are less consistent. For this computation, ignore the fold factors of genes whose sum of UMIs in the anchor(s) and the candidate metacells is less than min_significant_gene_umis (default: 40).
Compute the non-negative weights (with a sum of 1) of the selected candidates that give the best projection of the query metacells onto the atlas. If log_data (default: True), try to fit the log (base 2) of the fractions, otherwise, try to fit the fractions themselves. Since the algorithm for computing these weights rarely produces an exact 0 weight, reduce all weights less than the min_usage_weight (default: 1e-05) to zero. If second_anchor_indices is not None, it is set to the list of indices of the used atlas metacells candidates correlated with the second anchor.
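The first two steps (picking the anchor and the initial candidates) can be sketched for a single query metacell as follows. This is a simplified illustration: pick_candidates is a hypothetical name, the inputs are assumed to be already log-transformed dense vectors, and the consistency filtering and weight fitting steps are not shown.

```python
import numpy as np


def pick_candidates(query_logs, atlas_logs, candidates_count=50):
    """Return (anchor index, candidate indices) ranked by correlation with the query."""
    query_logs = np.asarray(query_logs, dtype=float)
    atlas_logs = np.asarray(atlas_logs, dtype=float)
    # Pearson correlation of the query with each atlas metacell (row).
    correlations = np.array(
        [np.corrcoef(query_logs, atlas_row)[0, 1] for atlas_row in atlas_logs]
    )
    order = np.argsort(-correlations, kind="stable")
    # The anchor is the best-correlated atlas metacell and is always a candidate.
    return int(order[0]), order[:candidates_count]
```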
- metacells.tools.project.compute_projected_fractions(*, adata: AnnData, qdata: AnnData, from_atlas_layer: str = 'corrected_fraction', to_query_layer: str = 'projected_fraction', log_data: bool = True, fold_regularization: float = 1e-05, weights: ndarray | CompressedMatrix) None [source]¶
Compute the projected image of a query on an atlas.
Input
Annotated query qdata and atlas adata, where the observations are cells and the variables are genes. The atlas should contain from_atlas_layer (default: corrected_fraction) containing gene fractions.
Returns
Sets to_query_layer (default: projected_fraction) in the query containing the gene fractions of the projection of the atlas fractions using the weights matrix.
Note
It is important to use the same log_data value as that given to compute_projection_weights to compute the weights (default: True).
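To see why the log_data value must match, consider the following sketch of applying a weights matrix to atlas fractions. It is an assumption about how log_data enters the computation, not the library's code: with log_data the weighted average is taken over regularized log2 fractions and converted back, otherwise over the fractions themselves. The function name and the dense inputs are hypothetical.

```python
import numpy as np


def projected_fractions(weights, atlas_fractions, log_data=True, fold_regularization=1e-5):
    """Weighted combination of atlas rows, in log space or in linear space."""
    weights = np.asarray(weights, dtype=float)
    atlas_fractions = np.asarray(atlas_fractions, dtype=float)
    if not log_data:
        return weights @ atlas_fractions
    # Assumption: average the regularized logs, then undo the transformation.
    atlas_logs = np.log2(atlas_fractions + fold_regularization)
    projected = 2.0 ** (weights @ atlas_logs) - fold_regularization
    return np.maximum(projected, 0.0)
```

The two modes agree only when a query metacell is projected onto a single atlas metacell; otherwise the log-space result is a weighted geometric mean and is generally smaller than the linear weighted mean.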
- metacells.tools.project.convey_atlas_to_query(*, adata: AnnData, qdata: AnnData, weights: ndarray | CompressedMatrix, property_name: str, formatter: Callable[[Any], Any] | None = None, to_property_name: str | None = None, method: Callable[[ndarray | Collection[int] | Collection[float] | PandasSeries, ndarray | Collection[int] | Collection[float] | PandasSeries], Any] = <function highest_weight>) None [source]¶
Convey the value of a property from per-observation atlas data to per-observation query data.
The input annotated adata is expected to contain a per-observation (cell) annotation named property_name. Given the weights matrix, where each row specifies the weights of the atlas metacells used to project a single query metacell, this will generate a new per-observation (group) annotation in qdata, named to_property_name (by default, the same as property_name), containing the aggregated value of the property of all the observations (cells) that belong to the group.
The aggregation method (by default, metacells.utilities.computation.highest_weight()) is any function taking two arrays, weights and values, and returning a single value.
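Any callable with the (weights, values) shape described above can serve as the method. The sketch below defines a hypothetical aggregation, highest_total_weight, which sums the weights of equal values and returns the value with the largest total; this is an illustrative stand-in and is not claimed to match the behavior of the library's highest_weight(). The convey helper, also hypothetical, shows how such a method would be applied to each row of a dense weights matrix.

```python
import numpy as np


def highest_total_weight(weights, values):
    """Return the value whose weights sum to the largest total."""
    totals = {}
    for weight, value in zip(weights, values):
        totals[value] = totals.get(value, 0.0) + float(weight)
    return max(totals, key=totals.get)


def convey(weights_matrix, atlas_values, method=highest_total_weight):
    """Apply the aggregation method to each query row of the weights matrix."""
    return [method(row, atlas_values) for row in np.asarray(weights_matrix, dtype=float)]
```

For a categorical property such as a cell type, this kind of aggregation assigns each query metacell the type that dominates its projection.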