Pipeline¶
Default metacell processing pipeline.
The functions here are thin wrappers which invoke a series of steps (metacells.tools
) to
provide a complete pipeline for computing metacells for your data.
This pipeline can be configured by tweaking the (many) parameters, but for any deeper customization (adding and/or removing steps) just provide your own pipeline instead. You can use the implementation here as a starting point.
All the functions included here are exported under metacells.pl
.
Detect coarse relations between genes and lateral genes based on
what
(default: __x__) data.This is a quick-and-dirty way to look for genes highly correlated with lateral genes.
Input
Annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. The data should contain alateral_gene
mask containing some known-to-be lateral genes.Returns
- Variable-pair (Gene) Annotations
lateral_genes_similarity
The similarity between each genes related to lateral genes.
- Variable (Gene) Annotations
lateral_genes_module
The index of the related to lateral gene module for each gene.
Computation Parameters
If we have more than
max_sampled_cells
(default: 10000), pick this number of random cells from the data using therandom_seed
.Start by marking as “related genes” all the genes listed in
lateral_gene
.Pick as candidates genes whose mean fraction in the population is at least the (very low)
min_mean_gene_fraction
(default: 1e-05) (and that are not lateral genes).Compute the correlation between the related genes and the candidate genes. Move from the candidate genes to the lateral genes any genes whose correlation is at least
min_gene_correlation
(default: 0.1).Repeat step 4 until no additional related genes are found.
Compute the similarity between the all related genes using
metacells.tools.similarity.compute_var_var_similarity()
using thegenes_similarity_method
(default: abs_pearson).Create a hierarchical clustering of the candidate genes using the
genes_cluster_method
(default: ward).Identify gene modules in the hierarchical clustering by bottom-up combining genes as long as we have at most
max_genes_of_modules
(default: 64). Note that this may leave modules with a smaller number of genes, and modules that do not contain any lateral genes.
Exclude¶
Raw single-cell RNA sequencing data is notoriously noisy and “dirty”. The pipeline steps here performs initial analysis of the data and exclude some of it, so it would not harm the metacells computation. The steps provided here are expected to be generically useful, but as always specific data sets may require custom cleaning steps on a case-by-case basis.
- metacells.pipeline.exclude.exclude_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, bursty_max_sampled_cells: int | None = 10000, bursty_downsample_min_samples: int = 750, bursty_downsample_min_cell_quantile: float = 0.5, bursty_downsample_max_cell_quantile: float = 0.05, bursty_min_gene_total: int = 100, bursty_min_gene_normalized_variance: float = 5.656854249492381, bursty_max_gene_similarity: float = 0.1, properly_sampled_min_gene_total: int | None = 1, excluded_gene_names: Collection[str] | None = None, excluded_gene_patterns: Collection[str | Pattern] | None = None, random_seed: int) None [source]¶
Exclude a subset of the genes from the metacells computation.
You can also just manually set the
excluded_gene
mask, or further manipulate it after calling this function.Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
properly_sampled_gene
A mask of the “properly sampled” genes.
excluded_gene
A mask of the genes which were excluded (by name or due to not being properly sampled).
Computation Parameters
Invoke
metacells.tools.bursty_lonely.find_bursty_lonely_genes()
usingbursty_max_sampled_cells
(default: 10000),bursty_downsample_min_samples
(default: 750),bursty_downsample_min_cell_quantile
(default: 0.5),bursty_downsample_max_cell_quantile
(default: 0.05),bursty_min_gene_total
(default: 100),bursty_min_gene_normalized_variance
(default: 5.656854249492381),bursty_max_gene_similarity
(default: 0.1), andrandom_seed
. Any “bursty_lonely_genes” will be excluded.Invoke
metacells.tools.properly_sampled.find_properly_sampled_genes()
usingproperly_sampled_min_gene_total
(default: 1). Genes which are not properly sampled will be excluded.Invoke
metacells.tools.named.find_named_genes()
to also exclude genes based on their name, using theexcluded_gene_names
(default: None) andexcluded_gene_patterns
(default: None).
- metacells.pipeline.exclude.exclude_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, properly_sampled_min_cell_total: int | None, properly_sampled_max_cell_total: int | None, properly_sampled_max_excluded_genes_fraction: float | None, additional_cells_masks: List[str] | None = None) None [source]¶
Exclude a subset of the cells from the metacells computation.
You can also just manually set the
excluded_cell
mask, or further manipulate it after calling this function.Input
Annotated
adata
, where the observations are cells and the variables are genes. Optionally, may contain anexcluded_gene
mask of genes to be excluded from the metacells computation. That is, invoke this after callingexclude_genes()
(if you wish to exclude any genes).Returns
Sets the following in the full data:
- Observation (cell) annotations:
properly_sampled_cell
A mask of the “properly sampled” cells.
excluded_cell
A mask of the genes which were excluded (inverse of
properly_sampled_cell
).
Computation Parameters
Invoke
metacells.tools.properly_sampled.find_properly_sampled_cells()
usingproperly_sampled_min_cell_total
(no default),properly_sampled_max_cell_total
(no default) andproperly_sampled_max_excluded_genes_fraction
(no default).Exclude any cells which are not properly sampled (
|~properly_sampled_cell
), with optional additional followingadditional_cells_masks
(usingmetacells.tools.mask.combine_masks()
).
- metacells.pipeline.exclude.extract_clean_data(adata: AnnData, *, name: str | None = '.clean', top_level: bool = True) AnnData | None [source]¶
Extract a “clean” subset of the
adata
to compute metacells for.Input
Annotated
adata
, where the observations are cells and the variables are genes. This may contain per-geneexcluded_gene
and per-cellexcluded_cell
annotations.Returns
Annotated sliced data containing the “clean” subset of the original data (that is, everything not marked as excluded).
The returned data will have
full_cell_index
andfull_gene_index
per-observation (cell) and per-variable (gene) annotations to allow mapping the results back to the original data.Computation Parameters
This simply
metacells.tools.filter.filter_data()
to slice just the genes not in theexcluded_gene
mask and the cells not in theexcluded_cell
mask, data using thename
(default: .clean), and tracking the originalfull_cell_index
andfull_gene_index
.
Mark (Genes)¶
Even when working with “clean” data, computing metacells requires marking genes for special treatment, specifically “lateral” genes which should not be selected for computing metacells and “noisy” genes which should be neither selected as nor be used to detect deviant (outlier) cells in the metacells.
Proper choice of lateral and noisy genes is mandatory for high-quality metacells generation. However, this choice can’t be fully automated, and therefore relies on the analyst to manually provide the list of genes.
- metacells.pipeline.mark.mark_lateral_genes(adata: AnnData, *, lateral_gene_names: Collection[str] | None = None, lateral_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None [source]¶
Mark a subset of the genes as “lateral”, that is, prevent them from being selected for computing metacells.
Lateral genes are still used to detect outliers, that is, the cells in resulting metacells should still have similar expression level for such genes.
Depending on
op
, this will eitherset
(override/create) the mask,add
(to an existing mask), orremove
(from an existing mask).Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
lateral_gene
A mask of the “lateral” genes.
Computation Parameters
Invoke
metacells.tools.named.find_named_genes()
to also mark lateral genes based on their name, using thelateral_gene_names
(default: None),lateral_gene_patterns
(default: None) andop
(default: set).
- metacells.pipeline.mark.mark_noisy_genes(adata: AnnData, *, noisy_gene_names: Collection[str] | None = None, noisy_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None [source]¶
Mark a subset of the genes as “noisy”, that is, prevent them from being used for detecting deviant (outlier) cells.
Depending on
op
, this will eitherset
(override/create) the mask,add
(to an existing mask), orremove
(from an existing mask).Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
noisy_gene
A mask of the “noisy” genes.
Computation Parameters
Invoke
metacells.tools.named.find_named_genes()
to also mark noisy genes based on their name, using thenoisy_gene_names
(default: None),noisy_gene_patterns
(default: None) andop
(default: set).
- metacells.pipeline.mark.mark_select_genes(adata: AnnData, *, select_gene_names: Collection[str] | None = None, select_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None [source]¶
Mark a subset of the genes as “select”, that is, force them to be used when computing metacell.
If this is done, then it overrides the default gene selection mechanism, forcing the algorithm to use a fixed set of genes. In general, this will result in lower-quality metacells, especially when using the divide-and-conquer algorithm, so do not use this unless you really know what you are doing.
Depending on
op
, this will eitherset
(override/create) the mask,add
(to an existing mask), orremove
(from an existing mask).Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
select_gene
A mask of the “select” genes.
Computation Parameters
Invoke
metacells.tools.named.find_named_genes()
to select genes based on their name, using theselect_gene_names
(default: None),select_gene_patterns
(default: None) andop
(default: set).
- metacells.pipeline.mark.mark_ignored_genes(adata: AnnData, *, ignored_gene_names: Collection[str] | None = None, ignored_gene_patterns: Collection[str | Pattern] | None = None, ignored_gene_names_of_types: Dict[str, Collection[str]] | None = None, ignored_gene_patterns_of_types: Dict[str, Collection[str]] | None = None, op: str = 'set') None [source]¶
Mark a subset of the genes as “ignored”, that is, do not attempt to match them when projecting this (query) data onto an atlas.
Depending on
op
, this will eitherset
(override/create) the mask(s),add
(to an existing mask(s)), orremove
(from an existing mask(s)).Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
ignored_gene
A mask of the “ignored” genes for all the query metacells regardless of their type.
ignored_gene_of_<type>
A mask of the “ignored” genes for query metacells that are assigned a specific type.
Computation Parameters
Invoke
metacells.tools.named.find_named_genes()
to ignore genes based on their name, using theignored_gene_names
(default: None),ignored_gene_patterns
(default: None) andop
(default: set).Similarly for each type specified in
ignored_gene_names_of_types
and/orignored_gene_patterns_of_types
.
- metacells.pipeline.mark.mark_essential_genes(adata: AnnData, *, essential_gene_names_of_types: Dict[str, Collection[str]] | None = None, essential_gene_patterns_of_types: Dict[str, Collection[str]] | None = None, op: str = 'set') None [source]¶
Mark a subset of the genes as “essential”, that is, require that (most of them) will be projected “well” to accept a query metacell as having the same type as the atlas projection.
Depending on
op
, this will eitherset
(override/create) the mask(s),add
(to an existing mask(s)), orremove
(from an existing mask(s)).Input
Annotated
adata
, where the observations are cells and the variables are genes.Returns
Sets the following in the data:
- Variable (gene) annotations:
essential_gene_of_<type>
A mask of the “essential” genes for query metacells to be assigned a specific type.
Computation Parameters
Invoke
metacells.tools.named.find_named_genes()
to ignore genes based on their name, using theessential_gene_names_of_types
(default: None),essential_gene_patterns_of_types
(default: None) andop
(default: set).
Selection¶
- metacells.pipeline.select.extract_selected_data(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, name: str | None = '.select', downsample_min_samples: float = 750, downsample_min_cell_quantile: float = 0.05, downsample_max_cell_quantile: float = 0.5, min_gene_relative_variance: float | None, min_genes: int = 100, min_gene_total: int | None = 50, min_gene_top3: int | None = 4, additional_gene_masks: List[str] = ['&~lateral_gene'], random_seed: int, top_level: bool = True) AnnData [source]¶
Select a subset of
what
(default: __x__ data, to compute metacells by.When computing metacells (or clustering cells in general), it makes sense to use a subset of the genes for computing cell-cell similarity, for both technical (e.g., too low an expression level) and biological (e.g., ignoring bookkeeping and cell cycle genes) reasons. The steps provided here are expected to be generically useful, but as always specific data sets may require custom gene selection steps on a case-by-case basis.
Input
A presumably “clean” Annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Will obey the following annotations in the full
adata
, if they exist:Variable (Gene) Annotations
select_gene
If exists, force a mask of genes to use as “select” genes, ignoring everything else.
lateral_gene
A boolean mask of genes which are lateral from being chosen as “select” genes based on their name.
Returns
Returns annotated sliced data containing the “select” subset of the original data. By default, the
name
of this data is .select. If no selects were selected, returnNone
.Also sets the following annotations in the full
adata
:- Unstructured Annotations
downsample_samples
The target total number of samples in each downsampled cell.
- Observation-Variable (Cell-Gene) Annotations:
downsampled
The downsampled data where the total number of samples in each cell is at most
downsample_samples
.
- Variable (Gene) Annotations
high_total_gene
A boolean mask of genes with “high” expression level (unless a
select_gene
mask exists).high_relative_variance_gene
A boolean mask of genes with “high” normalized variance, relative to other genes with a similar expression level (unless a
select_gene
mask exists).selected_gene
A boolean mask of the actually selected genes.
Computation Parameters
If a
select_gene
mask exists, just use these genes and go directly to the last step 6.Invoke
metacells.tools.downsample.downsample_cells()
to downsample the cells to the same total number of UMIs, using thedownsample_min_samples
(default: 750),downsample_min_cell_quantile
(default: 0.05),downsample_max_cell_quantile
(default: 0.5) and therandom_seed
(non-zero for reproducible results).Invoke
metacells.tools.high.find_high_total_genes()
to select high-expression genes (based on the downsampled data), usingmin_gene_total
.Invoke
metacells.tools.high.find_high_relative_variance_genes()
to select high-variance genes (based on the downsampled data), usingmin_gene_relative_variance
.Compute the set of genes that pass the above test, as well as match the
additional_gene_masks
(default: [’&~lateral_gene’]).If we found less than
min_genes
genes (default: 100, andmin_gene_relative_variance
was specified, try to achieve the required minimal number of genes by reducing themin_gene_relative_variance
. In extreme cases, give up on the relative variance requirement altogether.Invoke
metacells.tools.filter.filter_data()
to slice just the selected genes.
Direct¶
- metacells.pipeline.direct.compute_direct_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, knn_k: int | None = None, knn_k_size_factor: float = 2, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_min_gene_fold_factor: float = 3.0, deviants_gap_skip_cells: int = 1, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float = 0.5, dissolve_min_convincing_gene_fold_factor: float | None = 3.0, random_seed: int) None [source]¶
Directly compute metacells using
what
(default: __x__) data.This directly computes the metacells on the whole data. Like any method that directly looks at the whole data at once, the amount of CPU and memory needed becomes unreasonable when the data size grows. Above O(10,000) you are much better off using the divide-and-conquer method.
Note
The current implementation is naive in that it computes the full dense N^2 correlation matrix, and only then extracts the sparse graph out of it. We actually need two copies where each requires 4 bytes per entry, so for O(100,000) cells, we have storage of O(100,000,000,000). In addition, the implementation is serial for the graph clustering phases.
It is possible to mitigate this by fusing the correlations phase and the graph generation phase, parallelizing the result, and also (somehow) parallelizing the graph clustering phase. This might increase the “reasonable” size for the direct approach to O(100,000).
We have decided not to invest in this direction since it won’t allow us to push the size to O(1,000,000) and above. Instead we provide the divide-and-conquer method, which easily scales to O(1,000,000) on a single multi-core server, and to “unlimited” size if we further enhance the implementation to use a distributed compute cluster of such servers.
Input
The presumably “clean” annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Returns
Sets the following annotations in
adata
:- Unstructured Annotations
downsample_samples
The target total number of samples in each downsampled cell.
- Observation-Variable (Cell-Gene) Annotations:
downsampled
The downsampled data where the total number of samples in each cell is at most
downsample_samples
.
- Variable (Gene) Annotations
high_total_gene
A boolean mask of genes with “high” expression level (unless a
select_gene
mask exists).high_relative_variance_gene
A boolean mask of genes with “high” normalized variance, relative to other genes with a similar expression level (unless a
select_gene
mask exists).selected_gene
A boolean mask of the actually selected genes.
- Observation (Cell) Annotations
metacell
The integer index of the metacell each cell belongs to. The metacells are in no particular order. Cells with no metacell assignment (“outliers”) are given a metacell index of
-1
.
Computation Parameters
If
cell_umis
is not specified, use the sum of thewhat
data for each cell.Invoke
metacells.pipeline.select.extract_selected_data()
to extract “select” data from the clean data, using theselect_downsample_min_samples
(default: 750),select_downsample_min_cell_quantile
(default: 0.05),select_downsample_max_cell_quantile
(default: 0.5),select_min_gene_total
(default: 50),select_min_gene_top3
(default: 4),select_min_gene_relative_variance
(default: 0.1) andrandom_seed
.Compute the fractions of each variable in each cell, and add the
cells_similarity_value_regularization
(default: 0.14285714285714285) to it.If
cells_similarity_log_data
(default: True), invoke themetacells.utilities.computation.log_data()
function to compute the log (base 2) of the data.Invoke
metacells.tools.similarity.compute_obs_obs_similarity()
to compute the similarity between each pair of cells, using thecells_similarity_method
(default: abs_pearson).Invoke
metacells.tools.knn_graph.compute_obs_obs_knn_graph()
to compute a K-Nearest-Neighbors graph, using theknn_balanced_ranks_factor
(default: 3.1622776601683795),knn_incoming_degree_factor
(default: 3.0)knn_outgoing_degree_factor
(default: 1.0) andknn_min_outgoing_degree
(default: 2)(. Ifknn_k
(default: None) is not specified, then it is chosen to be theknn_k_size_factor
(default: 2 times the median number of cells required to reach the target metacell size, or theknn_k_umis_quantile
(default: 0.1) the number of cells required, ormin_knn_k
(default: 30), whichever is largest.Invoke
metacells.tools.candidates.compute_candidate_metacells()
to compute the candidate metacells, using themin_seed_size_quantile
(default: 0.85),max_seed_size_quantile
(default: 0.95),candidates_cooldown_pass
(default: 0.02),candidates_cooldown_node
(default: 0.25),candidates_cooldown_phase
(default: 0.75),candidates_min_split_size_factor
(default: 2.0),candidates_max_merge_size_factor
(default: 0.5), andrandom_seed
. This tries to build metacells of thetarget_metacell_size
(default: 48),min_metacell_size
(default: 12), andtarget_metacell_umis
(default: 160000) using thecell_umis
(default: None).Unless
must_complete_cover
(default: False), invokemetacells.tools.deviants.find_deviant_cells()
to remove deviants from the candidate metacells, using thedeviants_policy
(default: gaps),deviants_min_gene_fold_factor
(default: 3.0),deviants_gap_skip_cells
(default: 1),deviants_min_noisy_gene_fold_factor
(default: 2.0),deviants_max_gene_fraction
(default: 0.03) anddeviants_max_cell_fraction
(default: 0.25).
Unless
must_complete_cover
(default: False), invokemetacells.tools.dissolve.dissolve_metacells()
to dissolve small unconvincing metacells, using the sametarget_metacell_size
(default: 48),min_metacell_size
(default: 12),target_metacell_umis
(default: 160000),cell_umis
(default: None) and thedissolve_min_robust_size_factor
(default: 0.5).
Divide and Conquer¶
- metacells.pipeline.divide_and_conquer.set_max_parallel_piles(max_parallel_piles: int) None [source]¶
Set the (maximal) number of piles to compute in parallel.
By default, we use all the available hardware threads. Override this by setting the
METACELLS_MAX_PARALLEL_PILES
environment variable or by invoking this function from the main thread.A value of
0
will use all the available processors (seemetacells.utilities.parallel.set_processors_count()
). Otherwise, the value is the positive maximal number of processors to use in parallel for computing piles.It may be useful to restrict the number of parallel piles to restrict the total amount of memory used by the application, to keep it within the physical RAM available.
- metacells.pipeline.divide_and_conquer.get_max_parallel_piles() int [source]¶
Return the maximal number of piles to compute in parallel.
- metacells.pipeline.divide_and_conquer.guess_max_parallel_piles(cells: AnnData, what: str = '__x__', *, max_gbs: float = -0.1, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100) int [source]¶
Try and guess a reasonable maximal number of piles to use for computing metacells for the specified
cells
using at mostmax_gbs
of memory (default: -0.1, that is 90% of all the machine has - if zero or negative, is relative to the machines memory).The amount of memory used depends on the target pile size, so give this function the same parameters as for
metacells.pipeline.divide_and_conquer.divide_and_conquer_pipeline()
so it will use the same automatically adjusted target pile size.Note
This is only a best-effort guess. A too-low number would slow down the computation by using less processors than it could. A too-high number might cause the computation to crash, running out of memory. So: use with care, YMMV, keep an eye on memory usage and other applications running in parallel to the computation, and apply common sense.
Computation Parameters
Figure out the target pile size by invoking
metacells.pipeline.divide_and_conquer.compute_target_pile_size()
using thetarget_metacell_umis
(default: 160000),cell_umis
(default: None),target_metacell_size
(default: 48),min_target_pile_size
(default: 8000),max_target_pile_size
(default: 24000) andtarget_metacells_in_pile
(default: 100).Use
psutil.virtual_memory
to query the amount of memory in the system, and applymax_gbs
(default: -0.1) off the top (e.g.,-0.1
means reduce it by 10%).Guesstimate the number of piles that, if computed in parallel, will consume this amount of memory.
- metacells.pipeline.divide_and_conquer.compute_target_pile_size(adata: AnnData, what: str = '__x__', *, cell_umis: ndarray | None = None, target_metacell_size: int = 48, target_metacell_umis: int = 160000, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100) int [source]¶
Compute how many cells should be in each divide-and-conquer pile.
Larger piles are slower to process (O(N^2)) than smaller piles, but (up to a point) they allow the algorithm to produce better results. The ideal pile size is just big enough to allow for a sufficient number of metacells to be created; specifically, we’d like the number of metacells in each pile to be roughly the same as the number of cells in a metacell, which is natural for the divide and conquer algorithm.
Note that the target pile size is just a target. It is directly used in the 1st phase of the divide-and-conquer algorithm, but the pile sizes of the 2nd phase are determined by clustering metacells together which inevitably cause the actual pile sizes to vary around this number.
Computation Parameters
If
cell_umis
is not specified, use the sum of thewhat
data for each cell.The natural pile size is the product of the
target_metacell_size
(default: 48) and thetarget_metacells_in_pile
(default: 100).Also consider the mean number of cells needed to reach the
target_metacell_umis
(default: 160000); this times thetarget_metacells_in_pile
gives us another pile size, and if this one is larger than what we computed in the previous step, we use it instead.Clamp this value to be no lower than
min_target_pile_size
(default: 8000) and no higher thanmax_target_pile_size
(default: 24000). That is, if you set both to the same value, this will force the target pile size value.
- metacells.pipeline.divide_and_conquer.divide_and_conquer_pipeline(adata: AnnData, what: str = '__x__', *, rare_max_genes: int = 500, rare_max_gene_cell_fraction: float = 0.001, rare_min_gene_maximum: int = 7, rare_genes_similarity_method: str = 'repeated_pearson', rare_genes_cluster_method: str = 'ward', rare_min_genes_of_modules: int = 4, rare_min_cells_of_modules: int = 12, rare_min_module_correlation: float = 0.1, rare_min_related_gene_fold_factor: float = 7, rare_max_related_gene_increase_factor: float = 4.0, rare_min_cell_module_total: int = 4, rare_max_cells_factor_of_random_pile: float = 0.5, rare_deviants_max_cell_fraction: float | None = 0.25, rare_dissolve_min_robust_size_factor: float | None = 0.5, rare_dissolve_min_convincing_gene_fold_factor: float = 3.0, quick_and_dirty: bool = False, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', groups_similarity_log_data: bool = True, groups_similarity_method: str = 'abs_pearson', target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacells_in_pile: int = 100, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, piles_knn_k_size_factor: float = 3, piles_min_split_size_factor: float = 1.5, piles_min_robust_size_factor: float = 0.3, piles_max_merge_size_factor: float = 0.15, knn_k: int | None = None, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_knn_k_size_factor: float = 2, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float | None = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_gap_skip_cells: int = 1, deviants_min_gene_fold_factor: float = 3.0, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float | None = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float | None = 0.5, dissolve_min_convincing_gene_fold_factor: float = 3.0, random_seed: int) None [source]¶
Complete pipeline using divide-and-conquer to compute the metacells for the
what
(default: __x__) data.If a
progress_bar
is active, progress will be reported into the current (slice of) the progress bar. The detection of rare gene modules is not counted in the progress bar; instead it is logged at theINFO
level.Note
This is applicable to “any size” data. If the data is “small” (O(10,000)), it will revert to using the direct metacell computation (but will still by default first look for rare gene modules). If the data is “large” (up to O(10,000,000)), this will be much faster and will require much less memory than using the direct approach. The current implementation is not optimized for “huge” data (O(1,000,000,000)) - that is, it will work, and keep will use a limited amount of memory, but a faster implementation would distribute the computation across multiple servers.
Input
The presumably “clean” annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Returns
Sets the following annotations in
adata
:Everything set by
compute_divide_and_conquer_metacells
; and in addition:- Variable (Gene) Annotations
rare_gene_module_<N>
A boolean mask for the genes in the rare gene module
N
.rare_gene
A boolean mask for the genes in any of the rare gene modules.
- Observations (Cell) Annotations
cells_rare_gene_module
The index of the rare gene module each cell expresses the most, or
-1
in the common case it does not express any rare genes module.rare_cell
A boolean mask for the (few) cells that express a rare gene module.
Computation Parameters
If
cell_umis
is not specified, use the sum of thewhat
data for each cell.Invoke
metacells.pipeline.divide_and_conquer.compute_target_pile_size()
usingtarget_metacell_umis
(default: 160000),cell_umis
(default: None),target_metacell_size
(default: 48),min_target_pile_size
(default: 8000),max_target_pile_size
(default: 24000) andtarget_metacells_in_pile
(default: 100).Invoke
metacells.tools.rare.find_rare_gene_modules()
to isolate cells expressing rare gene modules, using therare_max_genes
(default: 500),rare_max_gene_cell_fraction
(default: 0.001),rare_min_gene_maximum
(default: 7),rare_genes_similarity_method
(default: repeated_pearson),rare_genes_cluster_method
(default: ward),rare_min_genes_of_modules
(default: 4),rare_min_cells_of_modules
(default: 12),rare_min_module_correlation
(default: 0.1),rare_min_related_gene_fold_factor
(default: 7)rare_max_related_gene_increase_factor
(default: 4.0)rare_max_cells_factor_of_random_pile
(default: 0.5) andrare_min_cell_module_total
(default: 4). Use a non-zerorandom_seed
to make this reproducible.For each detected rare gene module, collect all cells that express the module, and apply
metacells.pipeline.direct.compute_direct_metacells()
to them.Collect all the outliers from the above together with the rest of the cells (which express no rare gene module) and apply the
compute_divide_and_conquer_metacells()
to the combined result. The annotations for rare-but-outlier cells (e.g.cell_directs
) will not reflect the work done when computing the rare gene module metacells which they were rejected from.
- metacells.pipeline.divide_and_conquer.compute_divide_and_conquer_metacells(adata: AnnData, what: str = '__x__', *, quick_and_dirty: bool = False, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', groups_similarity_log_data: bool = True, groups_similarity_method: str = 'abs_pearson', target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_metacell_size: int = 12, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100, piles_knn_k_size_factor: float = 3, piles_min_split_size_factor: float = 1.5, piles_min_robust_size_factor: float = 0.3, piles_max_merge_size_factor: float = 0.15, knn_k: int | None = None, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_knn_k_size_factor: float = 2, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float | None = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_gap_skip_cells: int = 1, deviants_min_gene_fold_factor: float = 3.0, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float | None = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float | None = 0.5, dissolve_min_convincing_gene_fold_factor: float = 3.0, random_seed: int, _fdata: AnnData | None = None, _counts: Dict[str, int] | None = None) None [source]¶
Compute metacells for
what
(default: {what}) data using the divide-and-conquer method.This divides large data to smaller “piles” and directly computes metacells for each, then generates new piles of “similar” metacells and directly computes the final metacells from each such pile. Due to this divide-and-conquer approach, the total amount of memory required is bounded (by the pile size), and the amount of total CPU grows slowly with the problem size, allowing this method to be applied to very large data (millions of cells).
If a progress bar is active, progress will be reported into the current (slice of) the progress bar.
Input
The presumably “clean” annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Returns
Sets the following annotations in
adata
:- Observations (Cell) Annotations
metacell
The integer index of the final metacell each cell belongs to. The metacells are in no particular order. This is
-1
for cells not included in any metacell (outliers, excluded).dissolve
A boolean mask of the cells which were in a dissolved metacell (in the last pile they were in).
Computation Parameters
If
cell_umis
is not specified, use the sum of thewhat
data for each cell.Invoke
metacells.pipeline.divide_and_conquer.compute_target_pile_size()
usingtarget_metacell_umis
(default: {target_metacell_umis}),cell_umis
(default: {cell_umis}),target_metacell_size
(default: {target_metacell_size}),min_target_pile_size
(default: {min_target_pile_size}),max_target_pile_size
(default: {max_target_pile_size}) andtarget_metacells_in_pile
(default: {target_metacells_in_pile}).If the data is smaller than the target_pile_size times the
piles_min_split_size_factor
(default: {piles_min_split_size_factor}), then just invokemetacells.pipeline.direct.compute_direct_metacells()
using the parameters. Otherwise, perform the following steps.Phase 0 (preliminary): group the cells randomly into equal-sized piles of roughly the
target_pile_size
using therandom_seed
(default: {random_seed}) to allow making this replicable. Otherwise, use the groups of metacells computed in the preliminary phase to create the piles of the final phase.Compute metacells using the piles by invoking
metacells.pipeline.direct.compute_direct_metacells()
, possibly in parallel.Collect the outliers from all the piles and recursively compute metacells for them. Continue to compute metacells for the outliers-of-outliers, etc. until all the outliers fit in a single pile; then compute metacells for this outlier-most cells with
must_complete_cover=True
.Phase 0 to 1: Recursively run the algorithm grouping the preliminary metacells into groups-of-metacells, using
piles_knn_k_size_factor
(default: {piles_knn_k_size_factor}),piles_min_split_size_factor
(default: {piles_min_split_size_factor}),piles_min_robust_size_factor
(default: {piles_min_robust_size_factor}), andpiles_max_merge_size_factor
(default: {piles_max_merge_size_factor}).Phase 1 (final): group the cells into piles based on the groups-of-metacells. Compute metacells using the piles, by invoking
metacells.pipeline.direct.compute_direct_metacells()
, possibly in parallel.Collect the outliers from all the piles and compute metacells for them (possibly in parallel). If
quick_and_dirty
, stop. Otherwise, continue to group all the cells in metacells by grouping the outliers-of-outliers similarly to step 5.Phase 1 to 2: Recursively run the algorithm grouping the outlier metacells into groups-of-metacells, using
piles_knn_k_size_factor
(default: {piles_knn_k_size_factor}),piles_min_split_size_factor
(default: {piles_min_split_size_factor}),piles_min_robust_size_factor
(default: {piles_min_robust_size_factor}), andpiles_max_merge_size_factor
(default: {piles_max_merge_size_factor}).Phase 2 (outliers): Group the cells into piles based on the groups-of-metacells. Compute metacells using the piles, by invoking
metacells.pipeline.direct.compute_direct_metacells()
, possibly in parallel. Stop here and accept all outliers as final outliers (that is, there are no level 2 metacells).
Collect¶
- metacells.pipeline.collect.collect_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, metacell_geo_mean: bool = True, metacell_umis_regularization: float = 0.0625, zeros_cell_size_quantile: float = 0.1, groups: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', name: str = 'metacells', prefix: str | None = 'M', top_level: bool = True, _metacell_groups: bool = False, random_seed: int) AnnData [source]¶
Collect computed metacells
what
(default: {what}) data.Input
Annotated (presumably “clean”)
adata
, where the observations are cells and the variables are genes, and wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.Returns
Annotated metacell data containing for each observation (metacell) for each variable (gene) the fraction of the gene in the metacell, and the following annotations:
- Unstructured (Scalar)
outliers
The number of outlier cells (which were not assigned to any metacell).
metacells_algorithm
The version of the algorithm generating the metacell (based on the package version number).
- Variable (Gene) Annotations
*
Every per-gene annotations of
adata
is copied to the result.
- Observations (Metacell) Annotations
grouped
The number of (“clean”) cells grouped into each metacell.
total_umis
The total of all the UMIs of all the genes of all the cells grouped into the metacell.
- Observations-Variables (Metacell-Gene) Annotations
total_umis
The total of all the UMIs of each genes in all the cells grouped into the metacell.
Also sets in the full
adata
:- Observations (Cell) Annotations
metacell_name
The string name of the metacell, which is
prefix
(default: {prefix}) followed by the metacell index, followed by.checksum
where the checksum is a 2 digits, reflecting the (names of the) set of cells grouped into each metacell. This keeps metacell names short, and provides protection against assigning per-metacell annotations from a different metacell computation to the new results (unless they are identical).
Computation Parameters
For each metacell:
If
metacell_geo_mean
(default: {metacell_geo_mean}),Compute the fraction of each gene out of each cell grouped into the metacell.
For each cell, add to the fractions the
metacell_umis_regularization
divided by the total UMIs of the cell.For each gene, compute the weighted geomean of these fractions across all the cells, where the weight of each cell is the log of its total number of UMIs, and
Subtract the geomean of the per-cell regularization so all-zero genes would have a zero fraction.
Otherwise, for each metacell, sum the total UMIs of each gene across all cells, and divide it by the total UMIs of all genes in all cells.
Normalize the per-gene fractions so their sum would be 1.0 in the metacell.
Compute the
zeros_cell_size_quantile
of the total UMIs of the cells to pick the number of UMIs to use for zeros computation.For each gene in each cell, compute the probability it would be zero if we downsampled the cell to have this number of UMIs.
Use these probabilities to decide whether the each gene is “effectively zero” in each cell.
For each gene, count the number of cells in which it is “effectively zero”.
In addition:
Pass all relevant per-gene annotations from
adata
to the result.Set the
metacell_name
property of each cell inadata
to the name of the group (metacell) it belongs to. Cells which do not belong to any metacell are assigned the metacell nameOutliers
.
UMAP¶
It is useful to project the metacells data to a 2D scatter plot (where each point is a metacell). The steps provided here are expected to yield a reasonable such projection, though as always you might need to tweak the parameters or even the overall flow for specific data sets.
- metacells.pipeline.umap.compute_knn_by_markers(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, marker_gene_names: Collection[str] | None = None, marker_gene_patterns: Collection[str | Pattern] | None = None, max_marker_genes: int | None = 1000, ignore_lateral_genes: bool = True, similarity_value_regularization: float = 1e-05, similarity_log_data: bool = True, similarity_method: str = 'logistics_abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 1, reproducible: bool) PandasFrame [source]¶
Compute KNN graph between metacells based on marker genes.
If
reproducible
isTrue
, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.Input
Annotated
adata
where each observation is a metacells and the variables are genes, are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. Should contain amarker_gene
mask unless explicitly specifying the marker genes.Returns
Sets the following in
adata
:- Observations-Pair (Metacells) Annotations
obs_outgoing_weights
A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).
Also return a pandas data frame of the similarities between the observations (metacells).
Computation Parameters
If
marker_gene_names
and/ormarker_gene_patterns
were specified, use the matching genes. Otherwise, use themarker_gene
mask.If
ignore_lateral_genes
(default: True), then remove any genes marked as lateral from the mask.If
max_marker_genes
(default: 1000) is notNone
, then pick this number of marker genes with the highest variance.Compute the fractions of each
marker_gene
in each cell, and add thesimilarity_value_regularization
(default: 1e-05) to it.If
similarity_log_data
(default: True), invoke themetacells.utilities.computation.log_data()
function to compute the log (base 2) of the data.Invoke
metacells.tools.similarity.compute_obs_obs_similarity()
usingsimilarity_method
(default: logistics_abs_pearson),logistics_location
(default: 0.5) andlogistics_slope
(default: 0.5) and convert this to distances.Invoke
metacells.tools.knn_graph.compute_obs_obs_knn_graph()
using the distances,k
(no default!),balanced_ranks_factor
(default: 3.1622776601683795),incoming_degree_factor
(default: 3.0),outgoing_degree_factor
(default: 1.0) andmin_outgoing_degree
(default: 1) to compute a “skeleton” graph to overlay on top of the UMAP graph.
- metacells.pipeline.umap.compute_umap_by_markers(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, marker_gene_names: Collection[str] | None = None, marker_gene_patterns: Collection[str | Pattern] | None = None, max_marker_genes: int | None = 1000, ignore_lateral_genes: bool = True, similarity_value_regularization: float = 1e-05, similarity_log_data: bool = True, similarity_method: str = 'logistics_abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, skeleton_k: int = 4, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 1, umap_k: int = 15, dimensions: int = 2, min_dist: float = 0.5, spread: float = 1.0, random_seed: int) None [source]¶
Compute a UMAP projection of the (meta)cells.
Input
Annotated
adata
where each observation is a metacells and the variables are genes, are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. Should contain amarker_gene
mask unless explicitly specifying the marker genes.Returns
Sets the following annotations in
adata
:- Observation-Observation (Metacell-Metacell) Annotations
umap_distances
A sparse symmetric matrix of the graph of distances between the metacells.
- Observation (Metacell) Annotations
x
,y
(ifdimensions
is 2)The X and Y coordinates of each metacell in the UMAP projection.
u
,v
,w
(ifdimensions
is 3)The U, V, W coordinates of each metacell in the UMAP projection.
Computation Parameters
Invoke
metacells.pipeline.umap.compute_knn_by_markers()
usingmarker_gene_names
(default: None),marker_gene_patterns
(default: None),max_marker_genes
(default: 1000),ignore_lateral_genes
(default: True),similarity_value_regularization
(default: 1e-05),similarity_log_data
(default: True),similarity_method
(default: logistics_abs_pearson),logistics_location
(default: 0.8),logistics_slope
(default: 0.5),skeleton_k
(default: 4),balanced_ranks_factor
(default: 3.1622776601683795),incoming_degree_factor
(default: 3.0) andoutgoing_degree_factor
(default: 1.0) andmin_outgoing_degree
(default: 1) to compute a “skeleton” graph to overlay on top of the UMAP graph. Specify a non-zerorandom_seed
to make this reproducible.Invoke
metacells.tools.layout.umap_by_distances()
using the distances,umap_k
(default: 15),min_dist
(default: 0.5),spread
(default: 1.0), dimensions (default: 2) andrandom_seed
.
Note
Keep in mind that the KNN graph used by UMAP (controlled by
umap_k
) is not identical to the KNN graph we compute (controlled byskeleton_k
). By default, we chooseskeleton_k < umap_k
, as the purpose of the skeleton KNN is to highlight the “strong” structure of the data; in practice this strong skeleton is highly compatible with the structure used by UMAP, so it serves it purpose reasonably well. It would have been nice to make these compatible, but UMAP is not friendly towards dictating a KNN graph from the outside.
Projection¶
- metacells.pipeline.projection.projection_pipeline(what: str = '__x__', *, adata: AnnData, qdata: AnnData, only_atlas_marker_genes: bool = True, only_query_marker_genes: bool = False, ignore_atlas_lateral_genes: bool = True, ignore_query_lateral_genes: bool = True, consider_atlas_noisy_genes: bool = True, consider_query_noisy_genes: bool = True, misfit_min_metacells_fraction: float = 0.5, project_log_data: bool = True, project_fold_regularization: float = 1e-05, project_candidates_count: int = 50, project_min_candidates_fraction: float = 0.3333333333333333, project_min_significant_gene_umis: int = 40, project_min_usage_weight: float = 1e-05, project_filter_ranges: bool = True, project_ignore_range_quantile: float = 0.02, project_ignore_range_min_overlap_fraction: float = 0.5, project_min_query_markers_fraction: float = 0.3333333333333333, project_max_consistency_fold_factor: float = 2.0, project_max_projection_fold_factor: float = 3.0, project_max_projection_noisy_fold_factor: float = 2.0, project_max_misfit_genes: int = 3, project_min_essential_genes_fraction: float | None = 0.75, atlas_type_property_name: str = 'type', project_corrections: bool = False, project_min_corrected_gene_correlation: float = 0.8, project_min_corrected_gene_factor: float = 0.15, reproducible: bool, top_level_parallel: bool = True) CompressedMatrix [source]¶
Complete pipeline for projecting query metacells onto an atlas of metacells for the
what
(default: __x__) data.Input
Annotated query
qdata
and atlasadata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix, containing the fraction of each gene in each cell. The atlas should also contain atype
per-observation (metacell) annotation.The
qdata
may include per-gene masks,ignored_gene
andignored_gene_of_<type>
, which force the code below to ignore the marked genes from either the preliminary projection or the following refined type-specific projections.Returns
A matrix whose rows are query metacells and columns are atlas metacells, where each entry is the weight of the atlas metacell in the projection of the query metacells. The sum of weights in each row (that is, for a single query metacell) is 1. The weighted sum of the atlas metacells using these weights is the “projected” image of the query metacell onto the atlas.
- Variable (Gene) Annotations
atlas_gene
A mask of the query genes that also exist in the atlas. We match genes by their name; if projecting query data from a different technology, we expect the caller to modify the query gene names to match the atlas before projecting it.
atlas_lateral_gene
,atlas_noisy_gene
,atlas_marker_gene
,essential_gene_of_<type>
Copied from the atlas to the query (
False
for non-atlas_gene
).projected_noisy_gene
The mask of the genes that were considered “noisy” when computing the projection. By default this is the union of the noisy atlas and query genes.
correction_factor
(ifproject_corrections
)If projecting a query on an atlas with different technologies (e.g., 10X v3 to 10X v2), an automatically computed factor we multiplied the query gene fractions by to compensate for the systematic difference between the technologies (1.0 for uncorrected genes and 0.0 for non-
atlas_gene
).fitted_gene_of_<type>
For each type, the genes that were projected well from the query to the atlas for most cells of that type; any
atlas_gene
outside this mask failed to project well from the query to the atlas for most metacells of this type. For non-atlas_gene
this is set toFalse
.misfit_gene_of_<type>
For each query metacell type, a boolean mask indicating whether the gene has a strong bias in the query metacells of this type compared to the atlas metacells of this type.
ignored_gene_of_<type>
For each query metacell type, a boolean mask indicating whether the gene was ignored by the projection (for any reason) when computing the projection for metacells of this type.
- Observation (Cell) Annotations
total_atlas_umis
The total UMIs of the
atlas_gene
in each query metacell. This is used in the analysis as described fortotal_umis
above, that is, to ensure comparing expression levels will ignore cases where the total number of UMIs of both compared gene profiles is too low to make a reliable determination. In such cases we take the fold factor to be 0.projected_type
For each query metacell, the best atlas
type
we can assign to it based on its projection. Note this does not indicate that the query metacell is “truly” of this type; to make this determination one needs to look at the quality control data below.projected_secondary_type
In some cases, a query metacell may fail to project well to a single region of the atlas, but does project well to a combination of two distinct atlas regions. This may be due to the query metacell containing doublets, of a mixture of cells which match different atlas regions (e.g. due to sparsity of data in the query data set). Either way, if this happens, we place here the type that best describes the secondary region the query metacell was projected to; otherwise this would be the empty string. Note that the
weights
matrix above does not distinguish between the regions.projected_correlation
per query metacellThe correlation between between the
corrected_fraction
and theprojected_fraction
for only thefitted_gene
expression levels of each query metacell. This serves as a very rough estimator for the quality of the projection for this query metacell (e.g. can be used to compute R^2 values).In general we expect high correlation (more than 0.9 in most metacells) since we restricted the
fitted_gene
mask only to genes we projected well.similar
mask per query metacellA conservative determination of whether the query metacell is “similar” to its projection on the atlas. This is based on whether the number of
misfit_gene
for the query metacell is low enough (by default, up to 3 genes), and also that at least 75% of theessential_gene
of the query metacell were notmisfit_gene
. Note that this explicitly allows for aprojected_secondary_type
, that is, a metacell of doublets will be “similar” to the atlas, but a metacell of a novel state missing from the atlas will be “dissimilar”.The final determination of whether to accept the projection is, as always, up to the analyst, based on prior biological knowledge, the context of the collection of the query (and atlas) data sets, etc. The analyst need not (indeed, should not) blindly accept the
similar
determination without examining the rest of the quality control data listed above.
- Observation-Variable (Cell-Gene) Annotations
corrected_fraction
per gene per query metacellFor each
atlas_gene
, its fraction in each query metacell, out of only the atlas genes. This may be further corrected (see below) if projecting between different scRNA-seq technologies (e.g. 10X v2 and 10X v3). For non-atlas_gene
this is 0.projected_fraction
per gene per query metacellFor each
atlas_gene
, its fraction in its projection on the atlas. This projection is computed as a weighted average of some atlas metacells (see below), which are all sufficiently close to each other (in terms of gene expression), so averaging them is reasonable to capture the fact the query metacell may be along some position on some gradient that isn’t an exact match for any specific atlas metacell. For non-atlas_gene
this is 0.fitted
mask per gene per query metacellFor each
atlas_gene
for each query metacell, whether the gene was expected to be projected well, based on the query metacellprojected_type
(and theprojected_secondary_type
, if any). For non-atlas_gene
this is set toFalse
. This does not guarantee the gene was actually projected well.misfit
For each
atlas_gene
for each query metacell, whether thecorrected_fraction
of the gene was significantly different from theprojected_fractions
(that is, whether the gene was not projected well for this metacell). For non-atlas_gene
this is set toFalse
, to make it easier to identify problematic genes.This is expected to be rare for
fitted_gene
and common for the rest of theatlas_gene
. If too manyfitted_gene
are alsomisfit_gene
, then one should be suspicious whether the query metacell is “truly” of theprojected_type
.essential
Which of the
atlas_gene
were also listed in theessential_gene_of_<type>
for theprojected_type
(and also theprojected_secondary_type
, if any) of each query metacell.If an
essential_gene
is also amisfit_gene
, then one should be very suspicious whether the query metacell is “truly” of theprojected_type
.projected_fold
per gene per query metacellThe fold factor between the
corrected_fraction
and theprojected_fraction
(0 for non-atlas_gene
). If the absolute value of this is high (3 for 8x ratio) then the gene was not projected well for this metacell. This will be 0 for non-atlas_gene
.It is expected this would have low values for most
fitted_gene
and high values for the rest of theatlas_gene
, but specific values will vary from one query metacell to another. This allows the analyst to make fine-grained determination about the quality of the projection, and/or identify quantitative differences between the query and the atlas (e.g., when studying perturbed systems such as knockouts or disease models).
Computation Parameters
Find the subset of genes that exist in both the query and the atlas. All computations will be done on this common subset. Normalize the fractions of the fitted gene fractions to sum to 1 in each metacell.
Compute preliminary projection:
Compute a mask of fitted genes, ignoring any genes included in
ignored_gene
. Ifonly_atlas_marker_genes
(default:only_atlas_marker_genes
), ignore any non-marker_gene
of the atlas. Ifonly_query_marker_genes
(default:only_query_marker_genes
), ignore any non-marker_gene
of the query. Ifignore_atlas_lateral_genes
(default: True), ignore thelateral_gene
of the atlas. Ifignore_query_lateral_genes
(default: True), ignore thelateral_gene
of the atlas. Normalize the fractions of the fitted gene fractions to sum to 1 in each metacell.Invoke
metacells.tools.project.compute_projection_weights()
to project each query metacell onto the atlas, using theproject_log_data
(default: True),project_fold_regularization
(default: 1e-05),project_min_significant_gene_umis
(default: 40),project_max_consistency_fold_factor
(default: 2.0),project_candidates_count
(default: 50),project_min_usage_weight
(default: 1e-05), andreproducible
.If
project_corrections
(default: False: Correlate the expression levels of each gene between the query and projection. If this is at leastproject_min_corrected_gene_correlation
(default: 0.8), compute the ratio between the mean expression of the gene in the projection and the query. If this is at most 1/(1+``project_min_corrected_gene_factor``) or at least (1+``project_min_corrected_gene_factor``) (default: 0.15), then multiply the gene’s value by this factor so its level would match the atlas. As usual, ignore genes which do not have at leastproject_min_significant_gene_umis
. If any genes were corrected, then repeat steps 2-3 (but do these steps no more than 3 times).If
project_filter_ranges
(default: True): Compute for each gene its expression range (lowest and highestproject_ignore_range_quantile
(default: 0.02) in both the projected and the corrected values. Compute the overlap between these ranges (shared range divided by the query range). If this is less thanproject_ignore_range_min_overlap_fraction
(default: 0.5), then ignore the gene. If any genes were ignored, repeat steps 2-4 (but do this step no more than 3 times).Invoke
metacells.tools.project.convey_atlas_to_query()
to assign a projected type to each of the query metacells based on theatlas_type_property_name
(default: type).
Then, for each type of query metacells:
If
top_level_parallel
(default:top_level_parallel
), do this in parallel. This seems to work better than doing this serially and using parallelism within each type. However this is still inefficient compared to using both types of parallelism at once, which the code currently can’t do without non-trivial coding (this would have been trivial in Julia…).Further reduce the mask of type-specific fitted genes by ignoring any genes in
ignored_gene_of_<type>
, if this annotation exists in the query. Normalize the sum of the fitted gene fractions to 1 in each metacell.Invoke
metacells.tools.project.compute_projection_weights()
to project each query metacell of the type onto the atlas. Note that even though we only look at query metacells (tentatively) assigned the type, their projection on the atlas may use metacells of any type.Invoke
metacells.tools.quality.compute_projected_folds()
to compute the significant fold factors between the query and its projection.Identify type-specific misfit genes whose fold factor is above
project_max_projection_fold_factor
. Ifconsider_atlas_noisy_genes
and/orconsider_query_noisy_genes
, then any gene listed in either is allowed an additionalproject_max_projection_noisy_fold_factor
. If any gene has such a high fold factor in at leastmisfit_min_metacells_fraction
, remove it from the fitted genes mask and repeat steps 6-9.Invoke
metacells.tools.quality.compute_similar_query_metacells()
to verify which query metacells ended up being sufficiently similar to their projection, usingproject_max_consistency_fold_factor
(default: 2.0),project_max_projection_noisy_fold_factor
(default: 2.0),project_max_misfit_genes
(default: 3), and if needed, the theessential_gene_of_<type>
. Also compute the correlation between the corrected and projected gene fractions for each metacell.
And then:
Invoke
metacells.tools.project.convey_atlas_to_query()
to assign an updated projected type to each of the query metacells based on theatlas_type_property_name
(default: type). If this changed the type assigned to any query metacell, repeat steps 6-11 (but do this step no more than 3 times).
For each query metacell that ended up being dissimilar, try to project them as a combination of two atlas regions:
Reduce the list of fitted genes for the metacell based on the
ignored_gene_of_<type>
for both the primary (initially from the above) query metacell type and the secondary query metacell type (initially empty). Normalize the sum of the gene fractions in the metacell to 1.Invoke
metacells.tools.project.compute_projection_weights()
just for this metacell, allowing the projection to use a secondary location in the atlas based on the residuals of the atlas metacells relative to the primary query projection.Invoke
metacells.tools.project.convey_atlas_to_query()
twice, once for the weights of the primary location and once for the weights of the secondary location, to obtain a primary and secondary type for the query metacell. If these have changed, repeat steps 13-14 (but do these steps no more than 3 times; note will will always do them twice as the 1st run will generate some non-empty secondary type).Invoke
metacells.tools.quality.compute_projected_folds()
andmetacells.tools.quality.compute_similar_query_metacells()
to update the projection and evaluation of the query metacell. If it is now similar, then use the results for the metacell; otherwise, keep the original results as they were at the end of step 10.
- metacells.pipeline.projection.outliers_projection_pipeline(what: str = '__x__', *, adata: AnnData, odata: AnnData, fold_regularization: float = 1e-05, project_min_significant_gene_umis: int = 40, reproducible: bool) None [source]¶
Project outliers on an atlas.
Returns
Sets the following in
odata
:Per-Observation (Cell) Annotations
atlas_most_similar
For each observation (outlier), the index of the “most similar” atlas metacell.
Per-Variable Per-Observation (Gene-Cell) Annotations
atlas_most_similar_fold
The fold factor between the outlier gene expression and their expression in the most similar atlas metacell (unless the value is too low to be of interest, in which case it will be zero).
Computation Parameters
Invoke
metacells.tools.quality.compute_outliers_matches()
using thefold_regularization
(default: 1e-05) andreproducible
.Invoke
metacells.tools.quality.compute_outliers_fold_factors()
using thefold_regularization
(default: 1e-05).
- metacells.pipeline.projection.write_projection_weights(path: str, adata: AnnData, qdata: AnnData, weights: CompressedMatrix) None [source]¶
Write into the
path
theweights
computed for the projection of the queryqdata
on the atlasadata
.Since the weights are (very) sparse, we just write them as a CSV file. This is also what
MCView
expect.
MCView¶
Compute metacell analysis in preparation for exporting the data to MCView.
- metacells.pipeline.mcview.compute_for_mcview(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', find_metacells_marker_genes: Dict[str, Any] | None = {}, compute_umap_by_markers_2: Dict[str, Any] | None = {}, compute_umap_by_markers_3: Dict[str, Any] | None = {}, compute_outliers_matches: Dict[str, Any] | None = {}, compute_deviant_folds: Dict[str, Any] | None = {}, compute_inner_folds: Dict[str, Any] | None = {}, count_significant_inner_folds: Dict[str, Any] | None = {}, compute_stdev_logs: Dict[str, Any] | None = {}, compute_var_var_similarity: Dict[str, Any] | None = {'bottom': 50, 'top': 50}, random_seed: int) None [source]¶
Compute metacell analysis in preparation for exporting the data to MCView.
This simply invokes a series of tools (which you can also invoke independently), using the default parameters, except for the
group
(default: metacell) which allows applying this to any clustering (not only metacells) and therandom_seed
(non-zero for reproducibility).If specific tool parameters need to be specified, you can pass them as a dictionary using the specific tool name (e.g.,
compute_umap_by_markers_2 = dict(spread = 0.5)
. If this parameter is set toNone
, running the tool is skipped.Input
Annotated
adata
, where the observations are cells and the variables are genes, wherewhat
is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.In addition,
gdata
is assumed to have one observation for each group, and use the same genes asadata
.Returns
Sets many properties in
gdata
and some inadata
, see below.Computation Parameters
Use the default parameters for each tool, unless a parameter with the same name provides specific parameter overrides for it. In addition, the special parameters
group
(default: metacell) andrandom_seed
, and thereproducible
flag derived from it (true if the seed is not zero) are automatically passed to all relevant tools.Compute the “marker” metacell genes using
metacells.tools.high.find_metacells_marker_genes()
.Computes UMAP projections by invoking
metacells.pipeline.umap.compute_umap_by_markers()
. This is done twice, once withdimensions=2
for visualization and once withdimensions=3
to capture more of the manifold structure (used to automatically generate cluster colors). Therefore in this case there are two dictionary parameterscompute_umap_by_markers_2
andcompute_umap_by_markers_3
.Compute for each outlier cell the “most similar” metacecell for it using
metacells.tools.quality.compute_outliers_matches()
.Compute for each gene and cell the fold factor between the gene’s expression and that of the metacell it belongs to (or, for outliers, the one it is most similar to) using
metacells.tools.quality.compute_deviant_folds()
.Compute for each gene for each metacell the maximal of the above using
metacells.tools.quality.compute_inner_folds()
.Compute for each gene the number of metacells where the above is high using
metacells.tools.quality.count_significant_inner_folds()
.Compute for each gene for each metacell the standard deviation of the log (base 2) of the fractions of each gene across the cells of the metacell using
metacells.tools.quality.compute_stdev_logs()
.Compute the gene-gene (variable-variable) similarity matrix. Note by default this will use {‘top’: 50, ‘bottom’: 50} which aren’t the normal defaults for
compute_var_var_similarity
, in order to keep just the top correlated genes and bottom (anti-)correlated genes for each gene. Otherwise you will get a dense matrix of ~X0K by ~X0K entries, which typically isn’t what you want.