Pipeline

Default metacell processing pipeline.

The functions here are thin wrappers which invoke a series of steps (metacells.tools) to provide a complete pipeline for computing metacells for your data.

This pipeline can be configured by tweaking the (many) parameters, but for any deeper customization (adding and/or removing steps) just provide your own pipeline instead. You can use the implementation here as a starting point.

All the functions included here are exported under metacells.pl.

metacells.pipeline.related_genes.relate_to_lateral_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, max_sampled_cells: int = 10000, genes_similarity_method: str = 'abs_pearson', genes_cluster_method: str = 'ward', max_genes_of_modules: int = 64, min_mean_gene_fraction: float = 1e-05, min_gene_correlation: float = 0.1, random_seed: int = 0) None[source]

Detect coarse relations between genes and lateral genes based on what (default: __x__) data.

This is a quick-and-dirty way to look for genes highly correlated with lateral genes.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. The data should contain a lateral_gene mask containing some known-to-be lateral genes.

Returns

Variable-pair (Gene) Annotations
lateral_genes_similarity

The similarity between each genes related to lateral genes.

Variable (Gene) Annotations
lateral_genes_module

The index of the related to lateral gene module for each gene.

Computation Parameters

  1. If we have more than max_sampled_cells (default: 10000), pick this number of random cells from the data using the random_seed.

  2. Start by marking as “related genes” all the genes listed in lateral_gene.

  3. Pick as candidates genes whose mean fraction in the population is at least the (very low) min_mean_gene_fraction (default: 1e-05) (and that are not lateral genes).

  4. Compute the correlation between the related genes and the candidate genes. Move from the candidate genes to the lateral genes any genes whose correlation is at least min_gene_correlation (default: 0.1).

  5. Repeat step 4 until no additional related genes are found.

  6. Compute the similarity between the all related genes using metacells.tools.similarity.compute_var_var_similarity() using the genes_similarity_method (default: abs_pearson).

  7. Create a hierarchical clustering of the candidate genes using the genes_cluster_method (default: ward).

  8. Identify gene modules in the hierarchical clustering by bottom-up combining genes as long as we have at most max_genes_of_modules (default: 64). Note that this may leave modules with a smaller number of genes, and modules that do not contain any lateral genes.

Exclude

Raw single-cell RNA sequencing data is notoriously noisy and “dirty”. The pipeline steps here performs initial analysis of the data and exclude some of it, so it would not harm the metacells computation. The steps provided here are expected to be generically useful, but as always specific data sets may require custom cleaning steps on a case-by-case basis.

metacells.pipeline.exclude.exclude_genes(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, bursty_max_sampled_cells: int | None = 10000, bursty_downsample_min_samples: int = 750, bursty_downsample_min_cell_quantile: float = 0.5, bursty_downsample_max_cell_quantile: float = 0.05, bursty_min_gene_total: int = 100, bursty_min_gene_normalized_variance: float = 5.656854249492381, bursty_max_gene_similarity: float = 0.1, properly_sampled_min_gene_total: int | None = 1, excluded_gene_names: Collection[str] | None = None, excluded_gene_patterns: Collection[str | Pattern] | None = None, random_seed: int) None[source]

Exclude a subset of the genes from the metacells computation.

You can also just manually set the excluded_gene mask, or further manipulate it after calling this function.

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
properly_sampled_gene

A mask of the “properly sampled” genes.

excluded_gene

A mask of the genes which were excluded (by name or due to not being properly sampled).

Computation Parameters

  1. Invoke metacells.tools.bursty_lonely.find_bursty_lonely_genes() using bursty_max_sampled_cells (default: 10000), bursty_downsample_min_samples (default: 750), bursty_downsample_min_cell_quantile (default: 0.5), bursty_downsample_max_cell_quantile (default: 0.05), bursty_min_gene_total (default: 100), bursty_min_gene_normalized_variance (default: 5.656854249492381), bursty_max_gene_similarity (default: 0.1), and random_seed. Any “bursty_lonely_genes” will be excluded.

  2. Invoke metacells.tools.properly_sampled.find_properly_sampled_genes() using properly_sampled_min_gene_total (default: 1). Genes which are not properly sampled will be excluded.

  3. Invoke metacells.tools.named.find_named_genes() to also exclude genes based on their name, using the excluded_gene_names (default: None) and excluded_gene_patterns (default: None).

metacells.pipeline.exclude.exclude_cells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, properly_sampled_min_cell_total: int | None, properly_sampled_max_cell_total: int | None, properly_sampled_max_excluded_genes_fraction: float | None, additional_cells_masks: List[str] | None = None) None[source]

Exclude a subset of the cells from the metacells computation.

You can also just manually set the excluded_cell mask, or further manipulate it after calling this function.

Input

Annotated adata, where the observations are cells and the variables are genes. Optionally, may contain an excluded_gene mask of genes to be excluded from the metacells computation. That is, invoke this after calling exclude_genes() (if you wish to exclude any genes).

Returns

Sets the following in the full data:

Observation (cell) annotations:
properly_sampled_cell

A mask of the “properly sampled” cells.

excluded_cell

A mask of the genes which were excluded (inverse of properly_sampled_cell).

Computation Parameters

  1. Invoke metacells.tools.properly_sampled.find_properly_sampled_cells() using properly_sampled_min_cell_total (no default), properly_sampled_max_cell_total (no default) and properly_sampled_max_excluded_genes_fraction (no default).

  2. Exclude any cells which are not properly sampled (|~properly_sampled_cell), with optional additional following additional_cells_masks (using metacells.tools.mask.combine_masks()).

metacells.pipeline.exclude.extract_clean_data(adata: AnnData, *, name: str | None = '.clean', top_level: bool = True) AnnData | None[source]

Extract a “clean” subset of the adata to compute metacells for.

Input

Annotated adata, where the observations are cells and the variables are genes. This may contain per-gene excluded_gene and per-cell excluded_cell annotations.

Returns

Annotated sliced data containing the “clean” subset of the original data (that is, everything not marked as excluded).

The returned data will have full_cell_index and full_gene_index per-observation (cell) and per-variable (gene) annotations to allow mapping the results back to the original data.

Computation Parameters

  1. This simply metacells.tools.filter.filter_data() to slice just the genes not in the excluded_gene mask and the cells not in the excluded_cell mask, data using the name (default: .clean), and tracking the original full_cell_index and full_gene_index.

Mark (Genes)

Even when working with “clean” data, computing metacells requires marking genes for special treatment, specifically “lateral” genes which should not be selected for computing metacells and “noisy” genes which should be neither selected as nor be used to detect deviant (outlier) cells in the metacells.

Proper choice of lateral and noisy genes is mandatory for high-quality metacells generation. However, this choice can’t be fully automated, and therefore relies on the analyst to manually provide the list of genes.

metacells.pipeline.mark.mark_lateral_genes(adata: AnnData, *, lateral_gene_names: Collection[str] | None = None, lateral_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None[source]

Mark a subset of the genes as “lateral”, that is, prevent them from being selected for computing metacells.

Lateral genes are still used to detect outliers, that is, the cells in resulting metacells should still have similar expression level for such genes.

Depending on op, this will either set (override/create) the mask, add (to an existing mask), or remove (from an existing mask).

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
lateral_gene

A mask of the “lateral” genes.

Computation Parameters

  1. Invoke metacells.tools.named.find_named_genes() to also mark lateral genes based on their name, using the lateral_gene_names (default: None), lateral_gene_patterns (default: None) and op (default: set).

metacells.pipeline.mark.mark_noisy_genes(adata: AnnData, *, noisy_gene_names: Collection[str] | None = None, noisy_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None[source]

Mark a subset of the genes as “noisy”, that is, prevent them from being used for detecting deviant (outlier) cells.

Depending on op, this will either set (override/create) the mask, add (to an existing mask), or remove (from an existing mask).

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
noisy_gene

A mask of the “noisy” genes.

Computation Parameters

  1. Invoke metacells.tools.named.find_named_genes() to also mark noisy genes based on their name, using the noisy_gene_names (default: None), noisy_gene_patterns (default: None) and op (default: set).

metacells.pipeline.mark.mark_select_genes(adata: AnnData, *, select_gene_names: Collection[str] | None = None, select_gene_patterns: Collection[str | Pattern] | None = None, op: str = 'set') None[source]

Mark a subset of the genes as “select”, that is, force them to be used when computing metacell.

If this is done, then it overrides the default gene selection mechanism, forcing the algorithm to use a fixed set of genes. In general, this will result in lower-quality metacells, especially when using the divide-and-conquer algorithm, so do not use this unless you really know what you are doing.

Depending on op, this will either set (override/create) the mask, add (to an existing mask), or remove (from an existing mask).

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
select_gene

A mask of the “select” genes.

Computation Parameters

  1. Invoke metacells.tools.named.find_named_genes() to select genes based on their name, using the select_gene_names (default: None), select_gene_patterns (default: None) and op (default: set).

metacells.pipeline.mark.mark_ignored_genes(adata: AnnData, *, ignored_gene_names: Collection[str] | None = None, ignored_gene_patterns: Collection[str | Pattern] | None = None, ignored_gene_names_of_types: Dict[str, Collection[str]] | None = None, ignored_gene_patterns_of_types: Dict[str, Collection[str]] | None = None, op: str = 'set') None[source]

Mark a subset of the genes as “ignored”, that is, do not attempt to match them when projecting this (query) data onto an atlas.

Depending on op, this will either set (override/create) the mask(s), add (to an existing mask(s)), or remove (from an existing mask(s)).

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
ignored_gene

A mask of the “ignored” genes for all the query metacells regardless of their type.

ignored_gene_of_<type>

A mask of the “ignored” genes for query metacells that are assigned a specific type.

Computation Parameters

  1. Invoke metacells.tools.named.find_named_genes() to ignore genes based on their name, using the ignored_gene_names (default: None), ignored_gene_patterns (default: None) and op (default: set).

  2. Similarly for each type specified in ignored_gene_names_of_types and/or ignored_gene_patterns_of_types.

metacells.pipeline.mark.mark_essential_genes(adata: AnnData, *, essential_gene_names_of_types: Dict[str, Collection[str]] | None = None, essential_gene_patterns_of_types: Dict[str, Collection[str]] | None = None, op: str = 'set') None[source]

Mark a subset of the genes as “essential”, that is, require that (most of them) will be projected “well” to accept a query metacell as having the same type as the atlas projection.

Depending on op, this will either set (override/create) the mask(s), add (to an existing mask(s)), or remove (from an existing mask(s)).

Input

Annotated adata, where the observations are cells and the variables are genes.

Returns

Sets the following in the data:

Variable (gene) annotations:
essential_gene_of_<type>

A mask of the “essential” genes for query metacells to be assigned a specific type.

Computation Parameters

  1. Invoke metacells.tools.named.find_named_genes() to ignore genes based on their name, using the essential_gene_names_of_types (default: None), essential_gene_patterns_of_types (default: None) and op (default: set).

Selection

metacells.pipeline.select.extract_selected_data(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, name: str | None = '.select', downsample_min_samples: float = 750, downsample_min_cell_quantile: float = 0.05, downsample_max_cell_quantile: float = 0.5, min_gene_relative_variance: float | None, min_genes: int = 100, min_gene_total: int | None = 50, min_gene_top3: int | None = 4, additional_gene_masks: List[str] = ['&~lateral_gene'], random_seed: int, top_level: bool = True) AnnData[source]

Select a subset of what (default: __x__ data, to compute metacells by.

When computing metacells (or clustering cells in general), it makes sense to use a subset of the genes for computing cell-cell similarity, for both technical (e.g., too low an expression level) and biological (e.g., ignoring bookkeeping and cell cycle genes) reasons. The steps provided here are expected to be generically useful, but as always specific data sets may require custom gene selection steps on a case-by-case basis.

Input

A presumably “clean” Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Will obey the following annotations in the full adata, if they exist:

Variable (Gene) Annotations

select_gene

If exists, force a mask of genes to use as “select” genes, ignoring everything else.

lateral_gene

A boolean mask of genes which are lateral from being chosen as “select” genes based on their name.

Returns

Returns annotated sliced data containing the “select” subset of the original data. By default, the name of this data is .select. If no selects were selected, return None.

Also sets the following annotations in the full adata:

Unstructured Annotations
downsample_samples

The target total number of samples in each downsampled cell.

Observation-Variable (Cell-Gene) Annotations:
downsampled

The downsampled data where the total number of samples in each cell is at most downsample_samples.

Variable (Gene) Annotations
high_total_gene

A boolean mask of genes with “high” expression level (unless a select_gene mask exists).

high_relative_variance_gene

A boolean mask of genes with “high” normalized variance, relative to other genes with a similar expression level (unless a select_gene mask exists).

selected_gene

A boolean mask of the actually selected genes.

Computation Parameters

  1. If a select_gene mask exists, just use these genes and go directly to the last step 6.

  2. Invoke metacells.tools.downsample.downsample_cells() to downsample the cells to the same total number of UMIs, using the downsample_min_samples (default: 750), downsample_min_cell_quantile (default: 0.05), downsample_max_cell_quantile (default: 0.5) and the random_seed (non-zero for reproducible results).

  3. Invoke metacells.tools.high.find_high_total_genes() to select high-expression genes (based on the downsampled data), using min_gene_total.

  4. Invoke metacells.tools.high.find_high_relative_variance_genes() to select high-variance genes (based on the downsampled data), using min_gene_relative_variance.

  5. Compute the set of genes that pass the above test, as well as match the additional_gene_masks (default: [’&~lateral_gene’]).

  6. If we found less than min_genes genes (default: 100, and min_gene_relative_variance was specified, try to achieve the required minimal number of genes by reducing the min_gene_relative_variance. In extreme cases, give up on the relative variance requirement altogether.

  7. Invoke metacells.tools.filter.filter_data() to slice just the selected genes.

Direct

metacells.pipeline.direct.compute_direct_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, knn_k: int | None = None, knn_k_size_factor: float = 2, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_min_gene_fold_factor: float = 3.0, deviants_gap_skip_cells: int = 1, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float = 0.5, dissolve_min_convincing_gene_fold_factor: float | None = 3.0, random_seed: int) None[source]

Directly compute metacells using what (default: __x__) data.

This directly computes the metacells on the whole data. Like any method that directly looks at the whole data at once, the amount of CPU and memory needed becomes unreasonable when the data size grows. Above O(10,000) you are much better off using the divide-and-conquer method.

Note

The current implementation is naive in that it computes the full dense N^2 correlation matrix, and only then extracts the sparse graph out of it. We actually need two copies where each requires 4 bytes per entry, so for O(100,000) cells, we have storage of O(100,000,000,000). In addition, the implementation is serial for the graph clustering phases.

It is possible to mitigate this by fusing the correlations phase and the graph generation phase, parallelizing the result, and also (somehow) parallelizing the graph clustering phase. This might increase the “reasonable” size for the direct approach to O(100,000).

We have decided not to invest in this direction since it won’t allow us to push the size to O(1,000,000) and above. Instead we provide the divide-and-conquer method, which easily scales to O(1,000,000) on a single multi-core server, and to “unlimited” size if we further enhance the implementation to use a distributed compute cluster of such servers.

Input

The presumably “clean” annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Sets the following annotations in adata:

Unstructured Annotations
downsample_samples

The target total number of samples in each downsampled cell.

Observation-Variable (Cell-Gene) Annotations:
downsampled

The downsampled data where the total number of samples in each cell is at most downsample_samples.

Variable (Gene) Annotations
high_total_gene

A boolean mask of genes with “high” expression level (unless a select_gene mask exists).

high_relative_variance_gene

A boolean mask of genes with “high” normalized variance, relative to other genes with a similar expression level (unless a select_gene mask exists).

selected_gene

A boolean mask of the actually selected genes.

Observation (Cell) Annotations
metacell

The integer index of the metacell each cell belongs to. The metacells are in no particular order. Cells with no metacell assignment (“outliers”) are given a metacell index of -1.

Computation Parameters

  1. If cell_umis is not specified, use the sum of the what data for each cell.

  2. Invoke metacells.pipeline.select.extract_selected_data() to extract “select” data from the clean data, using the select_downsample_min_samples (default: 750), select_downsample_min_cell_quantile (default: 0.05), select_downsample_max_cell_quantile (default: 0.5), select_min_gene_total (default: 50), select_min_gene_top3 (default: 4), select_min_gene_relative_variance (default: 0.1) and random_seed.

  3. Compute the fractions of each variable in each cell, and add the cells_similarity_value_regularization (default: 0.14285714285714285) to it.

  4. If cells_similarity_log_data (default: True), invoke the metacells.utilities.computation.log_data() function to compute the log (base 2) of the data.

  5. Invoke metacells.tools.similarity.compute_obs_obs_similarity() to compute the similarity between each pair of cells, using the cells_similarity_method (default: abs_pearson).

  6. Invoke metacells.tools.knn_graph.compute_obs_obs_knn_graph() to compute a K-Nearest-Neighbors graph, using the knn_balanced_ranks_factor (default: 3.1622776601683795), knn_incoming_degree_factor (default: 3.0) knn_outgoing_degree_factor (default: 1.0) and knn_min_outgoing_degree (default: 2)(. If knn_k (default: None) is not specified, then it is chosen to be the knn_k_size_factor (default: 2 times the median number of cells required to reach the target metacell size, or the knn_k_umis_quantile (default: 0.1) the number of cells required, or min_knn_k (default: 30), whichever is largest.

  7. Invoke metacells.tools.candidates.compute_candidate_metacells() to compute the candidate metacells, using the min_seed_size_quantile (default: 0.85), max_seed_size_quantile (default: 0.95), candidates_cooldown_pass (default: 0.02), candidates_cooldown_node (default: 0.25), candidates_cooldown_phase (default: 0.75), candidates_min_split_size_factor (default: 2.0), candidates_max_merge_size_factor (default: 0.5), and random_seed. This tries to build metacells of the target_metacell_size (default: 48), min_metacell_size (default: 12), and target_metacell_umis (default: 160000) using the cell_umis (default: None).

  8. Unless must_complete_cover (default: False), invoke metacells.tools.deviants.find_deviant_cells() to remove deviants from the candidate metacells, using the deviants_policy (default: gaps), deviants_min_gene_fold_factor (default: 3.0), deviants_gap_skip_cells (default: 1), deviants_min_noisy_gene_fold_factor (default: 2.0), deviants_max_gene_fraction (default: 0.03) and deviants_max_cell_fraction (default: 0.25).

  1. Unless must_complete_cover (default: False), invoke metacells.tools.dissolve.dissolve_metacells() to dissolve small unconvincing metacells, using the same target_metacell_size (default: 48), min_metacell_size (default: 12), target_metacell_umis (default: 160000), cell_umis (default: None) and the dissolve_min_robust_size_factor (default: 0.5).

Divide and Conquer

metacells.pipeline.divide_and_conquer.set_max_parallel_piles(max_parallel_piles: int) None[source]

Set the (maximal) number of piles to compute in parallel.

By default, we use all the available hardware threads. Override this by setting the METACELLS_MAX_PARALLEL_PILES environment variable or by invoking this function from the main thread.

A value of 0 will use all the available processors (see metacells.utilities.parallel.set_processors_count()). Otherwise, the value is the positive maximal number of processors to use in parallel for computing piles.

It may be useful to restrict the number of parallel piles to restrict the total amount of memory used by the application, to keep it within the physical RAM available.

metacells.pipeline.divide_and_conquer.get_max_parallel_piles() int[source]

Return the maximal number of piles to compute in parallel.

metacells.pipeline.divide_and_conquer.guess_max_parallel_piles(cells: AnnData, what: str = '__x__', *, max_gbs: float = -0.1, target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100) int[source]

Try and guess a reasonable maximal number of piles to use for computing metacells for the specified cells using at most max_gbs of memory (default: -0.1, that is 90% of all the machine has - if zero or negative, is relative to the machines memory).

The amount of memory used depends on the target pile size, so give this function the same parameters as for metacells.pipeline.divide_and_conquer.divide_and_conquer_pipeline() so it will use the same automatically adjusted target pile size.

Note

This is only a best-effort guess. A too-low number would slow down the computation by using less processors than it could. A too-high number might cause the computation to crash, running out of memory. So: use with care, YMMV, keep an eye on memory usage and other applications running in parallel to the computation, and apply common sense.

Computation Parameters

  1. Figure out the target pile size by invoking metacells.pipeline.divide_and_conquer.compute_target_pile_size() using the target_metacell_umis (default: 160000), cell_umis (default: None), target_metacell_size (default: 48), min_target_pile_size (default: 8000), max_target_pile_size (default: 24000) and target_metacells_in_pile (default: 100).

  2. Use psutil.virtual_memory to query the amount of memory in the system, and apply max_gbs (default: -0.1) off the top (e.g., -0.1 means reduce it by 10%).

  3. Guesstimate the number of piles that, if computed in parallel, will consume this amount of memory.

metacells.pipeline.divide_and_conquer.compute_target_pile_size(adata: AnnData, what: str = '__x__', *, cell_umis: ndarray | None = None, target_metacell_size: int = 48, target_metacell_umis: int = 160000, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100) int[source]

Compute how many cells should be in each divide-and-conquer pile.

Larger piles are slower to process (O(N^2)) than smaller piles, but (up to a point) they allow the algorithm to produce better results. The ideal pile size is just big enough to allow for a sufficient number of metacells to be created; specifically, we’d like the number of metacells in each pile to be roughly the same as the number of cells in a metacell, which is natural for the divide and conquer algorithm.

Note that the target pile size is just a target. It is directly used in the 1st phase of the divide-and-conquer algorithm, but the pile sizes of the 2nd phase are determined by clustering metacells together which inevitably cause the actual pile sizes to vary around this number.

Computation Parameters

  1. If cell_umis is not specified, use the sum of the what data for each cell.

  2. The natural pile size is the product of the target_metacell_size (default: 48) and the target_metacells_in_pile (default: 100).

  3. Also consider the mean number of cells needed to reach the target_metacell_umis (default: 160000); this times the target_metacells_in_pile gives us another pile size, and if this one is larger than what we computed in the previous step, we use it instead.

  4. Clamp this value to be no lower than min_target_pile_size (default: 8000) and no higher than max_target_pile_size (default: 24000). That is, if you set both to the same value, this will force the target pile size value.

metacells.pipeline.divide_and_conquer.divide_and_conquer_pipeline(adata: AnnData, what: str = '__x__', *, rare_max_genes: int = 500, rare_max_gene_cell_fraction: float = 0.001, rare_min_gene_maximum: int = 7, rare_genes_similarity_method: str = 'repeated_pearson', rare_genes_cluster_method: str = 'ward', rare_min_genes_of_modules: int = 4, rare_min_cells_of_modules: int = 12, rare_min_module_correlation: float = 0.1, rare_min_related_gene_fold_factor: float = 7, rare_max_related_gene_increase_factor: float = 4.0, rare_min_cell_module_total: int = 4, rare_max_cells_factor_of_random_pile: float = 0.5, rare_deviants_max_cell_fraction: float | None = 0.25, rare_dissolve_min_robust_size_factor: float | None = 0.5, rare_dissolve_min_convincing_gene_fold_factor: float = 3.0, quick_and_dirty: bool = False, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', groups_similarity_log_data: bool = True, groups_similarity_method: str = 'abs_pearson', target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_metacell_size: int = 12, target_metacells_in_pile: int = 100, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, piles_knn_k_size_factor: float = 3, piles_min_split_size_factor: float = 1.5, piles_min_robust_size_factor: float = 0.3, piles_max_merge_size_factor: float = 0.15, knn_k: int | None = None, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_knn_k_size_factor: float = 2, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float | None = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_gap_skip_cells: int = 1, deviants_min_gene_fold_factor: float = 3.0, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float | None = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float | None = 0.5, dissolve_min_convincing_gene_fold_factor: float = 3.0, random_seed: int) None[source]

Complete pipeline using divide-and-conquer to compute the metacells for the what (default: __x__) data.

If a progress_bar is active, progress will be reported into the current (slice of) the progress bar. The detection of rare gene modules is not counted in the progress bar; instead it is logged at the INFO level.

Note

This is applicable to “any size” data. If the data is “small” (O(10,000)), it will revert to using the direct metacell computation (but will still by default first look for rare gene modules). If the data is “large” (up to O(10,000,000)), this will be much faster and will require much less memory than using the direct approach. The current implementation is not optimized for “huge” data (O(1,000,000,000)) - that is, it will work, and keep will use a limited amount of memory, but a faster implementation would distribute the computation across multiple servers.

Input

The presumably “clean” annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Sets the following annotations in adata:

Everything set by compute_divide_and_conquer_metacells; and in addition:

Variable (Gene) Annotations
rare_gene_module_<N>

A boolean mask for the genes in the rare gene module N.

rare_gene

A boolean mask for the genes in any of the rare gene modules.

Observations (Cell) Annotations
cells_rare_gene_module

The index of the rare gene module each cell expresses the most, or -1 in the common case it does not express any rare genes module.

rare_cell

A boolean mask for the (few) cells that express a rare gene module.

Computation Parameters

  1. If cell_umis is not specified, use the sum of the what data for each cell.

  2. Invoke metacells.pipeline.divide_and_conquer.compute_target_pile_size() using target_metacell_umis (default: 160000), cell_umis (default: None), target_metacell_size (default: 48), min_target_pile_size (default: 8000), max_target_pile_size (default: 24000) and target_metacells_in_pile (default: 100).

  3. Invoke metacells.tools.rare.find_rare_gene_modules() to isolate cells expressing rare gene modules, using the rare_max_genes (default: 500), rare_max_gene_cell_fraction (default: 0.001), rare_min_gene_maximum (default: 7), rare_genes_similarity_method (default: repeated_pearson), rare_genes_cluster_method (default: ward), rare_min_genes_of_modules (default: 4), rare_min_cells_of_modules (default: 12), rare_min_module_correlation (default: 0.1), rare_min_related_gene_fold_factor (default: 7) rare_max_related_gene_increase_factor (default: 4.0) rare_max_cells_factor_of_random_pile (default: 0.5) and rare_min_cell_module_total (default: 4). Use a non-zero random_seed to make this reproducible.

  4. For each detected rare gene module, collect all cells that express the module, and apply metacells.pipeline.direct.compute_direct_metacells() to them.

  5. Collect all the outliers from the above together with the rest of the cells (which express no rare gene module) and apply the compute_divide_and_conquer_metacells() to the combined result. The annotations for rare-but-outlier cells (e.g. cell_directs) will not reflect the work done when computing the rare gene module metacells which they were rejected from.

metacells.pipeline.divide_and_conquer.compute_divide_and_conquer_metacells(adata: AnnData, what: str = '__x__', *, quick_and_dirty: bool = False, select_downsample_min_samples: int = 750, select_downsample_min_cell_quantile: float = 0.05, select_downsample_max_cell_quantile: float = 0.5, select_min_gene_total: int | None = 50, select_min_gene_top3: int | None = 4, select_min_gene_relative_variance: float | None = 0.1, select_min_genes: int = 100, cells_similarity_value_regularization: float = 0.14285714285714285, cells_similarity_log_data: bool = True, cells_similarity_method: str = 'abs_pearson', groups_similarity_log_data: bool = True, groups_similarity_method: str = 'abs_pearson', target_metacell_umis: int = 160000, cell_umis: ndarray | None = None, target_metacell_size: int = 48, min_metacell_size: int = 12, min_target_pile_size: int = 8000, max_target_pile_size: int = 24000, target_metacells_in_pile: int = 100, piles_knn_k_size_factor: float = 3, piles_min_split_size_factor: float = 1.5, piles_min_robust_size_factor: float = 0.3, piles_max_merge_size_factor: float = 0.15, knn_k: int | None = None, knn_k_umis_quantile: float = 0.1, min_knn_k: int | None = 30, knn_balanced_ranks_factor: float = 3.1622776601683795, knn_incoming_degree_factor: float = 3.0, knn_outgoing_degree_factor: float = 1.0, knn_min_outgoing_degree: int = 2, min_seed_size_quantile: float = 0.85, max_seed_size_quantile: float = 0.95, candidates_cooldown_pass: float = 0.02, candidates_cooldown_node: float = 0.25, candidates_cooldown_phase: float = 0.75, candidates_knn_k_size_factor: float = 2, candidates_min_split_size_factor: float = 2.0, candidates_max_merge_size_factor: float = 0.5, candidates_max_split_min_cut_strength: float | None = 0.1, candidates_min_cut_seed_cells: int = 7, must_complete_cover: bool = False, deviants_policy: str = 'gaps', deviants_gap_skip_cells: int = 1, deviants_min_gene_fold_factor: float = 3.0, deviants_min_noisy_gene_fold_factor: float = 2.0, deviants_max_gene_fraction: float | None = 0.03, deviants_max_cell_fraction: float | None = 0.25, deviants_max_gap_cells_count: int = 3, deviants_max_gap_cells_fraction: float = 0.1, dissolve_min_robust_size_factor: float | None = 0.5, dissolve_min_convincing_gene_fold_factor: float = 3.0, random_seed: int, _fdata: AnnData | None = None, _counts: Dict[str, int] | None = None) None[source]

Compute metacells for what (default: {what}) data using the divide-and-conquer method.

This divides large data to smaller “piles” and directly computes metacells for each, then generates new piles of “similar” metacells and directly computes the final metacells from each such pile. Due to this divide-and-conquer approach, the total amount of memory required is bounded (by the pile size), and the amount of total CPU grows slowly with the problem size, allowing this method to be applied to very large data (millions of cells).

If a progress bar is active, progress will be reported into the current (slice of) the progress bar.

Input

The presumably “clean” annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Sets the following annotations in adata:

Observations (Cell) Annotations
metacell

The integer index of the final metacell each cell belongs to. The metacells are in no particular order. This is -1 for cells not included in any metacell (outliers, excluded).

dissolve

A boolean mask of the cells which were in a dissolved metacell (in the last pile they were in).

Computation Parameters

  1. If cell_umis is not specified, use the sum of the what data for each cell.

  2. Invoke metacells.pipeline.divide_and_conquer.compute_target_pile_size() using target_metacell_umis (default: {target_metacell_umis}), cell_umis (default: {cell_umis}), target_metacell_size (default: {target_metacell_size}), min_target_pile_size (default: {min_target_pile_size}), max_target_pile_size (default: {max_target_pile_size}) and target_metacells_in_pile (default: {target_metacells_in_pile}).

  3. If the data is smaller than the target_pile_size times the piles_min_split_size_factor (default: {piles_min_split_size_factor}), then just invoke metacells.pipeline.direct.compute_direct_metacells() using the parameters. Otherwise, perform the following steps.

  4. Phase 0 (preliminary): group the cells randomly into equal-sized piles of roughly the target_pile_size using the random_seed (default: {random_seed}) to allow making this replicable. Otherwise, use the groups of metacells computed in the preliminary phase to create the piles of the final phase.

  5. Compute metacells using the piles by invoking metacells.pipeline.direct.compute_direct_metacells(), possibly in parallel.

  6. Collect the outliers from all the piles and recursively compute metacells for them. Continue to compute metacells for the outliers-of-outliers, etc. until all the outliers fit in a single pile; then compute metacells for this outlier-most cells with must_complete_cover=True.

  7. Phase 0 to 1: Recursively run the algorithm grouping the preliminary metacells into groups-of-metacells, using piles_knn_k_size_factor (default: {piles_knn_k_size_factor}), piles_min_split_size_factor (default: {piles_min_split_size_factor}), piles_min_robust_size_factor (default: {piles_min_robust_size_factor}), and piles_max_merge_size_factor (default: {piles_max_merge_size_factor}).

  8. Phase 1 (final): group the cells into piles based on the groups-of-metacells. Compute metacells using the piles, by invoking metacells.pipeline.direct.compute_direct_metacells(), possibly in parallel.

  9. Collect the outliers from all the piles and compute metacells for them (possibly in parallel). If quick_and_dirty, stop. Otherwise, continue to group all the cells in metacells by grouping the outliers-of-outliers similarly to step 5.

  10. Phase 1 to 2: Recursively run the algorithm grouping the outlier metacells into groups-of-metacells, using piles_knn_k_size_factor (default: {piles_knn_k_size_factor}), piles_min_split_size_factor (default: {piles_min_split_size_factor}), piles_min_robust_size_factor (default: {piles_min_robust_size_factor}), and piles_max_merge_size_factor (default: {piles_max_merge_size_factor}).

  11. Phase 2 (outliers): Group the cells into piles based on the groups-of-metacells. Compute metacells using the piles, by invoking metacells.pipeline.direct.compute_direct_metacells(), possibly in parallel. Stop here and accept all outliers as final outliers (that is, there are no level 2 metacells).

Collect

metacells.pipeline.collect.collect_metacells(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, metacell_geo_mean: bool = True, metacell_umis_regularization: float = 0.0625, zeros_cell_size_quantile: float = 0.1, groups: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', name: str = 'metacells', prefix: str | None = 'M', top_level: bool = True, _metacell_groups: bool = False, random_seed: int) AnnData[source]

Collect computed metacells what (default: {what}) data.

Input

Annotated (presumably “clean”) adata, where the observations are cells and the variables are genes, and where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

Returns

Annotated metacell data containing for each observation (metacell) for each variable (gene) the fraction of the gene in the metacell, and the following annotations:

Unstructured (Scalar)
outliers

The number of outlier cells (which were not assigned to any metacell).

metacells_algorithm

The version of the algorithm generating the metacell (based on the package version number).

Variable (Gene) Annotations
*

Every per-gene annotations of adata is copied to the result.

Observations (Metacell) Annotations
grouped

The number of (“clean”) cells grouped into each metacell.

total_umis

The total of all the UMIs of all the genes of all the cells grouped into the metacell.

Observations-Variables (Metacell-Gene) Annotations
total_umis

The total of all the UMIs of each genes in all the cells grouped into the metacell.

Also sets in the full adata:

Observations (Cell) Annotations
metacell_name

The string name of the metacell, which is prefix (default: {prefix}) followed by the metacell index, followed by .checksum where the checksum is a 2 digits, reflecting the (names of the) set of cells grouped into each metacell. This keeps metacell names short, and provides protection against assigning per-metacell annotations from a different metacell computation to the new results (unless they are identical).

Computation Parameters

For each metacell:

  1. If metacell_geo_mean (default: {metacell_geo_mean}),

    1. Compute the fraction of each gene out of each cell grouped into the metacell.

    2. For each cell, add to the fractions the metacell_umis_regularization divided by the total UMIs of the cell.

    3. For each gene, compute the weighted geomean of these fractions across all the cells, where the weight of each cell is the log of its total number of UMIs, and

    4. Subtract the geomean of the per-cell regularization so all-zero genes would have a zero fraction.

  2. Otherwise, for each metacell, sum the total UMIs of each gene across all cells, and divide it by the total UMIs of all genes in all cells.

  3. Normalize the per-gene fractions so their sum would be 1.0 in the metacell.

  4. Compute the zeros_cell_size_quantile of the total UMIs of the cells to pick the number of UMIs to use for zeros computation.

  5. For each gene in each cell, compute the probability it would be zero if we downsampled the cell to have this number of UMIs.

  6. Use these probabilities to decide whether the each gene is “effectively zero” in each cell.

  7. For each gene, count the number of cells in which it is “effectively zero”.

In addition:

  1. Pass all relevant per-gene annotations from adata to the result.

  2. Set the metacell_name property of each cell in adata to the name of the group (metacell) it belongs to. Cells which do not belong to any metacell are assigned the metacell name Outliers.

UMAP

It is useful to project the metacells data to a 2D scatter plot (where each point is a metacell). The steps provided here are expected to yield a reasonable such projection, though as always you might need to tweak the parameters or even the overall flow for specific data sets.

metacells.pipeline.umap.compute_knn_by_markers(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, marker_gene_names: Collection[str] | None = None, marker_gene_patterns: Collection[str | Pattern] | None = None, max_marker_genes: int | None = 1000, ignore_lateral_genes: bool = True, similarity_value_regularization: float = 1e-05, similarity_log_data: bool = True, similarity_method: str = 'logistics_abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, k: int, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 1, reproducible: bool) PandasFrame[source]

Compute KNN graph between metacells based on marker genes.

If reproducible is True, a slower (still parallel) but reproducible algorithm will be used to compute Pearson correlations.

Input

Annotated adata where each observation is a metacells and the variables are genes, are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. Should contain a marker_gene mask unless explicitly specifying the marker genes.

Returns

Sets the following in adata:

Observations-Pair (Metacells) Annotations
obs_outgoing_weights

A sparse square matrix where each non-zero entry is the weight of an edge between a pair of cells or genes, where the sum of the weights of the outgoing edges for each element is 1 (there is always at least one such edge).

Also return a pandas data frame of the similarities between the observations (metacells).

Computation Parameters

  1. If marker_gene_names and/or marker_gene_patterns were specified, use the matching genes. Otherwise, use the marker_gene mask.

  2. If ignore_lateral_genes (default: True), then remove any genes marked as lateral from the mask.

  3. If max_marker_genes (default: 1000) is not None, then pick this number of marker genes with the highest variance.

  4. Compute the fractions of each marker_gene in each cell, and add the similarity_value_regularization (default: 1e-05) to it.

  5. If similarity_log_data (default: True), invoke the metacells.utilities.computation.log_data() function to compute the log (base 2) of the data.

  6. Invoke metacells.tools.similarity.compute_obs_obs_similarity() using similarity_method (default: logistics_abs_pearson), logistics_location (default: 0.5) and logistics_slope (default: 0.5) and convert this to distances.

  7. Invoke metacells.tools.knn_graph.compute_obs_obs_knn_graph() using the distances, k (no default!), balanced_ranks_factor (default: 3.1622776601683795), incoming_degree_factor (default: 3.0), outgoing_degree_factor (default: 1.0) and min_outgoing_degree (default: 1) to compute a “skeleton” graph to overlay on top of the UMAP graph.

metacells.pipeline.umap.compute_umap_by_markers(adata: AnnData, what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, marker_gene_names: Collection[str] | None = None, marker_gene_patterns: Collection[str | Pattern] | None = None, max_marker_genes: int | None = 1000, ignore_lateral_genes: bool = True, similarity_value_regularization: float = 1e-05, similarity_log_data: bool = True, similarity_method: str = 'logistics_abs_pearson', logistics_location: float = 0.8, logistics_slope: float = 0.5, skeleton_k: int = 4, balanced_ranks_factor: float = 3.1622776601683795, incoming_degree_factor: float = 3.0, outgoing_degree_factor: float = 1.0, min_outgoing_degree: int = 1, umap_k: int = 15, dimensions: int = 2, min_dist: float = 0.5, spread: float = 1.0, random_seed: int) None[source]

Compute a UMAP projection of the (meta)cells.

Input

Annotated adata where each observation is a metacells and the variables are genes, are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix. Should contain a marker_gene mask unless explicitly specifying the marker genes.

Returns

Sets the following annotations in adata:

Observation-Observation (Metacell-Metacell) Annotations
umap_distances

A sparse symmetric matrix of the graph of distances between the metacells.

Observation (Metacell) Annotations
x, y (if dimensions is 2)

The X and Y coordinates of each metacell in the UMAP projection.

u, v, w (if dimensions is 3)

The U, V, W coordinates of each metacell in the UMAP projection.

Computation Parameters

  1. Invoke metacells.pipeline.umap.compute_knn_by_markers() using marker_gene_names (default: None), marker_gene_patterns (default: None), max_marker_genes (default: 1000), ignore_lateral_genes (default: True), similarity_value_regularization (default: 1e-05), similarity_log_data (default: True), similarity_method (default: logistics_abs_pearson), logistics_location (default: 0.8), logistics_slope (default: 0.5), skeleton_k (default: 4), balanced_ranks_factor (default: 3.1622776601683795), incoming_degree_factor (default: 3.0) and outgoing_degree_factor (default: 1.0) and min_outgoing_degree (default: 1) to compute a “skeleton” graph to overlay on top of the UMAP graph. Specify a non-zero random_seed to make this reproducible.

  2. Invoke metacells.tools.layout.umap_by_distances() using the distances, umap_k (default: 15), min_dist (default: 0.5), spread (default: 1.0), dimensions (default: 2) and random_seed.

Note

Keep in mind that the KNN graph used by UMAP (controlled by umap_k) is not identical to the KNN graph we compute (controlled by skeleton_k). By default, we choose skeleton_k < umap_k, as the purpose of the skeleton KNN is to highlight the “strong” structure of the data; in practice this strong skeleton is highly compatible with the structure used by UMAP, so it serves it purpose reasonably well. It would have been nice to make these compatible, but UMAP is not friendly towards dictating a KNN graph from the outside.

Projection

metacells.pipeline.projection.projection_pipeline(what: str = '__x__', *, adata: AnnData, qdata: AnnData, only_atlas_marker_genes: bool = True, only_query_marker_genes: bool = False, ignore_atlas_lateral_genes: bool = True, ignore_query_lateral_genes: bool = True, consider_atlas_noisy_genes: bool = True, consider_query_noisy_genes: bool = True, misfit_min_metacells_fraction: float = 0.5, project_log_data: bool = True, project_fold_regularization: float = 1e-05, project_candidates_count: int = 50, project_min_candidates_fraction: float = 0.3333333333333333, project_min_significant_gene_umis: int = 40, project_min_usage_weight: float = 1e-05, project_filter_ranges: bool = True, project_ignore_range_quantile: float = 0.02, project_ignore_range_min_overlap_fraction: float = 0.5, project_min_query_markers_fraction: float = 0.3333333333333333, project_max_consistency_fold_factor: float = 2.0, project_max_projection_fold_factor: float = 3.0, project_max_projection_noisy_fold_factor: float = 2.0, project_max_misfit_genes: int = 3, project_min_essential_genes_fraction: float | None = 0.75, atlas_type_property_name: str = 'type', project_corrections: bool = False, project_min_corrected_gene_correlation: float = 0.8, project_min_corrected_gene_factor: float = 0.15, reproducible: bool, top_level_parallel: bool = True) CompressedMatrix[source]

Complete pipeline for projecting query metacells onto an atlas of metacells for the what (default: __x__) data.

Input

Annotated query qdata and atlas adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix, containing the fraction of each gene in each cell. The atlas should also contain a type per-observation (metacell) annotation.

The qdata may include per-gene masks, ignored_gene and ignored_gene_of_<type>, which force the code below to ignore the marked genes from either the preliminary projection or the following refined type-specific projections.

Returns

A matrix whose rows are query metacells and columns are atlas metacells, where each entry is the weight of the atlas metacell in the projection of the query metacells. The sum of weights in each row (that is, for a single query metacell) is 1. The weighted sum of the atlas metacells using these weights is the “projected” image of the query metacell onto the atlas.

Variable (Gene) Annotations
atlas_gene

A mask of the query genes that also exist in the atlas. We match genes by their name; if projecting query data from a different technology, we expect the caller to modify the query gene names to match the atlas before projecting it.

atlas_lateral_gene, atlas_noisy_gene, atlas_marker_gene, essential_gene_of_<type>

Copied from the atlas to the query (False for non-atlas_gene).

projected_noisy_gene

The mask of the genes that were considered “noisy” when computing the projection. By default this is the union of the noisy atlas and query genes.

correction_factor (if project_corrections)

If projecting a query on an atlas with different technologies (e.g., 10X v3 to 10X v2), an automatically computed factor we multiplied the query gene fractions by to compensate for the systematic difference between the technologies (1.0 for uncorrected genes and 0.0 for non-atlas_gene).

fitted_gene_of_<type>

For each type, the genes that were projected well from the query to the atlas for most cells of that type; any atlas_gene outside this mask failed to project well from the query to the atlas for most metacells of this type. For non-atlas_gene this is set to False.

misfit_gene_of_<type>

For each query metacell type, a boolean mask indicating whether the gene has a strong bias in the query metacells of this type compared to the atlas metacells of this type.

ignored_gene_of_<type>

For each query metacell type, a boolean mask indicating whether the gene was ignored by the projection (for any reason) when computing the projection for metacells of this type.

Observation (Cell) Annotations
total_atlas_umis

The total UMIs of the atlas_gene in each query metacell. This is used in the analysis as described for total_umis above, that is, to ensure comparing expression levels will ignore cases where the total number of UMIs of both compared gene profiles is too low to make a reliable determination. In such cases we take the fold factor to be 0.

projected_type

For each query metacell, the best atlas type we can assign to it based on its projection. Note this does not indicate that the query metacell is “truly” of this type; to make this determination one needs to look at the quality control data below.

projected_secondary_type

In some cases, a query metacell may fail to project well to a single region of the atlas, but does project well to a combination of two distinct atlas regions. This may be due to the query metacell containing doublets, of a mixture of cells which match different atlas regions (e.g. due to sparsity of data in the query data set). Either way, if this happens, we place here the type that best describes the secondary region the query metacell was projected to; otherwise this would be the empty string. Note that the weights matrix above does not distinguish between the regions.

projected_correlation per query metacell

The correlation between between the corrected_fraction and the projected_fraction for only the fitted_gene expression levels of each query metacell. This serves as a very rough estimator for the quality of the projection for this query metacell (e.g. can be used to compute R^2 values).

In general we expect high correlation (more than 0.9 in most metacells) since we restricted the fitted_gene mask only to genes we projected well.

similar mask per query metacell

A conservative determination of whether the query metacell is “similar” to its projection on the atlas. This is based on whether the number of misfit_gene for the query metacell is low enough (by default, up to 3 genes), and also that at least 75% of the essential_gene of the query metacell were not misfit_gene. Note that this explicitly allows for a projected_secondary_type, that is, a metacell of doublets will be “similar” to the atlas, but a metacell of a novel state missing from the atlas will be “dissimilar”.

The final determination of whether to accept the projection is, as always, up to the analyst, based on prior biological knowledge, the context of the collection of the query (and atlas) data sets, etc. The analyst need not (indeed, should not) blindly accept the similar determination without examining the rest of the quality control data listed above.

Observation-Variable (Cell-Gene) Annotations
corrected_fraction per gene per query metacell

For each atlas_gene, its fraction in each query metacell, out of only the atlas genes. This may be further corrected (see below) if projecting between different scRNA-seq technologies (e.g. 10X v2 and 10X v3). For non-atlas_gene this is 0.

projected_fraction per gene per query metacell

For each atlas_gene, its fraction in its projection on the atlas. This projection is computed as a weighted average of some atlas metacells (see below), which are all sufficiently close to each other (in terms of gene expression), so averaging them is reasonable to capture the fact the query metacell may be along some position on some gradient that isn’t an exact match for any specific atlas metacell. For non-atlas_gene this is 0.

fitted mask per gene per query metacell

For each atlas_gene for each query metacell, whether the gene was expected to be projected well, based on the query metacell projected_type (and the projected_secondary_type, if any). For non-atlas_gene this is set to False. This does not guarantee the gene was actually projected well.

misfit

For each atlas_gene for each query metacell, whether the corrected_fraction of the gene was significantly different from the projected_fractions (that is, whether the gene was not projected well for this metacell). For non-atlas_gene this is set to False, to make it easier to identify problematic genes.

This is expected to be rare for fitted_gene and common for the rest of the atlas_gene. If too many fitted_gene are also misfit_gene, then one should be suspicious whether the query metacell is “truly” of the projected_type.

essential

Which of the atlas_gene were also listed in the essential_gene_of_<type> for the projected_type (and also the projected_secondary_type, if any) of each query metacell.

If an essential_gene is also a misfit_gene, then one should be very suspicious whether the query metacell is “truly” of the projected_type.

projected_fold per gene per query metacell

The fold factor between the corrected_fraction and the projected_fraction (0 for non-atlas_gene). If the absolute value of this is high (3 for 8x ratio) then the gene was not projected well for this metacell. This will be 0 for non-atlas_gene.

It is expected this would have low values for most fitted_gene and high values for the rest of the atlas_gene, but specific values will vary from one query metacell to another. This allows the analyst to make fine-grained determination about the quality of the projection, and/or identify quantitative differences between the query and the atlas (e.g., when studying perturbed systems such as knockouts or disease models).

Computation Parameters

  1. Find the subset of genes that exist in both the query and the atlas. All computations will be done on this common subset. Normalize the fractions of the fitted gene fractions to sum to 1 in each metacell.

Compute preliminary projection:

  1. Compute a mask of fitted genes, ignoring any genes included in ignored_gene. If only_atlas_marker_genes (default: only_atlas_marker_genes), ignore any non-marker_gene of the atlas. If only_query_marker_genes (default: only_query_marker_genes), ignore any non-marker_gene of the query. If ignore_atlas_lateral_genes (default: True), ignore the lateral_gene of the atlas. If ignore_query_lateral_genes (default: True), ignore the lateral_gene of the atlas. Normalize the fractions of the fitted gene fractions to sum to 1 in each metacell.

  2. Invoke metacells.tools.project.compute_projection_weights() to project each query metacell onto the atlas, using the project_log_data (default: True), project_fold_regularization (default: 1e-05), project_min_significant_gene_umis (default: 40), project_max_consistency_fold_factor (default: 2.0), project_candidates_count (default: 50), project_min_usage_weight (default: 1e-05), and reproducible.

  3. If project_corrections (default: False: Correlate the expression levels of each gene between the query and projection. If this is at least project_min_corrected_gene_correlation (default: 0.8), compute the ratio between the mean expression of the gene in the projection and the query. If this is at most 1/(1+``project_min_corrected_gene_factor``) or at least (1+``project_min_corrected_gene_factor``) (default: 0.15), then multiply the gene’s value by this factor so its level would match the atlas. As usual, ignore genes which do not have at least project_min_significant_gene_umis. If any genes were corrected, then repeat steps 2-3 (but do these steps no more than 3 times).

  4. If project_filter_ranges (default: True): Compute for each gene its expression range (lowest and highest project_ignore_range_quantile (default: 0.02) in both the projected and the corrected values. Compute the overlap between these ranges (shared range divided by the query range). If this is less than project_ignore_range_min_overlap_fraction (default: 0.5), then ignore the gene. If any genes were ignored, repeat steps 2-4 (but do this step no more than 3 times).

  5. Invoke metacells.tools.project.convey_atlas_to_query() to assign a projected type to each of the query metacells based on the atlas_type_property_name (default: type).

Then, for each type of query metacells:

If top_level_parallel (default: top_level_parallel), do this in parallel. This seems to work better than doing this serially and using parallelism within each type. However this is still inefficient compared to using both types of parallelism at once, which the code currently can’t do without non-trivial coding (this would have been trivial in Julia…).

  1. Further reduce the mask of type-specific fitted genes by ignoring any genes in ignored_gene_of_<type>, if this annotation exists in the query. Normalize the sum of the fitted gene fractions to 1 in each metacell.

  2. Invoke metacells.tools.project.compute_projection_weights() to project each query metacell of the type onto the atlas. Note that even though we only look at query metacells (tentatively) assigned the type, their projection on the atlas may use metacells of any type.

  3. Invoke metacells.tools.quality.compute_projected_folds() to compute the significant fold factors between the query and its projection.

  4. Identify type-specific misfit genes whose fold factor is above project_max_projection_fold_factor. If consider_atlas_noisy_genes and/or consider_query_noisy_genes, then any gene listed in either is allowed an additional project_max_projection_noisy_fold_factor. If any gene has such a high fold factor in at least misfit_min_metacells_fraction, remove it from the fitted genes mask and repeat steps 6-9.

  5. Invoke metacells.tools.quality.compute_similar_query_metacells() to verify which query metacells ended up being sufficiently similar to their projection, using project_max_consistency_fold_factor (default: 2.0), project_max_projection_noisy_fold_factor (default: 2.0), project_max_misfit_genes (default: 3), and if needed, the the essential_gene_of_<type>. Also compute the correlation between the corrected and projected gene fractions for each metacell.

And then:

  1. Invoke metacells.tools.project.convey_atlas_to_query() to assign an updated projected type to each of the query metacells based on the atlas_type_property_name (default: type). If this changed the type assigned to any query metacell, repeat steps 6-11 (but do this step no more than 3 times).

For each query metacell that ended up being dissimilar, try to project them as a combination of two atlas regions:

  1. Reduce the list of fitted genes for the metacell based on the ignored_gene_of_<type> for both the primary (initially from the above) query metacell type and the secondary query metacell type (initially empty). Normalize the sum of the gene fractions in the metacell to 1.

  2. Invoke metacells.tools.project.compute_projection_weights() just for this metacell, allowing the projection to use a secondary location in the atlas based on the residuals of the atlas metacells relative to the primary query projection.

  3. Invoke metacells.tools.project.convey_atlas_to_query() twice, once for the weights of the primary location and once for the weights of the secondary location, to obtain a primary and secondary type for the query metacell. If these have changed, repeat steps 13-14 (but do these steps no more than 3 times; note will will always do them twice as the 1st run will generate some non-empty secondary type).

  4. Invoke metacells.tools.quality.compute_projected_folds() and metacells.tools.quality.compute_similar_query_metacells() to update the projection and evaluation of the query metacell. If it is now similar, then use the results for the metacell; otherwise, keep the original results as they were at the end of step 10.

metacells.pipeline.projection.outliers_projection_pipeline(what: str = '__x__', *, adata: AnnData, odata: AnnData, fold_regularization: float = 1e-05, project_min_significant_gene_umis: int = 40, reproducible: bool) None[source]

Project outliers on an atlas.

Returns

Sets the following in odata:

Per-Observation (Cell) Annotations

atlas_most_similar

For each observation (outlier), the index of the “most similar” atlas metacell.

Per-Variable Per-Observation (Gene-Cell) Annotations

atlas_most_similar_fold

The fold factor between the outlier gene expression and their expression in the most similar atlas metacell (unless the value is too low to be of interest, in which case it will be zero).

Computation Parameters

  1. Invoke metacells.tools.quality.compute_outliers_matches() using the fold_regularization (default: 1e-05) and reproducible.

  2. Invoke metacells.tools.quality.compute_outliers_fold_factors() using the fold_regularization (default: 1e-05).

metacells.pipeline.projection.write_projection_weights(path: str, adata: AnnData, qdata: AnnData, weights: CompressedMatrix) None[source]

Write into the path the weights computed for the projection of the query qdata on the atlas adata.

Since the weights are (very) sparse, we just write them as a CSV file. This is also what MCView expect.

MCView

Compute metacell analysis in preparation for exporting the data to MCView.

metacells.pipeline.mcview.compute_for_mcview(what: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, adata: AnnData, gdata: AnnData, group: str | ndarray | Collection[int] | Collection[float] | PandasSeries = 'metacell', find_metacells_marker_genes: Dict[str, Any] | None = {}, compute_umap_by_markers_2: Dict[str, Any] | None = {}, compute_umap_by_markers_3: Dict[str, Any] | None = {}, compute_outliers_matches: Dict[str, Any] | None = {}, compute_deviant_folds: Dict[str, Any] | None = {}, compute_inner_folds: Dict[str, Any] | None = {}, count_significant_inner_folds: Dict[str, Any] | None = {}, compute_stdev_logs: Dict[str, Any] | None = {}, compute_var_var_similarity: Dict[str, Any] | None = {'bottom': 50, 'top': 50}, random_seed: int) None[source]

Compute metacell analysis in preparation for exporting the data to MCView.

This simply invokes a series of tools (which you can also invoke independently), using the default parameters, except for the group (default: metacell) which allows applying this to any clustering (not only metacells) and the random_seed (non-zero for reproducibility).

If specific tool parameters need to be specified, you can pass them as a dictionary using the specific tool name (e.g., compute_umap_by_markers_2 = dict(spread = 0.5). If this parameter is set to None, running the tool is skipped.

Input

Annotated adata, where the observations are cells and the variables are genes, where what is a per-variable-per-observation matrix or the name of a per-variable-per-observation annotation containing such a matrix.

In addition, gdata is assumed to have one observation for each group, and use the same genes as adata.

Returns

Sets many properties in gdata and some in adata, see below.

Computation Parameters

  1. Use the default parameters for each tool, unless a parameter with the same name provides specific parameter overrides for it. In addition, the special parameters group (default: metacell) and random_seed, and the reproducible flag derived from it (true if the seed is not zero) are automatically passed to all relevant tools.

  2. Compute the “marker” metacell genes using metacells.tools.high.find_metacells_marker_genes().

  3. Computes UMAP projections by invoking metacells.pipeline.umap.compute_umap_by_markers(). This is done twice, once with dimensions=2 for visualization and once with dimensions=3 to capture more of the manifold structure (used to automatically generate cluster colors). Therefore in this case there are two dictionary parameters compute_umap_by_markers_2 and compute_umap_by_markers_3.

  4. Compute for each outlier cell the “most similar” metacecell for it using metacells.tools.quality.compute_outliers_matches().

  5. Compute for each gene and cell the fold factor between the gene’s expression and that of the metacell it belongs to (or, for outliers, the one it is most similar to) using metacells.tools.quality.compute_deviant_folds().

  6. Compute for each gene for each metacell the maximal of the above using metacells.tools.quality.compute_inner_folds().

  7. Compute for each gene the number of metacells where the above is high using metacells.tools.quality.count_significant_inner_folds().

  8. Compute for each gene for each metacell the standard deviation of the log (base 2) of the fractions of each gene across the cells of the metacell using metacells.tools.quality.compute_stdev_logs().

  9. Compute the gene-gene (variable-variable) similarity matrix. Note by default this will use {‘top’: 50, ‘bottom’: 50} which aren’t the normal defaults for compute_var_var_similarity, in order to keep just the top correlated genes and bottom (anti-)correlated genes for each gene. Otherwise you will get a dense matrix of ~X0K by ~X0K entries, which typically isn’t what you want.