Utilities

Generic utilities used by the metacells code.

Arguably all (or at least most) of these belong in more general package(s).

All the functions included here are exported under metacells.ut.

Annotation

In general we are using AnnData to hold the data being analyzed. However, the interface of AnnData leaves some things out which are crucial for the proper working of our algorithm (and any other algorithm that works at a scale of millions of cells).

X as an Annotation

For a uniform interface, we pretend the X member is a per-variable-per-observation annotation with the special name __x__. This allows us to have APIs that take an annotation name and pass them (typically by default) the special name __x__ to force the code to run on the X data member.

In general the APIs allow specifying either annotation names or alternatively an explicit matrix (or vector for per-observation or per-variable annotations), for maximal usage flexibility.

Data Types

The generic AnnData is cheerfully permissive when it comes to the data it contains. That is, when accessing data, it isn’t clear whether you’ll be getting a numpy array or a pandas data series, and for 2D data you might be getting all sorts of data types (including sparse matrices of various formats).

Python itself is very loose about the interface these data types provide - a few operations, such as len, shape, and accessing an element by integer indices, are safe; more advanced operations can silently produce wrong results; and most operations work only on a subset of the data types, often with wildly incompatible interfaces.

To combat this, we have the metacells.utilities.typing module, which imposes some order on the type zoo; in addition, the accessor functions provided here return deterministic, usable data types, allowing for safe processing of the results. This is combined with the metacells.utilities.computation module, which provides a set of operations that work consistently on the few data types we use.

Data Layout

A related issue is the layout of 2D data. For small matrices, this doesn’t matter, but when dealing with large matrices (millions of rows/columns), performing a simple operation may take orders of magnitude longer if applied to a matrix of the wrong layout.

To make things worse, the builtin functions for converting between matrix layouts are fairly inefficient, so more efficient variants are provided in the metacells.utilities.computation module.

The accessors in this module allow for explicitly controlling the layout of the data they return, and cache the different layouts of the same annotations of the AnnData (under the reasonable assumption that the original data is not modified). This allows for writing guaranteed-to-be-efficient processing code.
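In numpy terms, the two layouts correspond to C (row-major) and Fortran (column-major) ordering. A minimal sketch, independent of the metacells API:

```python
import numpy as np

matrix = np.zeros((1000, 2000))  # numpy allocates in C (row-major) order by default
assert matrix.flags.c_contiguous

column_major = np.asfortranarray(matrix)  # a column-major copy of the same data
assert column_major.flags.f_contiguous

# Reducing along the "wrong" axis walks memory with a large stride, which is
# why the layout matters so much for matrices of this size and larger.
row_sums = matrix.sum(axis=1)     # efficient on the row-major matrix
column_sums = column_major.sum(axis=0)  # efficient on the column-major copy
```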

Data Logging

A side benefit of exclusively using the accessors provided here is that they participate in the automated logging provided by the metacells.utilities.logging module. That is, using them will automatically log writing the final results of a computation to the user at the INFO log level, while higher logging levels give insight into the exact data being read and written by the algorithm’s nested sub-steps.

metacells.utilities.annotation.slice(adata: AnnData, *, name: str | None = None, obs: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, vars: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, track_obs: str | None = None, track_var: str | None = None, share_derived: bool = True, top_level: bool = True) AnnData[source]

Return new annotated data which includes a subset of the full adata.

If name is not specified, the data will be unnamed. Otherwise, if it starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.

If obs and/or vars are specified, they should be set to either a boolean mask or a collection of indices to include in the data slice. In the case of an indices array, the indices are assumed to be unique and sorted, that is, their effect is the same as that of a boolean mask.
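The equivalence assumed between a boolean mask and a sorted, unique index array can be illustrated with plain numpy indexing (a sketch independent of the metacells API):

```python
import numpy as np

data = np.arange(10) * 10

# A boolean mask and a sorted, unique index array select the same elements,
# in the same order -- which is exactly what slice() assumes about obs/vars.
mask = np.zeros(10, dtype=bool)
mask[[2, 5, 7]] = True
indices = np.array([2, 5, 7])

assert np.array_equal(data[mask], data[indices])
```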

If track_obs and/or track_var are specified, the result slice will include a per-observation and/or per-variable annotation containing the indices of the sliced elements in the original full data.

If the slice happens to be the full original data, then this becomes equivalent to copy_adata(), and by default this will share_derived (share the derived data cache).

metacells.utilities.annotation.copy_adata(adata: AnnData, *, name: str | None = None, share_derived: bool = True, top_level: bool = True) AnnData[source]

Return a copy of some annotated adata.

If name is not specified, the data will be unnamed. Otherwise, if it starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.

If share_derived is True (the default), then the copy will share the derived data cache, which contains specific layout variants of matrix data and sums of columns/rows of matrix data. Set share_derived to False if you intend to modify the copy in-place.

Note

In general we assume annotated data is not modified in-place, but it might make sense to create a copy (not sharing derived data), modify it immediately (before accessing data in a specific layout), and then proceed to process it without further modifications.

metacells.utilities.annotation.set_name(adata: AnnData, name: str | None) None[source]

Set the name of the data (for log messages).

If the name starts with . it is appended to the current name, if any.

metacells.utilities.annotation.get_name(adata: AnnData, default: str | None = None) str | None[source]

Return the name of the data (for log messages), if any.

If no name was set, returns the default.

metacells.utilities.annotation.set_m_data(adata: AnnData, name: str, data: Any, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set unstructured data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_m_data(adata: AnnData, name: str, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Get metadata (unstructured annotation) in adata by its name.

metacells.utilities.annotation.set_o_data(adata: AnnData, name: str, data: ndarray, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-observation (cell) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_o_series(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) PandasSeries[source]

Get per-observation (cell) data in adata by its name as a pandas series.

If name is a string, it is the name of a per-observation annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.

metacells.utilities.annotation.get_o_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray[source]

Get per-observation (cell) data in adata by its name as a Numpy array.

If name is a string, it is the name of a per-observation annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.

If sum is True, then name should be the name of a per-observation-per-variable annotation, or a matrix, and this will return the sum (per row) of this data.

metacells.utilities.annotation.get_o_names(adata: AnnData) ndarray[source]

Get a numpy vector of observation names.

metacells.utilities.annotation.maybe_o_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries | None, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray | None[source]

Similar to get_o_numpy(), but if name is None, return None.

metacells.utilities.annotation.set_v_data(adata: AnnData, name: str, data: ndarray, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-variable (gene) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_v_series(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) PandasSeries[source]

Get per-variable (gene) data in adata by its name as a pandas series.

If name is a string, it is the name of a per-variable annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.

metacells.utilities.annotation.get_v_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray[source]

Get per-variable (gene) data in adata by its name as a numpy array.

If name is a string, it is the name of a per-variable annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.

metacells.utilities.annotation.get_v_names(adata: AnnData) ndarray[source]

Get a numpy vector of variable names.

metacells.utilities.annotation.maybe_v_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries | None, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray | None[source]

Similar to get_v_numpy(), but if name is None, return None.

metacells.utilities.annotation.set_oo_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-observation-per-observation (cell) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_oo_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame[source]

Get per-observation-per-observation (per-cell-per-cell) data as a pandas data frame.

If name is a string, it is the name of a per-observation-per-observation annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note

Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.

metacells.utilities.annotation.get_oo_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix[source]

Same as get_oo_data but returns a metacells.utilities.typing.ProperMatrix.

metacells.utilities.annotation.set_vv_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-variable-per-variable (gene) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_vv_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame[source]

Get per-variable-per-variable (per-gene-per-gene) data as a pandas data frame.

If name is a string, it is the name of a per-variable-per-variable annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note

Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.

metacells.utilities.annotation.get_vv_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix[source]

Same as get_vv_data but returns a metacells.utilities.typing.ProperMatrix.

metacells.utilities.annotation.set_oa_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-observation-per-any (cell) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_oa_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns: Collection | None, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame[source]

Get per-observation-per-any (per-cell-per-any) data as a pandas data frame.

Rows are observations (cells), indexed by the observation names (typically cell barcode). Columns are “something” - specify columns to specify an index.

If name is a string, it is the name of a per-observation-per-any annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note

Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.

metacells.utilities.annotation.get_oa_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix[source]

Same as get_oa_data but returns a metacells.utilities.typing.ProperMatrix.

metacells.utilities.annotation.set_va_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-variable-per-any (gene) data.

If formatter is specified, its result is used when logging the operation.

metacells.utilities.annotation.get_va_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns: Collection | None, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame[source]

Get per-variable-per-any (per-gene-per-any) data as a pandas data frame.

Rows are variables (genes), indexed by their names. Columns are “something” - specify columns to specify an index.

If name is a string, it is the name of a per-variable-per-any annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note

Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.

metacells.utilities.annotation.get_va_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix[source]

Same as get_va_data but returns a metacells.utilities.typing.ProperMatrix.

metacells.utilities.annotation.set_vo_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any[source]

Set per-variable-per-observation (per-gene-per-cell) data.

metacells.utilities.annotation.get_vo_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame[source]

Get per-variable-per-observation (per-gene-per-cell) data as a pandas data frame.

Rows are observations (cells), indexed by the observation names (typically cell barcode). Columns are variables (genes), indexed by their names.

If name is a string, it is the name of a per-variable-per-observation annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note

Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.

metacells.utilities.annotation.get_vo_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix[source]

Same as get_vo_data but returns a metacells.utilities.typing.ProperMatrix.

metacells.utilities.annotation.has_data(adata: AnnData, name: str, layout: str | None = None) bool[source]

Test whether we have the specified data.

If the data is per-variable-per-observation, and layout is specified (one of row_major or column_major), this returns whether the specific data layout is available in the cache, that is, whether it can be fetched without having to re-layout existing data.

Computation

Most of the functions defined here are thin wrappers around builtin numpy or scipy functions, with others wrapping C++ extensions provided as part of the metacells package itself.

The key distinction of the functions here is that they provide a uniform interface for all the supported metacells.utilities.typing.Matrix and metacells.utilities.typing.Vector types, which makes them safe to use in our code without worrying about the exact data type used. In theory, Python duck-typing should have provided this out of the box, but it seems that without explicit types and interfaces, the interfaces of the different types diverge to the point where this just doesn’t work.

All the functions here (optionally) also allow collecting timing information using metacells.utilities.timing, to make it easier to locate the performance bottleneck of the analysis pipeline.

metacells.utilities.computation.allow_inefficient_layout(allow: bool) bool[source]

Specify whether to allow processing using an inefficient layout.

Returns the previous setting.

This is True by default, which merely warns when an inefficient layout is used. Otherwise, processing an inefficient layout is treated as an error (raises an exception).

metacells.utilities.computation.to_layout(matrix: CompressedMatrix, layout: str, *, symmetric: bool = False) CompressedMatrix[source]
metacells.utilities.computation.to_layout(matrix: ndarray, layout: str, *, symmetric: bool = False) ndarray
metacells.utilities.computation.to_layout(matrix: PandasFrame | SparseMatrix, layout: str, *, symmetric: bool = False) ndarray | CompressedMatrix

Return the matrix in a specific layout for efficient processing.

That is, if layout is column_major, re-layout the matrix for efficient per-column (variable, gene) slicing/processing. For sparse matrices, this is csc format; for dense matrices, this is Fortran (column-major) format.

Similarly, if layout is row_major, re-layout the matrix for efficient per-row (observation, cell) slicing/processing. For sparse matrices, this is csr format; for dense matrices, this is C (row-major) format.

If the matrix is already in the correct layout, it is returned as-is.

If the matrix is symmetric (default: False), it must be square and is assumed to be equal to its own transpose. This allows converting it from one layout to another using the efficient (essentially zero-cost) transpose operation.

Otherwise, a new copy is created in the proper layout. This is a costly operation, as it needs to move every data element to its proper place; a C++ extension is used to deal with compressed data (the builtin implementation is much slower). Still, the relayout makes the following processing much more efficient, so it is typically a net performance gain overall.
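The layout pairs and the symmetric-transpose trick can be sketched directly with scipy (this uses scipy's builtin relayout, not the package's faster C++ extension):

```python
import numpy as np
from scipy import sparse

matrix = sparse.random(100, 100, density=0.1, format='csr', random_state=0)
assert matrix.format == 'csr'        # row-major sparse layout

# The builtin relayout copies all the data into the other format:
column_major = matrix.tocsc()
assert column_major.format == 'csc'  # column-major sparse layout

# For a symmetric matrix, transposing is essentially free and flips the
# layout without moving any data elements:
symmetric = (matrix + matrix.T).tocsr()
transposed = symmetric.T
assert transposed.format == 'csc'
```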

metacells.utilities.computation.sort_compressed_indices(matrix: CompressedMatrix, force: bool = False) None[source]

Efficient parallel sort of indices in a CSR/CSC matrix.

This will skip sorting a matrix that is marked as sorted, unless force is specified.
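scipy's builtin (serial) sort shows the "skip if already marked as sorted" behavior that the force flag overrides; a minimal illustration:

```python
from scipy import sparse

matrix = sparse.random(10, 10, density=0.3, format='csr', random_state=0)

matrix.has_sorted_indices = False  # mark the indices as possibly unsorted
matrix.sort_indices()              # scipy's serial equivalent of this function
assert matrix.has_sorted_indices   # the matrix is now marked as sorted
```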

metacells.utilities.computation.corrcoef(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, reproducible: bool) ndarray[source]

Similar to numpy.corrcoef, but also works for a sparse matrix, and can be made reproducible regardless of the number of cores used (at the cost of some slowdown). It only works for matrices with a float or double element data type.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes - an all-zero row has no data so we declare it to be “not similar” to anything else.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

Note

The result is always dense, as even for sparse data, the correlation is rarely exactly zero.
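The difference from numpy for constant rows can be approximated with plain numpy (a sketch, not the package's implementation, which handles this internally):

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [0.0, 0.0, 0.0]])     # an all-zero (constant) row

with np.errstate(invalid='ignore', divide='ignore'):
    correlations = np.corrcoef(matrix)   # numpy yields NaN for the zero row

# metacells instead reports a zero correlation for such rows; replacing the
# NaNs approximates that behavior:
correlations = np.nan_to_num(correlations)

assert correlations[0, 0] == 1.0  # normal rows correlate with themselves
assert correlations[0, 1] == 0.0  # the zero row is "not similar" to anything
```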

metacells.utilities.computation.cross_corrcoef_rows(first_matrix: ndarray, second_matrix: ndarray, *, reproducible: bool) ndarray[source]

Similar to numpy.corrcoef, but computes the correlations between each row of the first_matrix and each row of the second_matrix. The result matrix contains one row per row of the first matrix and one column per row of the second matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and contain the same number of columns.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes - an all-zero row has no data so we declare it to be “not similar” to anything else.

Note

This only works for floating-point matrices.

metacells.utilities.computation.pairs_corrcoef_rows(first_matrix: ndarray, second_matrix: ndarray, *, reproducible: bool) ndarray[source]

Similar to numpy.corrcoef, but computes the correlations between each row of the first_matrix and each matching row of the second_matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and the same shape.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes - an all-zero row has no data so we declare it to be “not similar” to anything else.

Note

This only works for floating-point matrices.

metacells.utilities.computation.logistics(matrix: ndarray, *, location: float, slope: float, per: str | None) ndarray[source]

Compute a matrix of distances between each pair of rows in a dense (float or double) matrix using the logistics function.

The raw value of the logistics distance between a pair of vectors x and y is the mean of 1/(1+exp(-slope*(abs(x[i]-y[i])-location))). This has a minimum of 1/(1+exp(slope*location)) for identical vectors and an (asymptotic) maximum of 1. We normalize this to a range between 0 and 1, to be useful as a distance measure (with a zero distance between identical vectors).

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

Note

The result is always dense, as even for sparse data, the result is rarely exactly zero.
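The raw formula and its normalization can be written out for a single pair of vectors. This is a hypothetical helper (`logistics_distance` is not part of the package), and the exact rescaling used internally may differ; the sketch maps the raw minimum to a distance of 0:

```python
import numpy as np

def logistics_distance(x, y, *, location, slope):
    # The raw mean of the logistic of absolute differences (formula above).
    raw = np.mean(1.0 / (1.0 + np.exp(-slope * (np.abs(x - y) - location))))
    # Assumed normalization: rescale so identical vectors get distance 0
    # while the asymptotic maximum stays at 1.
    low = 1.0 / (1.0 + np.exp(slope * location))
    return (raw - low) / (1.0 - low)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])
assert abs(logistics_distance(x, x, location=0.8, slope=0.5)) < 1e-12
assert 0.0 < logistics_distance(x, y, location=0.8, slope=0.5) < 1.0
```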

metacells.utilities.computation.cross_logistics_rows(first_matrix: ndarray, second_matrix: ndarray, *, location: float, slope: float) ndarray[source]

Similar to logistics(), but computes the distances between each row of the first_matrix and each row of the second_matrix. The result matrix contains one row per row of the first matrix and one column per row of the second matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and contain the same number of columns.

metacells.utilities.computation.pairs_logistics_rows(first_matrix: ndarray, second_matrix: ndarray, *, location: float, slope: float) ndarray[source]

Similar to logistics(), but computes the distances between each row of the first_matrix and each matching row of the second_matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and the same shape.

metacells.utilities.computation.log_data(shaped: S, *, base: float | None = None, normalization: float = 0) S[source]

Return the log of the values in the shaped data.

If base is specified (default: None), use this base log. Otherwise, use the natural logarithm.

The normalization (default: 0) specifies how to deal with zeros in the data:

  • If it is zero, an input zero will become an output NaN.

  • If it is positive, it is added to the input before computing the log.

  • If it is negative, input zeros will become log(minimal positive value) + normalization, that is, the zeros will be given a value this much smaller than the minimal “real” log value.

Note

The result is always dense, as even for sparse data, the log is rarely zero.
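The three normalization rules can be reproduced with plain numpy. This is an illustrative sketch (`log_data_sketch` is a hypothetical name, not the package's implementation, which also handles sparse data):

```python
import numpy as np

def log_data_sketch(values, base=None, normalization=0.0):
    # Illustrative reimplementation of the documented normalization rules
    # for dense data only.
    values = np.asarray(values, dtype=float)
    if normalization > 0:
        # Positive: add the normalization before taking the log.
        result = np.log(values + normalization)
    elif normalization == 0:
        # Zero: input zeros become output NaN.
        with np.errstate(divide='ignore'):
            result = np.log(values)
        result[values == 0] = np.nan
    else:
        # Negative: zeros become log(minimal positive value) + normalization
        # (assumes at least one positive value exists).
        minimal = values[values > 0].min()
        result = np.log(np.where(values == 0, minimal, values))
        result[values == 0] += normalization
    if base is not None:
        result /= np.log(base)  # change of base from the natural log
    return result
```

For example, `log_data_sketch([0.0, 1.0, 2.0], base=2, normalization=1.0)` yields log2 of the values plus one, so the zero maps to 0.0.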

metacells.utilities.computation.median_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the median value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.mean_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the mean value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.nanmean_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the mean value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.geomean_per(matrix: ndarray, *, per: str | None) ndarray[source]

Compute the geometric mean value per (row or column) of some (dense) matrix (of non-zero values).

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.max_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the maximal value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.nanmax_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the maximal value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.min_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the minimal value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.nanmin_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the minimal value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.nnz_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the number of non-zero values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

Note

If given a sparse matrix, this returns the number of structural non-zeros, that is, the number of entries we actually store data for, even if this data is zero. Use metacells.utilities.typing.eliminate_zeros() if you suspect the sparse matrix of containing structural zero data values.
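The structural-versus-value distinction is easy to demonstrate with scipy's own eliminate_zeros (shown here in place of the metacells.utilities.typing helper):

```python
import numpy as np
from scipy import sparse

matrix = sparse.csr_matrix(np.array([[1.0, 0.0],
                                     [2.0, 3.0]]))
matrix.data[0] = 0.0      # a stored ("structural") entry now holds a zero value
assert matrix.nnz == 3    # nnz still counts the stored zero

matrix.eliminate_zeros()  # drop stored entries whose value is zero
assert matrix.nnz == 2
```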

metacells.utilities.computation.sum_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the total of the values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.sum_squared_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Compute the total of the squared values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.rank_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, rank: int, *, per: str | None) ndarray[source]

Get the rank element per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.top_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, top: int, *, per: str | None, ranks: bool = False) CompressedMatrix[source]

Get the top elements per (row or column) of some matrix, as a compressed per-major matrix.

If ranks (default: False) is set, fill the result with the rank of each element; otherwise, keep the original value.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.prune_per(compressed: CompressedMatrix, top: int) CompressedMatrix[source]

Keep just the top elements of some compressed matrix, per row for CSR and per column for CSC.

metacells.utilities.computation.quantile_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, quantile: float, *, per: str | None) ndarray[source]

Get the quantile element per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.nanquantile_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, quantile: float, *, per: str | None) ndarray[source]

Get the quantile element per (row or column) of some matrix, ignoring NaN values.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.scale_by(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, scale: ndarray | Collection[int] | Collection[float] | PandasSeries, *, by: str) ndarray | CompressedMatrix[source]

Return a matrix where each by (row or column) is scaled by the matching value of the vector.

metacells.utilities.computation.fraction_by(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, sums: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, by: str) ndarray | CompressedMatrix[source]

Return a matrix containing, in each entry, the fraction of the original data out of the total by (row or column).

That is, each by (row or column) of the result will sum to 1. However, if sums is specified, it is used instead of the actual sum of each by, so the results may sum to something else.

Note

This assumes all the data values are non-negative.
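For a dense matrix, the computation amounts to dividing each row by its total; a minimal numpy sketch (illustrative of fraction_by with by="row", not the metacells implementation):

```python
import numpy as np

# Divide each row by its total so every row of the result sums to 1.
matrix = np.array([[1.0, 3.0], [2.0, 2.0]])
fractions = matrix / matrix.sum(axis=1, keepdims=True)
print(fractions)  # [[0.25 0.75], [0.5 0.5]]
```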

metacells.utilities.computation.fraction_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Get the fraction per (row or column) out of the total of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.stdev_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Get the standard deviation per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray[source]

Get the variance per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.normalized_variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, zero_value: float = 1.0) ndarray[source]

Get the normalized variance (variance / mean) per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

If all the values are zero, writes the zero_value (default: 1.0) into the result.
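A minimal dense-only sketch of this computation (the real function also handles sparse and pandas inputs; for non-negative data, a zero mean implies all values in the row are zero):

```python
import numpy as np

# Normalized variance (variance / mean) per row; rows whose mean is zero
# get zero_value, mirroring the documented default of 1.0.
def normalized_variance_rows(matrix: np.ndarray, zero_value: float = 1.0) -> np.ndarray:
    means = matrix.mean(axis=1)
    variances = matrix.var(axis=1)
    result = np.full(matrix.shape[0], zero_value)
    nonzero = means != 0
    result[nonzero] = variances[nonzero] / means[nonzero]
    return result

matrix = np.array([[1.0, 3.0], [0.0, 0.0]])
print(normalized_variance_rows(matrix))  # [0.5 1. ]
```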

metacells.utilities.computation.relative_variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, window_size: int) ndarray[source]

Return the log2(normalized_variance) - median(log2(normalized_variance_of_similar)) of the values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

metacells.utilities.computation.sum_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the sum of all the values in a matrix.

metacells.utilities.computation.nnz_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the number of non-zero entries in a matrix.

Note

If given a sparse matrix, this returns the number of structural non-zeros, that is, the number of entries we actually store data for, even if this data is zero. Use metacells.utilities.typing.eliminate_zeros() if you suspect the sparse matrix of containing structural zero data values.

metacells.utilities.computation.mean_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the mean of all the values in a matrix.

metacells.utilities.computation.max_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the maximum of all the values in a matrix.

metacells.utilities.computation.min_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the minimum of all the values in a matrix.

metacells.utilities.computation.nanmean_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the mean of all the non-NaN values in a matrix.

metacells.utilities.computation.nanmax_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the maximum of all the non-NaN values in a matrix.

metacells.utilities.computation.nanmin_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any[source]

Compute the minimum of all the non-NaN values in a matrix.

metacells.utilities.computation.rank_matrix_by_layout(matrix: ndarray, ascending: bool) Any[source]

Replace each element of the matrix with its rank (in row for row_major, in column for column_major).

If ascending then rank 1 is the minimal element. Otherwise, rank 1 is the maximal element.
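For a dense row-major matrix with ascending ranks, scipy offers an equivalent computation (illustrative only; the metacells version is layout-aware and works in-place):

```python
import numpy as np
from scipy.stats import rankdata

# Rank each row independently; with ascending ranks, rank 1 is the
# minimal element of the row (negate the matrix for descending ranks).
matrix = np.array([[3.0, 1.0, 2.0], [10.0, 30.0, 20.0]])
ranks = rankdata(matrix, axis=1, method="ordinal")
print(ranks)  # [[3 1 2], [1 3 2]]
```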

metacells.utilities.computation.bincount_vector(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, minlength: int = 0) ndarray[source]

Drop-in replacement for numpy.bincount, which is timed and works for any vector data.

metacells.utilities.computation.most_frequent(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any[source]

Return the most frequent value in a vector.

This is useful for metacells.tools.convey.convey_obs_to_group().

metacells.utilities.computation.strongest(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any[source]

Return the strongest (maximal absolute) value in a vector.

This is useful for metacells.tools.convey.convey_obs_to_group().

metacells.utilities.computation.highest_weight(weights: ndarray | Collection[int] | Collection[float] | PandasSeries, vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any[source]

Return the value with the highest total weights in a vector.

This is useful for metacells.tools.project.convey_atlas_to_query().

metacells.utilities.computation.weighted_mean(weights: ndarray | Collection[int] | Collection[float] | PandasSeries, vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any[source]

Return the weighted mean (using the weights and the values in the vector).

This is useful for metacells.tools.project.convey_atlas_to_query().

metacells.utilities.computation.fraction_of_grouped(value: Any) Callable[[ndarray | Collection[int] | Collection[float] | PandasSeries], Any][source]

Return a function, that takes a vector and returns the fraction of elements of the vector which are equal to a specific value.

metacells.utilities.computation.downsample_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str, samples: int | ndarray | Collection[int] | Collection[float] | PandasSeries, eliminate_zeros: bool = True, inplace: bool = False, random_seed: int) ndarray | CompressedMatrix[source]

Downsample the data per (one of row or column) such that the sum of each one becomes samples.

If the matrix is sparse, and eliminate_zeros (default: True), then perform a final phase of eliminating leftover zero values from the compressed format. This means the result will be in “canonical format” so further scipy sparse operations on it will be faster.

If inplace (default: False), modify the matrix in-place, otherwise, return a modified copy.

A non-zero random_seed will make the operation replicable.

metacells.utilities.computation.downsample_vector(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, samples: int, *, output: ndarray | None = None, random_seed: int) None[source]

Downsample a vector of sample counters.

Input

  • A numpy vector containing non-negative integer sample counts.

  • A desired total number of samples.

  • An optional numpy array output to hold the results (otherwise, the input is overwritten).

  • A random_seed (non-zero for reproducible results).

The arrays may have any of the data types: float32, float64, int32, int64, uint32, uint64.

Operation

If the total number of samples (sum of the array) is not higher than the required number of samples, the output is identical to the input.

Otherwise, treat the input as if it was a set where each index appeared its value number of times. Randomly select the desired number of samples from this set (without repetition), and store in the output the number of times each index was chosen.
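The operation above can be sketched directly in numpy (the real implementation is a tuned C++ extension; this is only a conceptual illustration):

```python
import numpy as np

# Expand the counts into a multiset of indices, sample without
# replacement, and count how many times each index was chosen.
def downsample_counts(counts: np.ndarray, samples: int, random_seed: int) -> np.ndarray:
    if counts.sum() <= samples:
        return counts.copy()  # not enough samples: output equals input
    rng = np.random.default_rng(random_seed)
    pool = np.repeat(np.arange(len(counts)), counts)
    chosen = rng.choice(pool, size=samples, replace=False)
    return np.bincount(chosen, minlength=len(counts))

counts = np.array([4, 0, 6])
result = downsample_counts(counts, 5, random_seed=123456)
print(result.sum())  # 5
```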

metacells.utilities.computation.matrix_rows_folds_and_aurocs(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns_subset: ndarray, columns_scale: ndarray | None = None, normalization: float) Tuple[ndarray, ndarray][source]

Given a matrix and a subset of the columns, return two vectors. The first contains, for each row, the mean column value in the subset divided by the mean column value outside the subset. The second contains for each row the area under the receiver operating characteristic (AUROC) for the row, that is, the probability that a random column in the subset would have a higher value in this row than a random column outside the subset.

If columns_scale is specified, the data is divided by this scale before computing the AUROC.

metacells.utilities.computation.sliding_window_function(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, function: str, window_size: int, order_by: ndarray | None = None) ndarray[source]

Return a vector of the same size as the input vector, where each entry is the result of applying the function (one of mean, median, std, var) to a sliding window of size window_size centered on the entry.

If order_by is specified, the vector is first sorted by this order, and the end result is unsorted back to the original order. That is, the sliding window centered at each position will contain the window_size of entries which have the nearest order_by values to the center entry.

Note

The window size should be an odd positive integer. If an even value is specified, it is automatically increased by one.
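A minimal sketch of a centered sliding-window mean with an odd window size, padding at the edges (the real implementation may treat the boundaries differently):

```python
import numpy as np

# Centered sliding mean: each output entry averages the window_size
# entries around it, repeating edge values for out-of-range positions.
def sliding_mean(vector: np.ndarray, window_size: int) -> np.ndarray:
    assert window_size % 2 == 1, "window size must be odd"
    half = window_size // 2
    padded = np.pad(vector.astype(float), half, mode="edge")
    kernel = np.ones(window_size) / window_size
    return np.convolve(padded, kernel, mode="valid")

print(sliding_mean(np.array([1.0, 2.0, 3.0, 4.0]), 3))
```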

metacells.utilities.computation.patterns_matches(patterns: str | Pattern | Collection[str | Pattern], strings: Collection[str], invert: bool = False) ndarray[source]

Given a collection of (case-insensitive) strings, return a boolean mask specifying which of them match the given regular expression patterns.

If invert (default: False), invert the mask.
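The mask construction can be sketched with the standard re module (the gene-like names below are made up for illustration):

```python
import re
import numpy as np

# Build a boolean mask of which (case-insensitive) strings fully match
# any of the given regular-expression patterns.
patterns = [re.compile("mt-.*", re.IGNORECASE), re.compile("rp[ls].*", re.IGNORECASE)]
strings = ["MT-CO1", "RPL3", "FOXP3"]
mask = np.array([any(pattern.fullmatch(string) for pattern in patterns)
                 for string in strings])
print(mask)  # [ True  True False]
```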

metacells.utilities.computation.compress_indices(indices: ndarray | Collection[int] | Collection[float] | PandasSeries) ndarray[source]

Given a vector of group indices per element, return a vector where the group indices are consecutive.

If the group indices contain -1 (“outliers”), then it is preserved as -1 in the result.
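A compact numpy sketch of this compression (illustrative equivalent, not the metacells implementation):

```python
import numpy as np

# Map arbitrary non-negative group indices to consecutive 0..N-1 indices,
# preserving -1 ("outlier") entries as-is.
def compress(groups: np.ndarray) -> np.ndarray:
    result = np.full(len(groups), -1)
    grouped = groups >= 0
    _, consecutive = np.unique(groups[grouped], return_inverse=True)
    result[grouped] = consecutive
    return result

print(compress(np.array([3, -1, 7, 3])))  # [ 0 -1  1  0]
```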

metacells.utilities.computation.bin_pack(element_sizes: ndarray | Collection[int] | Collection[float] | PandasSeries, max_bin_size: float) ndarray[source]

Given a vector of element_sizes, return a vector containing the bin number for each element, such that the total size of each bin is at most, and as close as possible to, the max_bin_size.

This uses the first-fit decreasing algorithm to find an initial solution, then moves elements around to minimize the L2 norm of the wasted space in each bin.
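A sketch of the first-fit-decreasing initial phase (without the later refinement pass mentioned above):

```python
import numpy as np

# First-fit decreasing: place elements from largest to smallest, each
# into the first bin with enough remaining room, opening bins as needed.
def first_fit_decreasing(sizes: np.ndarray, max_bin_size: float) -> np.ndarray:
    order = np.argsort(sizes)[::-1]  # largest elements first
    bins = np.full(len(sizes), -1)
    bin_totals: list = []
    for index in order:
        size = sizes[index]
        for bin_index, total in enumerate(bin_totals):
            if total + size <= max_bin_size:  # first bin with room
                bins[index] = bin_index
                bin_totals[bin_index] += size
                break
        else:  # no existing bin fits; open a new one
            bins[index] = len(bin_totals)
            bin_totals.append(size)
    return bins

print(first_fit_decreasing(np.array([4.0, 3.0, 3.0, 2.0]), 6.0))  # [0 1 1 0]
```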

metacells.utilities.computation.bin_fill(element_sizes: ndarray | Collection[int] | Collection[float] | PandasSeries, min_bin_size: float) ndarray[source]

Given a vector of element_sizes, return a vector containing the bin number for each element, such that the total size of each bin is at least, and as close as possible to, the min_bin_size.

This uses the first-fit decreasing algorithm to find an initial solution, then moves elements around to minimize the L2 norm of the wasted space in each bin.

metacells.utilities.computation.sum_groups(matrix: ndarray | CompressedMatrix, groups: ndarray | Collection[int] | Collection[float] | PandasSeries, *, per: str | None, transform: Callable[[ndarray | CompressedMatrix | PandasFrame | SparseMatrix], ndarray | CompressedMatrix | PandasFrame | SparseMatrix] | None = None) Tuple[ndarray, ndarray] | None[source]

Given a matrix and a vector of groups per column or row, return a matrix with a column or row per group, containing the sum of each group's columns or rows, and a vector of sizes (the number of summed columns or rows) per group.

Negative group indices (“outliers”) are ignored and their data is not included in the result. If there are no non-negative group indices, returns None.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major operating on rows, column_major for operating on columns).

If transform is not None, it is applied to the data before summing it.
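A dense-only sketch of grouped summation per row (illustrative equivalent of sum_groups with per="row", without the transform support):

```python
import numpy as np

# Sum the rows of each group, skipping negative ("outlier") indices;
# return None if no row belongs to any group.
def sum_rows_by_group(matrix: np.ndarray, groups: np.ndarray):
    grouped = groups >= 0
    if not grouped.any():
        return None
    count = groups.max() + 1
    sums = np.zeros((count, matrix.shape[1]))
    np.add.at(sums, groups[grouped], matrix[grouped])
    sizes = np.bincount(groups[grouped], minlength=count)
    return sums, sizes

matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
sums, sizes = sum_rows_by_group(matrix, np.array([0, -1, 0]))
print(sums)   # [[6. 8.]]
print(sizes)  # [2]
```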

metacells.utilities.computation.shuffle_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str, random_seed: int) None[source]

Shuffle (in-place) the matrix data per column or row.

The matrix must be in the appropriate layout (row_major for shuffling data in each row, column_major for shuffling data in each column).

A non-zero random_seed will make the operation replicable.

metacells.utilities.computation.cover_diameter(*, points_count: int, area: float, cover_fraction: float) float[source]

Return the diameter to give to each point so that the total area of points_count will be a cover_fraction of the total area.

metacells.utilities.computation.cover_coordinates(x_coordinates: ndarray | Collection[int] | Collection[float] | PandasSeries, y_coordinates: ndarray | Collection[int] | Collection[float] | PandasSeries, *, cover_fraction: float = 0.3333333333333333, noise_fraction: float = 1.0, random_seed: int) Tuple[ndarray, ndarray][source]

Given x/y coordinates of points, move them so that the total area covered by them is cover_fraction (default: 0.3333333333333333) of the total area of their bounding box, assuming each has the diameter of their minimal distance. The points are jiggled around by the noise_fraction of their minimal distance using the random_seed (non-zero for reproducible results).

Returns new x/y coordinates vectors.

metacells.utilities.computation.random_piles(elements_count: int, target_pile_size: int, *, random_seed: int) ndarray[source]

Split elements_count elements into piles of a size roughly equal to target_pile_size.

Return a vector specifying the pile index of each element.

Specify a non-zero random_seed to make this replicable.

metacells.utilities.computation.represent(goal: ndarray, basis: ndarray) Tuple[float, ndarray] | None[source]

Represent a goal vector as a weighted average of the row vectors of some basis matrix.

This computes a non-negative weight for each matrix row, such that the sum of the weights is 1, minimizing the distance (L2 norm) between the goal vector and the weighted average of the basis vectors. This is a convex quadratic problem subject to a linear constraint, so cvxpy solves it efficiently.

The return value is a tuple with the score of the weights vector, and the weights vector itself.

metacells.utilities.computation.min_cut(weights: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Tuple[Cut, float | None][source]

Find the minimal cut that will split an undirected graph (with a symmetrical weights matrix).

Returns the igraph.Cut object describing the cut, and the scale-invariant strength of the cut edges. This strength is the ratio between the mean weight of an edge connecting a random node in each partition and the mean weight of an edge connecting two random nodes inside a random partition. If either of the partitions contains no edges (e.g. contains a single node), the strength will be None.

metacells.utilities.computation.sparsify_matrix(full: ndarray | CompressedMatrix, min_column_max_value: float, min_entry_value: float, abs_values: bool) CompressedMatrix[source]

Given a full matrix, return a sparse matrix such that all non-zero entries are at least min_entry_value, and columns that have no value above min_column_max_value are set to all-zero. If abs_values, consider the absolute values when comparing to the thresholds.

Typing

The code has to deal with many different alternative data types for what is essentially two basic data types: 2D matrices and 1D vectors.

Specifically, we have pandas data frames and series, Scipy sparse matrices, and numpy multi-dimensional arrays (not to mention the deprecated numpy matrix type).

Python has the great ability to “duck type”, so in an ideal world, we could just pretend these are just two data types and be done. In practice, this is hopelessly broken.

First, even operations that exist for all data types sometimes have different interfaces (as in, np.foo(matrix, ...) vs. matrix.foo(...)).

Second, operating on sparse and dense data often requires completely different code paths.

This makes it very easy to write code that works today and breaks tomorrow, when someone passes a pandas series to a function that expects a numpy array and it almost (but not quite) works correctly (and god help the poor soul that mixes up a numpy matrix with a numpy 2D array, or passes a categorical pandas series to something that expects a series of strings).

“Eternal vigilance is the cost of freedom” - the solution here is to define a bunch of fake types, which are almost entirely for the benefit of the mypy type checker (with some run-time assertions as well).

This not only makes the code intent explicit (“explicit is better than implicit”) but also allows us to leverage mypy to catch errors such as applying a numpy operation on a sparse matrix, etc.

To put some order in this chaos, the following concepts are used:

  • Shaped is any 1d or 2d data in any format we can work with. Matrix is any 2d data, and Vector is any 1d data.

  • For 2D data, we allow multiple data types that we can’t directly operate on: most SparseMatrix layouts, PandasFrame and np.matrix have strange quirks when it comes to directly operating on them and should be avoided, while CSR and CSC CompressedMatrix sparse matrices and properly-laid-out 2D numpy arrays NumpyMatrix are in general well-behaved. We therefore introduce the concept of ProperMatrix vs. ImproperMatrix types, and provide functions that manipulate whether the “proper” data is in row-major or column-major order.

  • For 1D data, we just distinguish between PandasSeries and 1D numpy NumpyVector arrays, as these are the only types we allow. In theory we could have also allowed for sparse vectors but mercifully these are very uncommon so we can just ignore them.

Ironically, now that numpy added type annotations, the usefulness of the type hints added here has decreased, since both NumpyVector and NumpyMatrix are aliases of the same numpy.ndarray type. Perhaps in the future numpy will allow using Annotated types (with an explicit number of dimensions, or even - gasp - the element data type) to allow for more useful type annotations. Or this could all be ported to Julia to avoid this whole mess.

metacells.utilities.typing.CPP_DATA_TYPES = ['float32', 'float64', 'int32', 'int64', 'uint32', 'uint64']

The data types supported by the C++ extensions code.

metacells.utilities.typing.Shaped

Shaped data of any of the types we can deal with.

alias of Union[ndarray, CompressedMatrix, PandasFrame, SparseMatrix, Collection[int], Collection[float], PandasSeries]

metacells.utilities.typing.ProperShaped

Any "proper" 1- or 2-dimensional data.

alias of Union[ndarray, CompressedMatrix]

metacells.utilities.typing.ImproperShaped

Any "improper" 1- or 2-dimensional data.

alias of Union[PandasFrame, SparseMatrix, Collection[int], Collection[float], PandasSeries]

metacells.utilities.typing.Matrix

A mypy type for any 2-dimensional data.

alias of Union[ndarray, CompressedMatrix, PandasFrame, SparseMatrix]

metacells.utilities.typing.ProperMatrix

A mypy type for “proper” 2-dimensional data.

“Proper” data allows for direct processing without having to mess with its formatting.

alias of Union[ndarray, CompressedMatrix]

metacells.utilities.typing.NumpyMatrix

Numpy 2-dimensional data.

Note

This is not to be confused with numpy.matrix, which must not be used but is returned by the occasional function, and which would wreak havoc on the semantics of some operations unless immediately converted to a proper NumpyMatrix, that is, a simple 2-dimensional ndarray.

class metacells.utilities.typing.CompressedMatrix[source]

A mypy type for sparse CSR/CSC 2-dimensional data.

Should have been CompressedMatrix = sp..._cs_matrix.

metacells.utilities.typing.ImproperMatrix

A mypy type for “improper” 2-dimensional data.

“Improper” data contains or can be converted to “proper” data.

alias of Union[PandasFrame, SparseMatrix]

class metacells.utilities.typing.SparseMatrix[source]

A mypy type for sparse 2-dimensional data.

Should have been SparseMatrix = sp.base.spmatrix.

class metacells.utilities.typing.PandasFrame[source]

A mypy type for pandas 2-dimensional data.

Should have been PandasFrame = pd.DataFrame.

metacells.utilities.typing.Vector

A mypy type for any 1-dimensional data.

alias of Union[ndarray, Collection[int], Collection[float], PandasSeries]

metacells.utilities.typing.NumpyVector

Numpy 1-dimensional data.

metacells.utilities.typing.ImproperVector

Any "improper" 1-dimensional data.

alias of Union[Collection[int], Collection[float], PandasSeries]

class metacells.utilities.typing.PandasSeries[source]

A mypy type for pandas 1-dimensional data.

Should have been PandasSeries = pd.Series.

metacells.utilities.typing.is_1d(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool[source]

Test whether the shaped is 1-dimensional.

metacells.utilities.typing.is_2d(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool[source]

Test whether the shaped is 2-dimensional.

metacells.utilities.typing.maybe_numpy_vector(shaped: Any) ndarray | None[source]

Return the shaped as a NumpyVector, if it is one.

metacells.utilities.typing.maybe_numpy_matrix(shaped: Any) ndarray | None[source]

Return the shaped as a NumpyMatrix, if it is one.

Note

This looks for a 2-dimensional numpy.ndarray which is not a numpy.matrix. Do not use numpy.matrix - it is deprecated and behaves subtly differently from a 2-dimensional numpy.ndarray, leading to hard-to-find bugs.
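One concrete example of the danger: the * operator means matrix multiplication for numpy.matrix but element-wise multiplication for a 2D ndarray.

```python
import numpy as np

# Same data, different semantics for the `*` operator.
array = np.array([[1, 2], [3, 4]])
matrix = np.matrix(array)  # deprecated; shown only to illustrate the hazard

print(array * array)    # element-wise: [[1, 4], [9, 16]]
print(matrix * matrix)  # matrix product: [[7, 10], [15, 22]]
print(np.asarray(matrix) * np.asarray(matrix))  # element-wise again
```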

metacells.utilities.typing.maybe_sparse_matrix(shaped: Any) SparseMatrix | None[source]

Return shaped as a SparseMatrix, if it is one.

Note

This will succeed for a CompressedMatrix which is a sub-type of a SparseMatrix.

metacells.utilities.typing.maybe_compressed_matrix(shaped: Any) CompressedMatrix | None[source]

Return shaped as a CompressedMatrix, if it is one.

metacells.utilities.typing.maybe_pandas_frame(shaped: Any) PandasFrame | None[source]

Return shaped as a PandasFrame, if it is one.

metacells.utilities.typing.maybe_pandas_series(shaped: Any) PandasSeries | None[source]

Return shaped as a PandasSeries, if it is one.

metacells.utilities.typing.mustbe_numpy_vector(shaped: Any) ndarray[source]

Return shaped as a NumpyVector, asserting it must be one.

metacells.utilities.typing.mustbe_numpy_matrix(shaped: Any) ndarray[source]

Return shaped as a NumpyMatrix, asserting it must be one.

Note

This looks for a 2-dimensional numpy.ndarray which is not a numpy.matrix. Do not use numpy.matrix - it is deprecated and behaves subtly differently from a 2-dimensional numpy.ndarray, leading to hard-to-find bugs.

metacells.utilities.typing.mustbe_sparse_matrix(shaped: Any) SparseMatrix[source]

Return shaped as a SparseMatrix, asserting it must be one.

Note

This will succeed for a CompressedMatrix which is a sub-type of a SparseMatrix.

metacells.utilities.typing.mustbe_compressed_matrix(shaped: Any) CompressedMatrix[source]

Return shaped as a CompressedMatrix, asserting it must be one.

metacells.utilities.typing.mustbe_pandas_frame(shaped: Any) PandasFrame[source]

Return shaped as a PandasFrame, asserting it must be one.

metacells.utilities.typing.mustbe_pandas_series(shaped: Any) PandasSeries[source]

Return shaped as a PandasSeries, asserting it must be one.

metacells.utilities.typing.to_proper_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major') ndarray | CompressedMatrix[source]

Given some 2D matrix, return it in a ProperMatrix format we can safely process.

If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.

metacells.utilities.typing.to_proper_matrices(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major') Tuple[ndarray | CompressedMatrix, ndarray | None, CompressedMatrix | None][source]

Similar to to_proper_matrix() but returns a tuple with the proper matrix together with its NumpyMatrix representation and its CompressedMatrix representation. Exactly one of these two representations will be None.

If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.

This is used to pick between dense and compressed code paths, and provides typed references so mypy can type-check each of these paths:

proper, dense, compressed = to_proper_matrices(matrix)

... Common code path can use the proper matrix value ...

if dense is not None:
    assert compressed is None
    ... Dense code path can use the dense matrix ...

else:
    assert compressed is not None
    ... Compressed code path can use the compressed matrix value ...

    if metacells.ut.matrix_layout(compressed) == 'row_major':
        ... CSR code path ...
    else:
        ... CSC code path ...

metacells.utilities.typing.to_pandas_series(vector: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, *, index: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None) PandasSeries[source]

Construct a pandas series from any Vector.

metacells.utilities.typing.to_pandas_frame(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | None = None, *, index: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, columns: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None) PandasFrame[source]

Construct a pandas frame from any Matrix.

metacells.utilities.typing.frozen(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) bool[source]

Test whether the shaped data is protected against future modification.

metacells.utilities.typing.freeze(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) None[source]

Protect the shaped data against future modification.

metacells.utilities.typing.unfreeze(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) None[source]

Permit future modification of some shaped data.

metacells.utilities.typing.unfrozen(proper: ndarray | CompressedMatrix) Iterator[None][source]

Execute some in-place modification, temporarily unfreezing the proper shaped data.

metacells.utilities.typing.to_numpy_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major', copy: bool = False, only_extract: bool = False) ndarray[source]

Convert any Matrix to a dense 2-dimensional NumpyMatrix.

If copy (default: False), a copy of the data is returned even if no conversion needed to be done.

If only_extract (default: False), then assert this only extracts the data inside some pandas data.

If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.

metacells.utilities.typing.to_numpy_vector(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, copy: bool = False, only_extract: bool = False) ndarray[source]

Convert any Vector, or a Matrix where one of the dimensions has size one, to a NumpyVector.

If copy (default: False), a copy of the data is returned even if no conversion needed to be done.

If only_extract (default: False), then assert this only extracts the data inside some pandas data.

metacells.utilities.typing.DENSE_FAST_FLAG = {'column_major': 'F_CONTIGUOUS', 'row_major': 'C_CONTIGUOUS'}

Which flag indicates efficient 2D dense matrix layout.

metacells.utilities.typing.SPARSE_FAST_FORMAT = {'column_major': 'csc', 'row_major': 'csr'}

Which format indicates efficient 2D sparse matrix layout.

metacells.utilities.typing.SPARSE_SLOW_FORMAT = {'column_major': 'csr', 'row_major': 'csc'}

Which format indicates inefficient 2D sparse matrix layout.

metacells.utilities.typing.LAYOUT_OF_AXIS = ('row_major', 'column_major')

The layout by the axis parameter.

metacells.utilities.typing.PER_OF_AXIS = ('row', 'column')

When reducing data, get results per row or column (by the axis parameter).

metacells.utilities.typing.shaped_dtype(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) str[source]

Return the data type of the elements of the shaped data.

metacells.utilities.typing.matrix_layout(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) str | None[source]

Return which layout the matrix is arranged by (row_major or column_major).

If the data is in some strange sparse format, returns None.
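A simplified sketch of this classification (an assumption based on the flag and format mappings above, not the package's actual implementation):

```python
import numpy as np
from scipy import sparse

def matrix_layout(matrix):
    # Dense: decided by the contiguity flags; sparse: by the format.
    if sparse.issparse(matrix):
        return {"csr": "row_major", "csc": "column_major"}.get(matrix.format)
    if matrix.flags["C_CONTIGUOUS"]:
        return "row_major"
    if matrix.flags["F_CONTIGUOUS"]:
        return "column_major"
    return None  # some strange layout

assert matrix_layout(np.zeros((2, 3))) == "row_major"
assert matrix_layout(sparse.eye(3, format="csc")) == "column_major"
assert matrix_layout(sparse.eye(3, format="coo")) is None
```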

metacells.utilities.typing.is_layout(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, layout: str | None) bool[source]

Test whether the matrix is arranged according to the layout.

This will always succeed if the layout is None.

metacells.utilities.typing.is_contiguous(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) bool[source]

Return whether the vector is contiguous in memory.

This is only True for a dense vector.

metacells.utilities.typing.to_contiguous(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, copy: bool = False) ndarray[source]

Return the vector in contiguous (dense) format.

If copy (default: False), a copy of the data is returned even if no conversion needed to be done.

metacells.utilities.typing.mustbe_canonical(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) None[source]

Assert that some data is in canonical format.

For numpy matrices or vectors, this means the data is contiguous (for matrices, in either row-major or column-major order).

For sparse matrices, it means the data is in COO format, or compressed (CSC or CSR format), with sorted indices and no duplicates.

In general, we’d like all the data stored in AnnData to be canonical.

metacells.utilities.typing.is_canonical(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool[source]

Return whether the data is in canonical format.

For numpy matrices or vectors, this means the data is contiguous (for matrices, in either row-major or column-major order).

For sparse matrices, it means the data is in COO format, or compressed (CSC or CSR format), with sorted indices and no duplicates.

In general, we’d like all the data stored in AnnData to be canonical.
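For scipy compressed matrices, the properties corresponding to this definition can be checked directly (a sketch; the package may check more than this):

```python
import numpy as np
from scipy import sparse

# A CSR matrix built from a dense array is already canonical:
# sorted indices within each row and no duplicate entries.
csr = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
assert bool(csr.has_sorted_indices)
assert bool(csr.has_canonical_format)
```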

metacells.utilities.typing.eliminate_zeros(compressed: CompressedMatrix) None[source]

Eliminate zeros in a compressed matrix.

metacells.utilities.typing.sort_indices(compressed: CompressedMatrix) None[source]

Ensure the indices are sorted in each row/column.

metacells.utilities.typing.sum_duplicates(compressed: CompressedMatrix) None[source]

Eliminate duplicates in a compressed matrix.
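The scipy methods these three helpers correspond to can be demonstrated directly (an illustration using scipy, on the assumption that the wrappers delegate to these methods):

```python
import numpy as np
from scipy import sparse

# Build a CSR matrix from raw arrays: row 0 stores (0, 2)=3.0 and an
# explicit zero at (0, 1), with column indices out of order in the row.
csr = sparse.csr_matrix(
    (np.array([3.0, 0.0, 2.0]),   # data
     np.array([2, 1, 0]),         # column indices
     np.array([0, 2, 3])),        # row pointers
    shape=(2, 3))

assert not csr.has_sorted_indices
csr.sort_indices()       # scipy equivalent of sort_indices()
csr.eliminate_zeros()    # scipy equivalent of eliminate_zeros()
csr.sum_duplicates()     # scipy equivalent of sum_duplicates() (no-op here)
assert csr.nnz == 2
```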

metacells.utilities.typing.shaped_checksum(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) float[source]

Return a checksum of the contents of shaped data (for debugging reproducibility).

Parallel

Due to the notorious GIL, using multiple Python threads is essentially useless. This leaves us with two options for using multiple processors, which is mandatory for reasonable performance on the large data sets we work on:

  • Use multiple threads in the internal C++ implementation of some Python functions; this is done by both numpy and the C++ extension functions provided by this package, and works even for reasonably small sized work, such as sorting each of the rows of a large matrix.

  • Use Python multi-processing. This is costly and works only for large sized work, such as computing metacells for different piles.

Each of these two approaches works tolerably well on its own, even though both are sub-optimal. The problem starts when we want to combine them. Consider a server with 50 processors. Invoking corrcoef on a large matrix will use them all. This is great if one computes metacells for a single pile. Suppose, however, you want to compute metacells for 50 piles, and do so using multi-processing. Each of the 50 sub-processes will invoke corrcoef, which will spawn 50 internal threads, resulting in the operating system seeing 2500 processes competing for the same 50 hardware processors. “This does not end well.”

You would expect that, two decades after multi-core systems became available, this would have been solved “out of the box” by the parallel frameworks (Python, OpenMP, TBB, etc.) all agreeing to cooperate with each other. However, somehow this isn’t seen as important by the people maintaining these frameworks; in fact, most of them don’t properly handle nested parallelism within their own framework, never mind playing well with others.

So in practice, while languages built for parallelism (such as Julia and Rust) deal well with nested parallel constructs, using a mixture of older serial languages (such as Python and C++) puts us in a swamp, and “you can’t build a castle in a swamp”. In our case, numpy uses some underlying parallel threads framework, our own extensions use OpenMP parallel threads, and we are forced to use the Python multi-processing framework itself on top of both, and each of these frameworks is blind to the others.

As a crude band-aid, we force both whatever-numpy-uses and OpenMP to use a specific number of threads. So, when we use multi-processing, we limit each sub-process to use fewer internal threads, such that the total will be at most 50. This is very sub-optimal, but at least it doesn’t bring the server to its knees trying to deal with a total load of 2500 processes.
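The standard mechanism for such capping (an illustration, not the package's exact code) is to set the relevant environment variables before the numerical libraries are first imported; the variable names below are the ones honored by OpenMP, MKL and OpenBLAS:

```python
import os

# Cap the internal threads each library may spawn. This must happen
# BEFORE importing numpy for the BLAS settings to take effect.
threads_per_process = "4"
os.environ["OMP_NUM_THREADS"] = threads_per_process
os.environ["MKL_NUM_THREADS"] = threads_per_process
os.environ["OPENBLAS_NUM_THREADS"] = threads_per_process
```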

A final twist on all this is that hyper-threading is (worse than) useless for heavy compute threads. We therefore by default only use one thread per physical core. We get the number of physical cores using the psutil package.
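Querying the physical (rather than logical) core count with psutil looks like this (psutil must be installed; cpu_count(logical=False) may return None on exotic platforms):

```python
import psutil

physical = psutil.cpu_count(logical=False)  # ignores hyper-threads
logical = psutil.cpu_count(logical=True)    # includes hyper-threads
# With hyper-threading enabled, logical is typically 2 * physical.
```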

metacells.utilities.parallel.is_main_process() bool[source]

Return whether this is the main process, as opposed to a sub-process spawned by parallel_map().

metacells.utilities.parallel.set_processors_count(processors: int) None[source]

Set the (maximal) number of processors to use in parallel.

The default value of 0 means using all the available physical processors. Note that if hyper-threading is enabled, this would be less than (typically half of) the number of logical processors in the system. This is intentional, as there’s no value - actually, negative value - in running multiple heavy computations on hyper-threads of the same physical processor.

Otherwise, the value is the actual (positive) number of processors to use. Override this by setting the METACELLS_PROCESSORS_COUNT environment variable or by invoking this function from the main thread.

metacells.utilities.parallel.get_processors_count() int[source]

Return the number of processors we are allowed to use.

metacells.utilities.parallel.parallel_map(function: Callable[[int], T], invocations: int, *, max_processors: int = 0, hide_from_progress_bar: bool = False) List[T][source]

Execute function, in parallel, invocations times. Each invocation is given the invocation’s index as its single argument.

For our simple pipelines, only the main process is allowed to execute functions in parallel processes, that is, we do not support nested parallel_map calls.

This uses get_processors_count() processes. If max_processors (default: 0) is zero, use all available processors. Otherwise, further reduces the number of processes used to at most the specified value.

If this ends up using a single process, runs the function serially. Otherwise, fork new processes to execute the function invocations (using multiprocessing.get_context('fork').Pool.map).

The downside is that this is slow, and you need to set up mutable shared memory (e.g. for large results) in advance. The upside is that each of these processes starts with a shared memory copy(-on-write) of the full Python state, that is, all the inputs for the function are available “for free”.

If a progress bar is active at the time of invoking parallel_map, and hide_from_progress_bar is not set, then it is assumed the parallel map will cover all the current (slice of) the progress bar, and it is reported into it in increments of 1/invocations.
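A minimal stdlib stand-in for this pattern (an illustration of the mechanism, not the package's implementation; the fork start method is not available on Windows):

```python
import multiprocessing

def invocation(index):
    # The real function would read its large inputs from the forked
    # (copy-on-write) global state and return a small, picklable result.
    return index * index

# Fork a pool of worker processes and map the function over the
# invocation indices, as parallel_map does internally.
with multiprocessing.get_context("fork").Pool(processes=2) as pool:
    results = pool.map(invocation, range(4))
assert results == [0, 1, 4, 9]
```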

Progress

This uses tqdm to provide a progress bar while computing the metacells.

metacells.utilities.progress.progress_bar(**tqdm_kwargs: Any) Any[source]

Run some code with a tqdm progress bar.

Note

When a progress bar is active, logging is restricted to warnings and errors.

metacells.utilities.progress.progress_bar_slice(fraction: float | None) Any[source]

Run some code which will use a slice of the current progress bar.

This can be nested to split the overall progress bar into smaller and smaller parts to represent a tree of computations.

If fraction is None, or there is no active progress bar, simply runs the code.

metacells.utilities.progress.did_progress(fraction: float) Any[source]

Report progress of some fraction of the current (slice of) progress bar.

metacells.utilities.progress.has_progress_bar() bool[source]

Return whether there is an active progress bar.

metacells.utilities.progress.start_progress_bar(**tqdm_kwargs: Any) Any[source]

Create a progress bar (but do not show it yet).

metacells.utilities.progress.end_progress_bar() None[source]

End an active progress bar.

metacells.utilities.progress.start_progress_bar_slice(fraction: float) Tuple[int, int][source]

Start a nested slice of the overall progress bar.

Returns the captured state that needs to be passed to end_progress_bar_slice.

metacells.utilities.progress.end_progress_bar_slice(old_state: Tuple[int, int]) None[source]

End a nested slice of the overall progress bar.

This moves the progress bar position to the end of the slice regardless of reported progress within it.

Timing

The first step in achieving reasonable performance is identifying where most of the time is being spent. The functions in this module make it easy to collect timing information about the relevant functions, or about steps within functions, in a controlled way with low overhead. This is in contrast to collecting information about all functions, which has higher overheads and produces mountains of mostly irrelevant data.

metacells.utilities.timing.collect_timing(collect: bool, path: str = 'timing.csv', mode: str = 'a', *, buffering: int = 1) None[source]

Specify whether, where and how to collect timing information.

By default, we do not. Override this by setting the METACELLS_COLLECT_TIMING environment variable to true, or by invoking this function from the main thread.

By default, the data is written to the path timing.csv, which is opened with mode a (append) and buffering of 1 (line buffered). Override these by setting the METACELL_TIMING_PATH, METACELL_TIMING_MODE and/or METACELL_TIMING_BUFFERING environment variables, or by invoking this function from the main thread.

This will flush and close the previous timing file, if any.

The file is written in CSV format (without headers). The first three fields are:

  • The invocation context (a .-separated path of “relevant” function/step names).

  • The elapsed time (in nanoseconds) in this context (not counting nested contexts).

  • The CPU time (in nanoseconds) in this context (not counting nested contexts).

This may be followed by a series of name,value pairs describing parameters of interest for this context, such as data sizes and layouts, to help understand the performance of the code.
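A line in this format can be parsed with the stdlib csv module; the context and parameter names in this sketch are hypothetical, and the name,value labeling follows the foo,elapsed_ns,123,cpu_ns,456 shape of the generated lines:

```python
import csv
import io

# Parse one hypothetical timing.csv line into its context and the
# following name,value pairs.
line = "foo.bar,elapsed_ns,123,cpu_ns,456,rows,1000\n"
fields = next(csv.reader(io.StringIO(line)))
context = fields[0]
values = dict(zip(fields[1::2], (int(value) for value in fields[2::2])))
assert context == "foo.bar"
assert values == {"elapsed_ns": 123, "cpu_ns": 456, "rows": 1000}
```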

metacells.utilities.timing.flush_timing() None[source]

Flush the timing information, if we are collecting it.

metacells.utilities.timing.in_parallel_map(map_index: int, process_index: int) None[source]

Reconfigure timing collection when running in a parallel sub-process via metacells.utilities.parallel.parallel_map().

This will direct the timing information from <timing>.csv to <timing>.<map>.<process>.csv (where <timing> is from the original path, <map> is the serial number of the metacells.utilities.parallel.parallel_map() invocation, and <process> is the serial number of the process in the map).

Collecting the timing of separate sub-processes in separate files allows us to freely write to them without locks or synchronization, which improves performance (reduces the overhead of collecting timing information).

You can just concatenate the files when the run is complete, or use a tool which automatically collects the data from all the files, such as metacells.scripts.timing.

metacells.utilities.timing.log_steps(log: bool) None[source]

Whether to log every step invocation.

By default, we do not. Override this by setting the METACELLS_LOG_ALL_STEPS environment variable to true or by invoking this function from the main thread.

Note

This only works if collect_timing() was set. It is a crude instrument to hunt for deadlocks, very-long-running numpy functions, and the like. Basically, if the program is taking 100% CPU and you have no idea what it is doing, turning this on and looking at the last logged step name would give you some idea of where it is stuck.

metacells.utilities.timing.timed_step(name: str) Iterator[None][source]

Collect timing information for a computation step.

Expected usage is:

with ut.timed_step("foo"):
    some_computation()

If we are collecting timing information, then for every invocation, the program will append a line similar to:

foo,elapsed_ns,123,cpu_ns,456

To a timing log file (default: timing.csv). Additional fields can be appended to the line using the metacells.utilities.timing.parameters function.

If the name starts with a . or a _, then it is prefixed with the name of the innermost surrounding step (which must exist). This is commonly used to time sub-steps of a function.

metacells.utilities.timing.timed_call(name: str | None = None) Callable[[CALLABLE], CALLABLE][source]

Automatically wrap each invocation of the decorated function with metacells.utilities.timing.timed_step() using the name (by default, the function’s __qualname__).

Expected usage is:

@ut.timed_call()
def some_function(...):
    ...

metacells.utilities.timing.timed_parameters(**kwargs: Any) None[source]

Associate relevant timing parameters to the innermost metacells.utilities.timing.timed_step().

The specified arguments are appended at the end of the generated timing.csv line. For example, timed_parameters(foo=2, bar=3) would add foo,2,bar,3 to the line in timing.csv.

This allows tracking parameters which affect invocation time (such as array sizes), to help identify the causes for the long-running operations.

metacells.utilities.timing.context() str[source]

Return the full current context (path of metacells.utilities.timing.timed_step()-s leading to the current point).

Note

The context will be the empty string unless we are actually collecting timing.

metacells.utilities.timing.current_step() StepTiming | None[source]

The timing collector for the innermost (current) metacells.utilities.timing.timed_step(), if any.

class metacells.utilities.timing.StepTiming(name: str, parent: StepTiming | None)[source]

Timing information for some named processing step.

parent

The parent step, if any.

context: str

The full context of the processing step.

parameters: List[str]

Parameters of interest of the processing step.

thread_name

The thread the step was invoked in.

total_nested

The amount of CPU used in nested steps in the same thread.

class metacells.utilities.timing.Counters(*, elapsed_ns: int = 0, cpu_ns: int = 0)[source]

The counters for the execution times.

elapsed_ns

Elapsed time counter.

cpu_ns

CPU time counter.

static now() Counters[source]

Return the current value of the counters.

Logging

This provides a useful formatter which includes high-resolution time and thread names, and a set of utility functions for effective logging of operations on annotation data.
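A comparable formatter can be built with the stdlib alone (a sketch of the described format, not the package's formatter):

```python
import io
import logging

# Millisecond-resolution time plus the thread name in every message.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s.%(msecs)03d %(threadName)s %(levelname)s %(message)s",
    datefmt="%H:%M:%S"))
demo_logger = logging.getLogger("metacells-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.info("hello")
```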

Collection of log messages is mostly automated by wrapping relevant function calls and tracing the setting and getting of data via the metacells.utilities.annotation accessors, with the occasional explicit logging of a notable intermediate calculated value via log_calc().

The tricky part is picking the correct level for each log message. This module provides the following log levels, which hopefully give the end user a reasonable amount of control:

  • INFO will log only setting of the final results as annotations within the top-level AnnData object(s).

  • STEP will also log the top-level algorithm step(s), which give a very basic insight into what was executed.

  • PARAM will also log the parameters of these steps, which may be important when tuning the behavior of the system for different data sets.

  • CALC will also log notable intermediate calculated results, which again may be important when tuning the behavior of the system for different data sets.

  • DEBUG pulls all the stops and logs all the above, not only for the top-level steps, but also for the nested processing steps. This results in a rather large log file (especially for the recursive divide-and-conquer algorithm). You don’t need this except for when you really need this.

To achieve this, we track for each AnnData whether it is a top-level (user visible) or a temporary data object, and whether we are inside a top-level (user invoked) or a nested operation. Accessing top-level data and invoking top-level operations is logged at the coarse logging levels, anything else is logged at the DEBUG level.

To improve the log messages, we allow each AnnData object to have an optional name for logging (see metacells.utilities.annotation.set_name() and metacells.utilities.annotation.get_name()). Whenever a temporary AnnData data is created, its name is extended by some descriptive suffix, so we get names like full.clean.select to describe the data selected from the clean data extracted out of the full data.

metacells.utilities.logging.setup_logger(*, level: int = 20, to: ~typing.IO = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, time: bool = False, process: bool | None = None, name: str | None = None, long_level_names: bool | None = None) Logger[source]

Set up the global logger().

Note

A second call will fail as the logger will already be set up.

If level is not specified, only INFO messages (setting values in the annotated data) will be logged.

If to is not specified, the output is sent to sys.stderr.

If time (default: False), include a millisecond-resolution timestamp in each message.

If name (default: None) is specified, it is added to each message.

If process (default: None), include the (sub-)process index in each message. The name of the main process (thread) is replaced with #0 to make it more compatible with the sub-process names (#<map-index>.<sub-process-index>).

If process is None, and if the logging level is higher than INFO, and metacells.utilities.parallel.get_processors_count() is greater than one, then process is set - that is, it will be set if we expect to see log messages from multiple sub-processes.

Logging from multiple sub-processes (e.g., using metacells.utilities.parallel.parallel_map()) will synchronize using a global lock so messages will not get garbled.

If long_level_names (default: None) is True, the full log level name is included in each message. If it is False, the level names are shortened to three characters, for consistent formatting of indented (nested) log messages. If it is None, no level names are logged at all.

metacells.utilities.logging.logger() Logger[source]

Access the global logger.

If setup_logger() has not been called yet, this will call it using the default flags. You should therefore call setup_logger() as early as possible to ensure you don’t end up with a misconfigured logger.

metacells.utilities.logging.CALC = 12

The log level for tracing intermediate calculations.

metacells.utilities.logging.STEP = 17

The log level for tracing processing steps.

metacells.utilities.logging.PARAM = 15

The log level for tracing parameters.
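These three values slot between the stdlib's DEBUG (10) and INFO (20); registering names for such custom levels would look like this (a sketch of the usual stdlib mechanism, not necessarily how the package wires them up):

```python
import logging

# Register human-readable names for the custom levels.
logging.addLevelName(12, "CALC")
logging.addLevelName(15, "PARAM")
logging.addLevelName(17, "STEP")

assert logging.getLevelName(15) == "PARAM"
assert logging.DEBUG < 12 < 15 < 17 < logging.INFO
```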

metacells.utilities.logging.logged(**kwargs: Callable[[Any], Any]) Callable[[CALLABLE], CALLABLE][source]

Automatically wrap each invocation of the decorated function with logging it. Top-level calls are logged using the STEP log level, with parameters logged at the PARAM log level. Nested calls are logged at the DEBUG log level.

By default parameters are logged by simply converting them to a string, with special cases for AnnData, callable functions, boolean masks, vectors and matrices. You can override this by specifying parameter_name=convert_value_to_logged_value for the specific parameter.

Expected usage is:

@ut.logged()
def some_function(...):
    ...

metacells.utilities.logging.top_level(adata: AnnData) None[source]

Indicate that the annotated data will be returned to the top-level caller, increasing its logging level.

metacells.utilities.logging.log_return(name: str, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool[source]

Log a value returned from a function with some name.

If formatter is specified, use it to override the default logged value formatting.

metacells.utilities.logging.logging_calc() bool[source]

Whether we are actually logging the intermediate calculations.

metacells.utilities.logging.log_calc(name: str, value: Any = None, *, formatter: Callable[[Any], Any] | None = None) bool[source]

Log an intermediate calculated value computed from a function with some name.

If formatter is specified, use it to override the default logged value formatting.

metacells.utilities.logging.log_step(name: str, value: Any = None, *, formatter: Callable[[Any], Any] | None = None) Iterator[None][source]

Same as log_calc(), but also further indent all the log messages inside the with statement body.

metacells.utilities.logging.incremental(adata: AnnData, per: str, name: str, formatter: Callable[[Any], Any] | None = None) None[source]

Declare that the named annotation will be built incrementally - set and then repeatedly modified.

metacells.utilities.logging.done_incrementals(adata: AnnData) None[source]

Declare that all the incremental values have been fully computed.

metacells.utilities.logging.cancel_incrementals(adata: AnnData) None[source]

Cancel tracking incremental annotations.

metacells.utilities.logging.log_set(adata: AnnData, per: str, name: str, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool[source]

Log setting some annotated data.

metacells.utilities.logging.log_get(adata: AnnData, per: str, name: Any, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool[source]

Log getting some annotated data.

metacells.utilities.logging.sizes_description(sizes: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str[source]

Return a string for logging an array of sizes.

metacells.utilities.logging.fractions_description(sizes: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str[source]

Return a string for logging an array of fractions (between zero and one).

metacells.utilities.logging.groups_description(groups: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str[source]

Return a string for logging an array of group indices.

Note

This assumes that the indices are consecutive, with negative values indicating “outliers”.

metacells.utilities.logging.mask_description(mask: str | ndarray | Collection[int] | Collection[float] | PandasSeries | CompressedMatrix | PandasFrame | SparseMatrix) str[source]

Return a string for logging a boolean mask.

metacells.utilities.logging.ratio_description(denominator: float, element: str, numerator: float, condition: str, *, base: bool = True) str[source]

Return a string for describing a ratio (including a percent representation).

metacells.utilities.logging.progress_description(amount: int, index: int, element: str) str[source]

Return a string for describing progress in a loop.

metacells.utilities.logging.fraction_description(fraction: float | None) str[source]

Return a string for describing a fraction (including a percent representation).

metacells.utilities.logging.fold_description(fold: float) str[source]

Return a string for describing a fold factor.

Documentation

Utilities for documenting Python functions.

metacells.utilities.documentation.expand_doc(**kwargs: Any) Callable[[CALLABLE], CALLABLE][source]

Expand the keyword arguments and the annotated function’s default argument values inside the function’s document string.

That is, given something like:

@expand_doc(foo=7)
def bar(baz, vaz=5):
    """
    Bar with {foo} foos and parameter vaz (default: {baz}).
    """

Then help(bar) will print:

Bar with 7 foos and parameter vaz (default: 5).
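A minimal sketch of how such a decorator can be implemented (not the package's actual code): format the docstring with the given keywords plus the default argument values taken from the function's signature:

```python
import inspect

def expand_doc(**kwargs):
    def wrapper(function):
        # Collect the default values of the function's parameters.
        defaults = {
            name: parameter.default
            for name, parameter in inspect.signature(function).parameters.items()
            if parameter.default is not inspect.Parameter.empty
        }
        # Expand both the explicit keywords and the defaults in place.
        function.__doc__ = function.__doc__.format(**kwargs, **defaults)
        return function
    return wrapper

@expand_doc(foo=7)
def bar(baz, vaz=5):
    """Bar with {foo} foos and parameter vaz (default: {vaz})."""

assert bar.__doc__ == "Bar with 7 foos and parameter vaz (default: 5)."
```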