Utilities¶
Generic utilities used by the metacells code.
Arguably all (most) of these belong in more general package(s).
All the functions included here are exported under metacells.ut.
Annotation¶
In general we are using AnnData
to hold the data being analyzed. However, the interface
of AnnData leaves some things out which are crucial for the proper working of our algorithm
(and any other algorithm that works at a scale of millions of cells).
X as an Annotation¶
For a uniform interface, we pretend the X member is a per-variable-per-observation annotation with the special name __x__. This allows us to have APIs that take an annotation name and pass them (typically by default) the special annotation name __x__ to force the code to run on the X data member.
In general the APIs allow specifying either annotation names or alternatively an explicit matrix (or vector for per-observation or per-variable annotations), for maximal usage flexibility.
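For example, the uniform treatment of X can be sketched with a tiny stand-in container (hypothetical names, not the real AnnData or metacells implementation):

```python
import numpy as np

class FakeAnnData:
    """Hypothetical stand-in for AnnData, just for illustration."""
    def __init__(self, X, layers):
        self.X = X
        self.layers = layers

def get_vo(adata, name="__x__"):
    """Fetch per-variable-per-observation data by name, treating X uniformly."""
    if name == "__x__":
        return adata.X  # the special name reaches the X data member
    return adata.layers[name]

adata = FakeAnnData(np.arange(6).reshape(2, 3), {"log": np.zeros((2, 3))})
assert get_vo(adata).shape == (2, 3)      # default "__x__" reaches X
assert get_vo(adata, "log").sum() == 0.0  # a named layer works the same way
```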
Data Types¶
The generic AnnData
is cheerfully permissive when it comes to the data it contains. That is, when
accessing data, it isn’t clear whether you’ll be getting a numpy array or a pandas data series, and
for 2D data you might be getting all sorts of data types (including sparse matrices of various
formats).
Python itself is very loose about the interfaces these data types provide: some operations, such as len and shape and accessing an element by integer indices, are safe; more advanced operations can silently produce the wrong results; and most operations work only on a subset of the data types, often with wildly incompatible interfaces.
To combat this, we have the metacells.utilities.typing module, which imposes some order on the types zoo. In addition, we provide here accessor functions which return deterministic, usable data types, allowing for safe processing of the results. This is combined with the metacells.utilities.computation module, which provides a set of operations that work consistently on the few data types we use.
Data Layout¶
A related issue is the layout of 2D data. For small matrices this doesn't matter, but when dealing with large matrices (millions of rows/columns), performing a simple operation may take orders of magnitude longer if applied to a matrix of the wrong layout.
To make things worse, the builtin functions for converting between matrix layouts are pretty inefficient, so more efficient variants are provided in the metacells.utilities.computation module.
The accessors in this module allow for explicitly controlling the layout of the data they return,
and cache the different layouts of the same annotations of the AnnData
(under the reasonable
assumption that the original data is not modified). This allows for writing
guaranteed-to-be-efficient processing code.
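The layout distinction can be seen with plain numpy, whose C order corresponds to row_major here and Fortran order to column_major (a minimal sketch using only numpy):

```python
import numpy as np

# "row_major" corresponds to numpy's C order, "column_major" to Fortran order.
m = np.arange(6, dtype=float).reshape(2, 3)   # C (row-major) order by default
assert m.flags["C_CONTIGUOUS"] and not m.flags["F_CONTIGUOUS"]

f = np.asfortranarray(m)                      # copy into column-major layout
assert f.flags["F_CONTIGUOUS"]

# Reducing along the contiguous axis touches memory sequentially; reducing
# across it strides through memory, which is what makes the wrong layout
# orders of magnitude slower on large matrices. The values are unchanged.
assert np.array_equal(m.sum(axis=1), f.sum(axis=1))
```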
Data Logging¶
A side benefit of exclusively using the accessors provided here is that they participate in the automated logging provided by the metacells.utilities.logging module. That is, using them will automatically log writing the final results of a computation to the user at the INFO log level, while higher logging levels give insight into the exact data being read and written by the algorithm's nested sub-steps.
- metacells.utilities.annotation.slice(adata: AnnData, *, name: str | None = None, obs: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, vars: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, track_obs: str | None = None, track_var: str | None = None, share_derived: bool = True, top_level: bool = True) AnnData [source]¶
Return new annotated data which includes a subset of the full adata.

If name is not specified, the data will be unnamed. Otherwise, if it starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.

If obs and/or vars are specified, they should be set to either a boolean mask or a collection of indices to include in the data slice. In the case of an indices array, it is assumed the indices are unique and sorted, that is, that their effect is similar to a mask.

If track_obs and/or track_var are specified, the result slice will include a per-observation and/or per-variable annotation containing the indices of the sliced elements in the original full data.

If the slice happens to be the full original data, then this becomes equivalent to copy_adata(), and by default this will share_derived (share the derived data cache).
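The documented equivalence between a boolean mask and a sorted, unique indices vector can be checked with plain numpy (illustrative only; this does not call slice() itself):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])

mask = np.array([True, False, True, True])
indices = np.array([0, 2, 3])   # unique and sorted, as slice() expects

# A sorted, unique index vector selects the same elements as the mask.
assert np.array_equal(values[mask], values[indices])
```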
- metacells.utilities.annotation.copy_adata(adata: AnnData, *, name: str | None = None, share_derived: bool = True, top_level: bool = True) AnnData [source]¶
Return a copy of some annotated adata.

If name is not specified, the data will be unnamed. Otherwise, if it starts with a ., it will be appended to the current name (if any). Otherwise, name is the new name.

If share_derived is True (the default), then the copy will share the derived data cache, which contains specific layout variants of matrix data and sums of columns/rows of matrix data. Set this to False if you intend to modify the copy in-place.

Note

In general we assume annotated data is not modified in-place, but it might make sense to create a copy (not sharing derived data), modify it immediately (before accessing data in a specific layout), and then proceed to process it without further modifications.
- metacells.utilities.annotation.set_name(adata: AnnData, name: str | None) None [source]¶
Set the name of the data (for log messages).

If the name starts with a . it is appended to the current name, if any.
- metacells.utilities.annotation.get_name(adata: AnnData, default: str | None = None) str | None [source]¶
Return the name of the data (for log messages), if any.
If no name was set, returns the default.
- metacells.utilities.annotation.set_m_data(adata: AnnData, name: str, data: Any, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set unstructured data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_m_data(adata: AnnData, name: str, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Get metadata (unstructured annotation) in adata by its name.
- metacells.utilities.annotation.set_o_data(adata: AnnData, name: str, data: ndarray, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-observation (cell) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_o_series(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) PandasSeries [source]¶
Get per-observation (cell) data in adata by its name as a pandas series.

If name is a string, it is the name of a per-observation annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.
- metacells.utilities.annotation.get_o_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray [source]¶
Get per-observation (cell) data in adata by its name as a numpy array.

If name is a string, it is the name of a per-observation annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.

If sum is True, then name should be the name of a per-observation-per-variable annotation, or a matrix, and this will return the sum (per row) of this data.
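What sum=True computes can be illustrated with plain numpy (a sketch of the semantics, not the actual implementation):

```python
import numpy as np

# With sum=True, a per-observation-per-variable matrix is reduced to one
# value per observation (cell) by summing each row.
umis = np.array([[1, 0, 2],
                 [0, 3, 0]])
per_cell_total = umis.sum(axis=1)   # the sum (per row) of the data
assert per_cell_total.tolist() == [3, 3]
```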
- metacells.utilities.annotation.get_o_names(adata: AnnData) ndarray [source]¶
Get a numpy vector of observation names.
- metacells.utilities.annotation.maybe_o_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries | None, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray | None [source]¶
Similar to get_o_numpy(), but if name is None, return None.
- metacells.utilities.annotation.set_v_data(adata: AnnData, name: str, data: ndarray, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-variable (gene) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_v_series(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) PandasSeries [source]¶
Get per-variable (gene) data in adata by its name as a pandas series.

If name is a string, it is the name of a per-variable annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.
- metacells.utilities.annotation.get_v_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray [source]¶
Get per-variable (gene) data in adata by its name as a numpy array.

If name is a string, it is the name of a per-variable annotation to fetch. Otherwise, it should be some vector of data of the appropriate size.
- metacells.utilities.annotation.get_v_names(adata: AnnData) ndarray [source]¶
Get a numpy vector of variable names.
- metacells.utilities.annotation.maybe_v_numpy(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries | None, *, sum: bool = False, formatter: Callable[[Any], Any] | None = None) ndarray | None [source]¶
Similar to get_v_numpy(), but if name is None, return None.
- metacells.utilities.annotation.set_oo_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-observation-per-observation (cell) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_oo_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame [source]¶
Get per-observation-per-observation (per-cell-per-cell) data as a pandas data frame.
If name is a string, it is the name of a per-observation-per-observation annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note
Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.
- metacells.utilities.annotation.get_oo_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix [source]¶
Same as get_oo_data but returns a metacells.utilities.typing.ProperMatrix.
- metacells.utilities.annotation.set_vv_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-variable-per-variable (gene) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_vv_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame [source]¶
Get per-variable-per-variable (per-gene-per-gene) data as a pandas data frame.
If name is a string, it is the name of a per-variable-per-variable annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note
Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.
- metacells.utilities.annotation.get_vv_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix [source]¶
Same as get_vv_data but returns a metacells.utilities.typing.ProperMatrix.
- metacells.utilities.annotation.set_oa_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-observation-per-any (cell) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_oa_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns: Collection | None, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame [source]¶
Get per-observation-per-any (per-cell-per-any) data as a pandas data frame.
Rows are observations (cells), indexed by the observation names (typically the cell barcode). Columns are "something"; specify columns to provide an index for them.

If name is a string, it is the name of a per-observation-per-any annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note
Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.
- metacells.utilities.annotation.get_oa_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix [source]¶
Same as get_oa_data but returns a metacells.utilities.typing.ProperMatrix.
- metacells.utilities.annotation.set_va_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-variable-per-any (gene) data.
If formatter is specified, its result is used when logging the operation.
- metacells.utilities.annotation.get_va_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns: Collection | None, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame [source]¶
Get per-variable-per-any (per-gene-per-any) data as a pandas data frame.

Rows are variables (genes), indexed by their names. Columns are "something"; specify columns to provide an index for them.

If name is a string, it is the name of a per-variable-per-any annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note
Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.
- metacells.utilities.annotation.get_va_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix [source]¶
Same as get_va_data but returns a metacells.utilities.typing.ProperMatrix.
- metacells.utilities.annotation.set_vo_data(adata: AnnData, name: str, data: ndarray | CompressedMatrix, *, formatter: Callable[[Any], Any] | None = None) Any [source]¶
Set per-variable-per-observation (per-gene-per-cell) data.
- metacells.utilities.annotation.get_vo_frame(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) PandasFrame [source]¶
Get per-variable-per-observation (per-gene-per-cell) data as a pandas data frame.
Rows are observations (cells), indexed by the observation names (typically cell barcode). Columns are variables (genes), indexed by their names.
If name is a string, it is the name of a per-variable-per-observation annotation to fetch. Otherwise, it should be some matrix of data of the appropriate size.

If layout is specified, it must be one of row_major or column_major. If this requires relayout of the data, the result is cached in a hidden data member for future reuse.

Note
Since Pandas frames do not play well with sparse representations, this will always return the data as a dense matrix. For very large data, this may consume much more memory, so use with care.
- metacells.utilities.annotation.get_vo_proper(adata: AnnData, name: str | ndarray | CompressedMatrix | PandasFrame | SparseMatrix = '__x__', *, layout: str | None = None, formatter: Callable[[Any], Any] | None = None) ndarray | CompressedMatrix [source]¶
Same as get_vo_data but returns a metacells.utilities.typing.ProperMatrix.
- metacells.utilities.annotation.has_data(adata: AnnData, name: str, layout: str | None = None) bool [source]¶
Test whether we have the specified data.
If the data is per-variable-per-observation, and layout is specified (one of row_major and column_major), it returns whether the specific data layout is available in the cache, without having to re-layout existing data.
Computation¶
Most of the functions defined here are thin wrappers around builtin numpy or scipy functions, with others wrapping C++ extensions provided as part of the metacells package itself.
The key distinction of the functions here is that they provide a uniform interface for all the supported metacells.utilities.typing.Matrix and metacells.utilities.typing.Vector types, which makes them safe to use in our code without worrying about the exact data type used. In theory, Python duck-typing should have provided this out of the box, but in practice, without explicit types and interfaces, the interfaces of the different types diverge to the point where this just doesn't work.
All the functions here (optionally) also allow collecting timing information using
metacells.utilities.timing
, to make it easier to locate the performance bottleneck of the
analysis pipeline.
- metacells.utilities.computation.allow_inefficient_layout(allow: bool) bool [source]¶
Specify whether to allow processing using an inefficient layout.
Returns the previous setting.
This is True by default, which merely warns when an inefficient layout is used. Otherwise, processing an inefficient layout is treated as an error (raises an exception).
- metacells.utilities.computation.to_layout(matrix: CompressedMatrix, layout: str, *, symmetric: bool = False) CompressedMatrix [source]¶
- metacells.utilities.computation.to_layout(matrix: ndarray, layout: str, *, symmetric: bool = False) ndarray
- metacells.utilities.computation.to_layout(matrix: PandasFrame | SparseMatrix, layout: str, *, symmetric: bool = False) ndarray | CompressedMatrix
Return the matrix in a specific layout for efficient processing.

That is, if layout is column_major, re-layout the matrix for efficient per-column (variable, gene) slicing/processing. For sparse matrices, this is the csc format; for dense matrices, this is the Fortran (column-major) layout.

Similarly, if layout is row_major, re-layout the matrix for efficient per-row (observation, cell) slicing/processing. For sparse matrices, this is the csr format; for dense matrices, this is the C (row-major) layout.

If the matrix is already in the correct layout, it is returned as-is.

If the matrix is symmetric (default: False), it must be square and is assumed to be equal to its own transpose. This allows converting it from one layout to another using the efficient (essentially zero-cost) transpose operation.

Otherwise, a new copy is created in the proper layout. This is a costly operation, as it needs to move all the data elements to their proper place. It uses a C++ extension to deal with compressed data (the builtin implementation is much slower). Even so the operation is costly; still, it makes the following processing much more efficient, so it is typically a net performance gain overall.
- metacells.utilities.computation.sort_compressed_indices(matrix: CompressedMatrix, force: bool = False) None [source]¶
Efficient parallel sort of indices in a CSR/CSC matrix.

This will skip sorting a matrix that is marked as sorted, unless force is specified.
- metacells.utilities.computation.corrcoef(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, reproducible: bool) ndarray [source]¶
Similar to numpy.corrcoef, but also works for a sparse matrix, and can be reproducible regardless of the number of cores used (at the cost of some slowdown). It only works for matrices with a float or double element data type.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes: an all-zero row has no data, so we declare it to be "not similar" to anything else.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).

Note
The result is always dense, as even for sparse data, the correlation is rarely exactly zero.
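A plain numpy comparison shows both the normal behavior and the zero-variance corner case this wrapper handles differently (numpy yields NaN where the wrapper reports a zero correlation):

```python
import numpy as np

rows = np.array([[1.0, 2.0, 3.0],
                 [3.0, 2.0, 1.0]])
c = np.corrcoef(rows)
assert np.isclose(c[0, 1], -1.0)   # perfectly anti-correlated rows

# numpy produces NaN for a constant row (a 0/0 division); the wrapper
# described above instead reports a zero correlation for such rows.
with np.errstate(divide="ignore", invalid="ignore"):
    c2 = np.corrcoef(np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]))
assert np.isnan(c2[0, 1])
```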
- metacells.utilities.computation.cross_corrcoef_rows(first_matrix: ndarray, second_matrix: ndarray, *, reproducible: bool) ndarray [source]¶
Similar to numpy.corrcoef, but computes the correlations between each row of the first_matrix and each row of the second_matrix. The result matrix contains one row per row of the first matrix and one column per row of the second matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and contain the same number of columns.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes: an all-zero row has no data, so we declare it to be "not similar" to anything else.

Note

This only works for floating-point matrices.
- metacells.utilities.computation.pairs_corrcoef_rows(first_matrix: ndarray, second_matrix: ndarray, *, reproducible: bool) ndarray [source]¶
Similar to numpy.corrcoef, but computes the correlations between each row of the first_matrix and each matching row of the second_matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and the same shape.

If reproducible, a slower (still parallel) but reproducible algorithm will be used.

Unlike numpy.corrcoef, if given a row with identical values, instead of complaining about division by zero, this will report a zero correlation. This makes sense for the intended usage of computing similarities between cells/genes: an all-zero row has no data, so we declare it to be "not similar" to anything else.

Note

This only works for floating-point matrices.
- metacells.utilities.computation.logistics(matrix: ndarray, *, location: float, slope: float, per: str | None) ndarray [source]¶
Compute a matrix of distances between each pair of rows in a dense (float or double) matrix using the logistics function.
The raw value of the logistics distance between a pair of vectors x and y is the mean of 1/(1+exp(-slope*(abs(x[i]-y[i])-location))). This has a minimum of 1/(1+exp(slope*location)) for identical vectors and an (asymptotic) maximum of 1. We normalize this to a range between 0 and 1, to be useful as a distance measure (with a zero distance between identical vectors).

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).

Note

The result is always dense, as even for sparse data, the result is rarely exactly zero.
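The documented formula is easy to sketch in numpy; this is an illustrative re-implementation for a single pair of vectors, not the C++ code behind logistics():

```python
import numpy as np

def logistics_distance(x, y, *, location, slope):
    """Sketch of the documented formula for one pair of vectors."""
    raw = np.mean(1.0 / (1.0 + np.exp(-slope * (np.abs(x - y) - location))))
    minimum = 1.0 / (1.0 + np.exp(slope * location))   # raw value when x == y
    return (raw - minimum) / (1.0 - minimum)           # normalize into [0, 1)

x = np.array([0.0, 1.0, 2.0])
# Identical vectors normalize to a zero distance.
assert np.isclose(logistics_distance(x, x, location=0.8, slope=1.0), 0.0)
# Different vectors give a distance strictly between 0 and 1.
assert 0.0 < logistics_distance(x, x + 5.0, location=0.8, slope=1.0) < 1.0
```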
- metacells.utilities.computation.cross_logistics_rows(first_matrix: ndarray, second_matrix: ndarray, *, location: float, slope: float) ndarray [source]¶
Similar to logistics(), but computes the distances between each row of the first_matrix and each row of the second_matrix. The result matrix contains one row per row of the first matrix and one column per row of the second matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and contain the same number of columns.
- metacells.utilities.computation.pairs_logistics_rows(first_matrix: ndarray, second_matrix: ndarray, *, location: float, slope: float) ndarray [source]¶
Similar to logistics(), but computes the distances between each row of the first_matrix and each matching row of the second_matrix. Both matrices must be dense, in row-major layout, have the same (float or double) element data type, and the same shape.
- metacells.utilities.computation.log_data(shaped: S, *, base: float | None = None, normalization: float = 0) S [source]¶
Return the log of the values in the shaped data.

If base is specified (default: None), use it as the base of the log. Otherwise, use the natural logarithm.

The normalization (default: 0) specifies how to deal with zeros in the data:

If it is zero, an input zero will become an output NaN.

If it is positive, it is added to the input before computing the log.

If it is negative, input zeros will become log(minimal positive value) + normalization, that is, the zeros will be given a value this much smaller than the minimal "real" log value.
Note
The result is always dense, as even for sparse data, the log is rarely zero.
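The three normalization modes can be sketched in numpy (an illustrative re-implementation of the documented rules, not the real log_data):

```python
import numpy as np

def log_with_normalization(values, *, base=None, normalization=0.0):
    """Sketch of the documented zero-handling rules (not the real log_data)."""
    values = np.asarray(values, dtype=float)
    if normalization > 0:   # positive: shift everything before taking the log
        values = values + normalization
    log = np.log(values, out=np.full_like(values, np.nan), where=values > 0)
    if normalization < 0:   # negative: zeros -> log(min positive) + normalization
        log[values == 0] = np.log(values[values > 0].min()) + normalization
    if base is not None:    # convert from natural log to the requested base
        log /= np.log(base)
    return log

result = log_with_normalization([0.0, 1.0, np.e])
assert np.isnan(result[0])              # zero normalization: zeros become NaN
assert np.isclose(result[2], 1.0)       # natural log of e is 1
```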
- metacells.utilities.computation.median_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the median value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.mean_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the mean value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.nanmean_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the mean value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.geomean_per(matrix: ndarray, *, per: str | None) ndarray [source]¶
Compute the geometric mean value per (row or column) of some (dense) matrix (of non-zero values).

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.max_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the maximal value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.nanmax_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the maximal value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.min_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the minimal value per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.nanmin_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the minimal value per (row or column) of some matrix, ignoring NaN values, if any.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.nnz_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the number of non-zero values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).

Note

If given a sparse matrix, this returns the number of structural non-zeros, that is, the number of entries we actually store data for, even if this data is zero. Use metacells.utilities.typing.eliminate_zeros() if you suspect the sparse matrix of containing structural zero data values.
- metacells.utilities.computation.sum_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the total of the values per (row or column) of some matrix.

If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.sum_squared_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Compute the total of the squared values
per
(row
orcolumn
) of somematrix
.If
per
isNone
, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one ofrow
orcolumn
, and the matrix must be in the appropriate layout (row_major
operating on rows,column_major
for operating on columns).
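The per-row / per-column semantics shared by these reductions can be sketched in plain numpy (an illustration of the behavior only; the metacells versions also handle sparse matrices and enforce layout):

```python
import numpy as np

# A row-major (C-contiguous) matrix; "per row" reduces along axis 1,
# "per column" reduces along axis 0.
matrix = np.array([[1.0, 2.0], [3.0, 4.0]], order="C")

sum_per_row = matrix.sum(axis=1)                    # like sum_per(matrix, per="row")
sum_squared_per_column = (matrix ** 2).sum(axis=0)  # like sum_squared_per(matrix, per="column")

print(sum_per_row)             # [3. 7.]
print(sum_squared_per_column)  # [10. 20.]
```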
- metacells.utilities.computation.rank_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, rank: int, *, per: str | None) ndarray [source]¶
Get the element of the given rank per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.top_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, top: int, *, per: str | None, ranks: bool = False) CompressedMatrix [source]¶
Get the top elements per (row or column) of some matrix, as a compressed per-major matrix.
If ranks (default: False), fill the result with the rank of each element; otherwise, keep the original value.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.prune_per(compressed: CompressedMatrix, top: int) CompressedMatrix [source]¶
Keep just the top elements of some compressed matrix, per row for CSR and per column for CSC.
- metacells.utilities.computation.quantile_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, quantile: float, *, per: str | None) ndarray [source]¶
Get the quantile element per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.nanquantile_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, quantile: float, *, per: str | None) ndarray [source]¶
Get the quantile element per (row or column) of some matrix, ignoring NaN values.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.scale_by(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, scale: ndarray | Collection[int] | Collection[float] | PandasSeries, *, by: str) ndarray | CompressedMatrix [source]¶
Return a matrix where each by (row or column) is scaled by the matching value of the scale vector.
- metacells.utilities.computation.fraction_by(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, sums: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, by: str) ndarray | CompressedMatrix [source]¶
Return a matrix containing, in each entry, the fraction of the original data out of the total by (row or column).
That is, the sum of each by in the result will be 1. However, if sums is specified, it is used instead of the sum of each by, so the sums of the results may differ.
Note
This assumes all the data values are non-negative.
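The fraction-by-row semantics can be sketched in plain numpy (an illustration of the behavior for the by="row" case, not the metacells implementation):

```python
import numpy as np

# Non-negative counts; dividing each row by its total turns it into
# fractions that sum to 1.
counts = np.array([[1.0, 3.0], [2.0, 2.0]])
fractions = counts / counts.sum(axis=1, keepdims=True)

print(fractions)              # [[0.25 0.75] [0.5 0.5]]
print(fractions.sum(axis=1))  # [1. 1.]
```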
- metacells.utilities.computation.fraction_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Get the fraction per (row or column) out of the total of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.stdev_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Get the standard deviation per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None) ndarray [source]¶
Get the variance per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.normalized_variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, zero_value: float = 1.0) ndarray [source]¶
Get the normalized variance (variance / mean) per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
If all the values are zero, writes the zero_value (default: 1.0) into the result.
- metacells.utilities.computation.relative_variance_per(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str | None, window_size: int) ndarray [source]¶
Return log2(normalized_variance) - median(log2(normalized_variance_of_similar)) of the values per (row or column) of some matrix.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
- metacells.utilities.computation.sum_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the sum of all the values in a matrix.
- metacells.utilities.computation.nnz_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the number of non-zero entries in a matrix.
Note
If given a sparse matrix, this returns the number of structural non-zeros, that is, the number of entries we actually store data for, even if this data is zero. Use metacells.utilities.typing.eliminate_zeros() if you suspect the sparse matrix of containing structural zero data values.
- metacells.utilities.computation.mean_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the mean of all the values in a matrix.
- metacells.utilities.computation.max_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the maximum of all the values in a matrix.
- metacells.utilities.computation.min_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the minimum of all the values in a matrix.
- metacells.utilities.computation.nanmean_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the mean of all the non-NaN values in a matrix.
- metacells.utilities.computation.nanmax_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the maximum of all the non-NaN values in a matrix.
- metacells.utilities.computation.nanmin_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Any [source]¶
Compute the minimum of all the non-NaN values in a matrix.
- metacells.utilities.computation.rank_matrix_by_layout(matrix: ndarray, ascending: bool) Any [source]¶
Replace each element of the matrix with its rank (within its row for row_major, within its column for column_major).
If ascending, rank 1 is the minimal element; otherwise, rank 1 is the maximal element.
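The ascending per-row ranking described above can be sketched with the common numpy double-argsort idiom (an illustration of the semantics, not the metacells implementation):

```python
import numpy as np

# Rank elements within each row of a row-major matrix, rank 1 = minimum.
# argsort of the argsort gives each element's position in sorted order.
matrix = np.array([[30, 10, 20], [5, 15, 25]])
ranks = matrix.argsort(axis=1).argsort(axis=1) + 1

print(ranks)  # [[3 1 2] [1 2 3]]
```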
- metacells.utilities.computation.bincount_vector(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, minlength: int = 0) ndarray [source]¶
Drop-in replacement for numpy.bincount, which is timed and works for any vector data.
- metacells.utilities.computation.most_frequent(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any [source]¶
Return the most frequent value in a vector.
This is useful for metacells.tools.convey.convey_obs_to_group().
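For the common case of a vector of small non-negative integers (such as group indices), the most frequent value can be sketched with plain numpy (an illustration only; the metacells version works for arbitrary values):

```python
import numpy as np

# Count occurrences of each value, then take the value with the highest
# count; ties resolve to the smaller value.
vector = np.array([2, 2, 0, 1, 2, 1])
most_frequent_value = np.bincount(vector).argmax()

print(most_frequent_value)  # 2
```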
- metacells.utilities.computation.strongest(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any [source]¶
Return the strongest (maximal absolute) value in a vector.
This is useful for metacells.tools.convey.convey_obs_to_group().
- metacells.utilities.computation.highest_weight(weights: ndarray | Collection[int] | Collection[float] | PandasSeries, vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any [source]¶
Return the value with the highest total weights in a vector.
This is useful for metacells.tools.project.convey_atlas_to_query().
- metacells.utilities.computation.weighted_mean(weights: ndarray | Collection[int] | Collection[float] | PandasSeries, vector: ndarray | Collection[int] | Collection[float] | PandasSeries) Any [source]¶
Return the weighted mean (using the weights and the values in the vector).
This is useful for metacells.tools.project.convey_atlas_to_query().
- metacells.utilities.computation.fraction_of_grouped(value: Any) Callable[[ndarray | Collection[int] | Collection[float] | PandasSeries], Any] [source]¶
Return a function that takes a vector and returns the fraction of the vector's elements which are equal to a specific value.
- metacells.utilities.computation.downsample_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str, samples: int | ndarray | Collection[int] | Collection[float] | PandasSeries, eliminate_zeros: bool = True, inplace: bool = False, random_seed: int) ndarray | CompressedMatrix [source]¶
Downsample the data per (one of row or column) such that the sum of each one becomes samples.
If the matrix is sparse, and eliminate_zeros (default: True), then perform a final phase of eliminating leftover zero values from the compressed format. This means the result will be in "canonical format" so further scipy sparse operations on it will be faster.
If inplace (default: False), modify the matrix in-place; otherwise, return a modified copy.
A non-zero random_seed will make the operation replicable.
- metacells.utilities.computation.downsample_vector(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, samples: int, *, output: ndarray | None = None, random_seed: int) None [source]¶
Downsample a vector of sample counters.
Input
A numpy vector containing non-negative integer sample counts.
A desired total number of samples.
An optional numpy array output to hold the results (otherwise, the input is overwritten).
A random_seed (non-zero for reproducible results).
The arrays may have any of the data types: float32, float64, int32, int64, uint32, uint64.
Operation
If the total number of samples (the sum of the vector) is not higher than the required number of samples, the output is identical to the input.
Otherwise, treat the input as if it were a multiset where each index appears as many times as its value. Randomly select the desired number of samples from this multiset (without repetition), and store in the output the number of times each index was chosen.
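The operation described above can be sketched in plain numpy (a hypothetical downsample_sketch helper illustrating the semantics, not the optimized metacells implementation):

```python
import numpy as np

def downsample_sketch(vector, samples, random_seed=123456):
    """Illustrative sketch of the downsampling semantics described above."""
    vector = np.asarray(vector)
    if vector.sum() <= samples:
        return vector.copy()
    # Treat the input as a multiset where each index appears vector[index]
    # times, then sample from it without repetition.
    expanded = np.repeat(np.arange(len(vector)), vector)
    rng = np.random.default_rng(random_seed)
    chosen = rng.choice(expanded, size=samples, replace=False)
    return np.bincount(chosen, minlength=len(vector))

result = downsample_sketch([4, 0, 6], samples=5)
print(result.sum())  # 5; the zero entry stays zero
```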
- metacells.utilities.computation.matrix_rows_folds_and_aurocs(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, columns_subset: ndarray, columns_scale: ndarray | None = None, normalization: float) Tuple[ndarray, ndarray] [source]¶
Given a matrix and a subset of the columns, return two vectors. The first contains, for each row, the mean column value in the subset divided by the mean column value outside the subset. The second contains for each row the area under the receiver operating characteristic (AUROC) for the row, that is, the probability that a random column in the subset would have a higher value in this row than a random column outside the subset.
If columns_scale is specified, the data is divided by this scale before computing the AUROC.
- metacells.utilities.computation.sliding_window_function(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, function: str, window_size: int, order_by: ndarray | None = None) ndarray [source]¶
Return a vector of the same size as the input vector, where each entry is the result of applying the function (one of mean, median, std, var) to a sliding window of size window_size centered on the entry.
If order_by is specified, the vector is first sorted by this order, and the end result is unsorted back to the original order. That is, the sliding window centered at each position will contain the window_size entries which have the nearest order_by values to the center entry.
Note
The window size should be an odd positive integer. If an even value is specified, it is automatically increased by one.
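A centered sliding median (without order_by) can be sketched in plain numpy (an illustration of the windowing semantics; the metacells version also handles the order_by reordering):

```python
import numpy as np

# Apply a median over a centered sliding window of (odd) size 3; numpy's
# sliding_window_view needs explicit edge padding to keep the output length.
vector = np.array([1.0, 9.0, 2.0, 8.0, 3.0])
padded = np.pad(vector, 1, mode="edge")
windows = np.lib.stride_tricks.sliding_window_view(padded, 3)
medians = np.median(windows, axis=1)

print(medians)  # [1. 2. 8. 3. 3.]
```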
- metacells.utilities.computation.patterns_matches(patterns: str | Pattern | Collection[str | Pattern], strings: Collection[str], invert: bool = False) ndarray [source]¶
Given a collection of (case-insensitive) strings, return a boolean mask specifying which of them match the given regular expression patterns.
If invert (default: False), invert the mask.
- metacells.utilities.computation.compress_indices(indices: ndarray | Collection[int] | Collection[float] | PandasSeries) ndarray [source]¶
Given a vector of group indices per element, return a vector where the group indices are consecutive.
If the group indices contain -1 ("outliers"), it is preserved as -1 in the result.
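The remapping can be sketched in plain numpy (an illustration of the semantics, not the metacells implementation):

```python
import numpy as np

# Make group indices consecutive while preserving -1 ("outlier") entries.
indices = np.array([3, -1, 7, 3, 7, 9])
unique = np.unique(indices[indices >= 0])  # [3, 7, 9]
mapping = {old: new for new, old in enumerate(unique)}
compressed = np.array([mapping.get(index, -1) for index in indices])

print(compressed)  # [ 0 -1  1  0  1  2]
```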
- metacells.utilities.computation.bin_pack(element_sizes: ndarray | Collection[int] | Collection[float] | PandasSeries, max_bin_size: float) ndarray [source]¶
Given a vector of element_sizes, return a vector containing the bin number for each element, such that the total size of each bin is at most, and as close as possible to, the max_bin_size.
This uses the first-fit decreasing algorithm for finding an initial solution and then moves elements around to minimize the L2 norm of the wasted space in each bin.
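The first-fit decreasing initial solution can be sketched as follows (a hypothetical first_fit_decreasing helper illustrating only the first phase; the metacells version then also rebalances the bins):

```python
import numpy as np

def first_fit_decreasing(element_sizes, max_bin_size):
    # Assign each element (largest first) to the first bin with enough
    # free capacity, opening a new bin when none fits.
    order = np.argsort(element_sizes)[::-1]
    bins = np.full(len(element_sizes), -1)
    remaining = []  # free capacity of each open bin
    for index in order:
        size = element_sizes[index]
        for bin_index, free in enumerate(remaining):
            if size <= free:
                bins[index] = bin_index
                remaining[bin_index] -= size
                break
        else:
            bins[index] = len(remaining)
            remaining.append(max_bin_size - size)
    return bins

sizes = np.array([7, 5, 4, 3, 1])
assignment = first_fit_decreasing(sizes, max_bin_size=10)
print(assignment)
```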
- metacells.utilities.computation.bin_fill(element_sizes: ndarray | Collection[int] | Collection[float] | PandasSeries, min_bin_size: float) ndarray [source]¶
Given a vector of element_sizes, return a vector containing the bin number for each element, such that the total size of each bin is at least, and as close as possible to, the min_bin_size.
This uses the first-fit decreasing algorithm for finding an initial solution and then moves elements around to minimize the L2 norm of the wasted space in each bin.
- metacells.utilities.computation.sum_groups(matrix: ndarray | CompressedMatrix, groups: ndarray | Collection[int] | Collection[float] | PandasSeries, *, per: str | None, transform: Callable[[ndarray | CompressedMatrix | PandasFrame | SparseMatrix], ndarray | CompressedMatrix | PandasFrame | SparseMatrix] | None = None) Tuple[ndarray, ndarray] | None [source]¶
Given a matrix, and a vector of groups per column or row, return a matrix with a column or row per group, containing the sum of the group's columns or rows, and a vector of sizes (the number of summed columns or rows) per group.
Negative group indices ("outliers") are ignored and their data is not included in the result. If there are no non-negative group indices, returns None.
If per is None, the matrix must be square and is assumed to be symmetric, so the most efficient direction is used based on the matrix layout. Otherwise it must be one of row or column, and the matrix must be in the appropriate layout (row_major for operating on rows, column_major for operating on columns).
If transform is not None, it is applied to the data before summing it.
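Summing rows by group while skipping outliers can be sketched in plain numpy (an illustration of the per="row" semantics, not the metacells implementation):

```python
import numpy as np

# Sum the rows of a row-major matrix by group, skipping negative
# ("outlier") group indices; np.add.at accumulates each row into its
# group's row of the result.
matrix = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
groups = np.array([0, -1, 0])

mask = groups >= 0
sums = np.zeros((groups[mask].max() + 1, matrix.shape[1]))
np.add.at(sums, groups[mask], matrix[mask])
sizes = np.bincount(groups[mask])

print(sums)   # [[6. 8.]]
print(sizes)  # [2]
```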
- metacells.utilities.computation.shuffle_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, per: str, random_seed: int) None [source]¶
Shuffle (in-place) the matrix data per column or row.
The matrix must be in the appropriate layout (row_major for shuffling data in each row, column_major for shuffling data in each column).
A non-zero random_seed will make the operation replicable.
- metacells.utilities.computation.cover_diameter(*, points_count: int, area: float, cover_fraction: float) float [source]¶
Return the diameter to give to each point so that the total area of points_count points will be a cover_fraction of the total area.
- metacells.utilities.computation.cover_coordinates(x_coordinates: ndarray | Collection[int] | Collection[float] | PandasSeries, y_coordinates: ndarray | Collection[int] | Collection[float] | PandasSeries, *, cover_fraction: float = 0.3333333333333333, noise_fraction: float = 1.0, random_seed: int) Tuple[ndarray, ndarray] [source]¶
Given x/y coordinates of points, move them so that the total area covered by them is cover_fraction (default: 1/3) of the total area of their bounding box, assuming each has the diameter of their minimal distance. The points are jiggled around by the noise_fraction of their minimal distance using the random_seed (non-zero for reproducible results).
Returns new x/y coordinate vectors.
- metacells.utilities.computation.random_piles(elements_count: int, target_pile_size: int, *, random_seed: int) ndarray [source]¶
Split elements_count elements into piles of a size roughly equal to target_pile_size.
Return a vector specifying the pile index of each element.
Specify a non-zero random_seed to make this replicable.
- metacells.utilities.computation.represent(goal: ndarray, basis: ndarray) Tuple[float, ndarray] | None [source]¶
Represent a goal vector as a weighted average of the row vectors of some basis matrix.
This computes a non-negative weight for each matrix row, such that the sum of the weights is 1, minimizing the distance (L2 norm) between the goal vector and the weighted average of the basis vectors. This is a convex quadratic problem subject to linear constraints, so cvxpy solves it efficiently.
The return value is a tuple with the score of the weights vector, and the weights vector itself.
- metacells.utilities.computation.min_cut(weights: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) Tuple[Cut, float | None] [source]¶
Find the minimal cut that will split an undirected graph (with a symmetrical weights matrix).
Returns the igraph.Cut object describing the cut, and the scale-invariant strength of the cut edges. This strength is the ratio between the mean weight of an edge connecting a random node in each partition and the mean weight of an edge connecting two random nodes inside a random partition. If either of the partitions contains no edges (e.g. contains a single node), the strength will be None.
- metacells.utilities.computation.sparsify_matrix(full: ndarray | CompressedMatrix, min_column_max_value: float, min_entry_value: float, abs_values: bool) CompressedMatrix [source]¶
Given a full matrix, return a sparse matrix such that all non-zero entries are at least min_entry_value, and columns that have no value above min_column_max_value are set to all-zero. If abs_values, consider the absolute values when comparing to the thresholds.
Typing¶
The code has to deal with many different alternative data types for what is essentially two basic data types: 2D matrices and 1D vectors.
Specifically, we have pandas data frames and series, Scipy sparse matrices, and numpy multi-dimensional arrays (not to mention the deprecated numpy matrix type).
Python has the great ability to “duck type”, so in an ideal world, we could just pretend these are just two data types and be done. In practice, this is hopelessly broken.
First, even operations that exist for all data types sometimes have different interfaces (as in, np.foo(matrix, ...) vs. matrix.foo(...)).
Second, operating on sparse and dense data often requires completely different code paths.
This makes it very easy to write code that works today and breaks tomorrow when someone passes a pandas series to a function that expects a numpy array and it just almost works correctly (and god help the poor soul that mixes up a numpy matrix with a numpy 2d array, or passes a categorical pandas series to something that expects a series of strings).
“Eternal vigilance is the price of liberty” - the solution here is to define a bunch of fake types,
which are almost entirely for the benefit of the mypy
type checker (with some run-time
assertions as well).
This not only makes the code intent explicit (“explicit is better than implicit”) but also allows us
to leverage mypy
to catch errors such as applying a numpy operation on a sparse matrix, etc.
To put some order in this chaos, the following concepts are used:
Shaped is any 1D or 2D data in any format we can work with. Matrix is any 2D data, and Vector is any 1D data.
For 2D data, we allow multiple data types that we can't directly operate on: most SparseMatrix layouts, PandasFrame and np.matrix have strange quirks when it comes to directly operating on them and should be avoided, while CSR and CSC CompressedMatrix sparse matrices and properly-laid-out 2D numpy arrays (NumpyMatrix) are in general well-behaved. We therefore introduce the concept of ProperMatrix vs. ImproperMatrix types, and provide functions that control whether the "proper" data is in row-major or column-major order.
For 1D data, we just distinguish between PandasSeries and 1D numpy NumpyVector arrays, as these are the only types we allow. In theory we could have also allowed for sparse vectors, but mercifully these are very uncommon so we can just ignore them.
Ironically, now that numpy added type annotations, the usefulness of the type hints added here has decreased, since both NumpyVector and NumpyMatrix are aliases to the same numpy.ndarray type. Perhaps in the future numpy will allow for using Annotated types (with explicit number of dimensions, or even - gasp - the element data type) to allow for more useful type annotations. Or this could all be ported to Julia and avoid this whole mess.
- metacells.utilities.typing.CPP_DATA_TYPES = ['float32', 'float64', 'int32', 'int64', 'uint32', 'uint64']¶
The data types supported by the C++ extensions code.
- metacells.utilities.typing.Shaped¶
Shaped data of any of the types we can deal with.
alias of Union[ndarray, CompressedMatrix, PandasFrame, SparseMatrix, Collection[int], Collection[float], PandasSeries]
- metacells.utilities.typing.ProperShaped¶
"Proper" 1- or 2-dimensional data.
alias of Union[ndarray, CompressedMatrix]
- metacells.utilities.typing.ImproperShaped¶
"Improper" 1- or 2-dimensional data.
alias of Union[PandasFrame, SparseMatrix, Collection[int], Collection[float], PandasSeries]
- metacells.utilities.typing.Matrix¶
A mypy type for any 2-dimensional data.
alias of Union[ndarray, CompressedMatrix, PandasFrame, SparseMatrix]
- metacells.utilities.typing.ProperMatrix¶
A mypy type for "proper" 2-dimensional data.
"Proper" data allows for direct processing without having to mess with its formatting.
alias of Union[ndarray, CompressedMatrix]
- metacells.utilities.typing.NumpyMatrix¶
Numpy 2-dimensional data.
Note
This is not to be confused with numpy.matrix, which must not be used, but is returned by the occasional function, and would wreak havoc on the semantics of some operations unless immediately converted to a proper NumpyMatrix, which is a simple 2-dimensional ndarray.
- class metacells.utilities.typing.CompressedMatrix[source]¶
A mypy type for sparse CSR/CSC 2-dimensional data.
Should have been CompressedMatrix = sp..._cs_matrix.
- metacells.utilities.typing.ImproperMatrix¶
A mypy type for "improper" 2-dimensional data.
"Improper" data contains or can be converted to "proper" data.
alias of Union[PandasFrame, SparseMatrix]
- class metacells.utilities.typing.SparseMatrix[source]¶
A mypy type for sparse 2-dimensional data.
Should have been SparseMatrix = sp.base.spmatrix.
- class metacells.utilities.typing.PandasFrame[source]¶
A mypy type for pandas 2-dimensional data.
Should have been PandasFrame = pd.DataFrame.
- metacells.utilities.typing.Vector¶
A mypy type for any 1-dimensional data.
alias of Union[ndarray, Collection[int], Collection[float], PandasSeries]
- metacells.utilities.typing.NumpyVector¶
Numpy 1-dimensional data.
- metacells.utilities.typing.ImproperVector¶
"Improper" 1-dimensional data.
alias of Union[Collection[int], Collection[float], PandasSeries]
- class metacells.utilities.typing.PandasSeries[source]¶
A mypy type for pandas 1-dimensional data.
Should have been PandasSeries = pd.Series.
- metacells.utilities.typing.is_1d(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool [source]¶
Test whether the shaped data is 1-dimensional.
- metacells.utilities.typing.is_2d(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool [source]¶
Test whether the shaped data is 2-dimensional.
- metacells.utilities.typing.maybe_numpy_vector(shaped: Any) ndarray | None [source]¶
Return the shaped data as a NumpyVector, if it is one.
- metacells.utilities.typing.maybe_numpy_matrix(shaped: Any) ndarray | None [source]¶
Return the shaped data as a NumpyMatrix, if it is one.
Note
This looks for a 2-dimensional numpy.ndarray which is not a numpy.matrix. Do not use numpy.matrix - it is deprecated and behaves subtly differently from a 2-dimensional numpy.ndarray, leading to hard-to-find bugs.
- metacells.utilities.typing.maybe_sparse_matrix(shaped: Any) SparseMatrix | None [source]¶
Return the shaped data as a SparseMatrix, if it is one.
Note
This will succeed for a CompressedMatrix, which is a sub-type of a SparseMatrix.
- metacells.utilities.typing.maybe_compressed_matrix(shaped: Any) CompressedMatrix | None [source]¶
Return shaped as a CompressedMatrix, if it is one.
- metacells.utilities.typing.maybe_pandas_frame(shaped: Any) PandasFrame | None [source]¶
Return shaped as a PandasFrame, if it is one.
- metacells.utilities.typing.maybe_pandas_series(shaped: Any) PandasSeries | None [source]¶
Return shaped as a PandasSeries, if it is one.
- metacells.utilities.typing.mustbe_numpy_vector(shaped: Any) ndarray [source]¶
Return shaped as a NumpyVector, asserting it must be one.
- metacells.utilities.typing.mustbe_numpy_matrix(shaped: Any) ndarray [source]¶
Return shaped as a NumpyMatrix, asserting it must be one.
Note
This looks for a 2-dimensional numpy.ndarray which is not a numpy.matrix. Do not use numpy.matrix - it is deprecated and behaves subtly differently from a 2-dimensional numpy.ndarray, leading to hard-to-find bugs.
- metacells.utilities.typing.mustbe_sparse_matrix(shaped: Any) SparseMatrix [source]¶
Return shaped as a SparseMatrix, asserting it must be one.
Note
This will succeed for a CompressedMatrix, which is a sub-type of a SparseMatrix.
- metacells.utilities.typing.mustbe_compressed_matrix(shaped: Any) CompressedMatrix [source]¶
Return shaped as a CompressedMatrix, asserting it must be one.
- metacells.utilities.typing.mustbe_pandas_frame(shaped: Any) PandasFrame [source]¶
Return shaped as a PandasFrame, asserting it must be one.
- metacells.utilities.typing.mustbe_pandas_series(shaped: Any) PandasSeries [source]¶
Return shaped as a PandasSeries, asserting it must be one.
- metacells.utilities.typing.to_proper_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major') ndarray | CompressedMatrix [source]¶
Given some 2D matrix, return it in a ProperMatrix format we can safely process.
If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.
- metacells.utilities.typing.to_proper_matrices(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major') Tuple[ndarray | CompressedMatrix, ndarray | None, CompressedMatrix | None] [source]¶
Similar to to_proper_matrix() but return a tuple with the proper matrix and also its NumpyMatrix representation and its CompressedMatrix representation. Exactly one of these two representations will be None.
If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.
This is used to pick between dense and compressed code paths, and provides typed references so mypy can type-check each of these paths:

proper, dense, compressed = to_proper_matrices(matrix)
... Common code path can use the proper matrix value ...
if dense is not None:
    assert compressed is None
    ... Dense code path can use the dense matrix ...
else:
    assert compressed is not None
    ... Compressed code path can use the compressed matrix value ...
    if metacells.ut.matrix_layout(compressed) == 'row_major':
        ... CSR code path ...
    else:
        ... CSC code path ...
- metacells.utilities.typing.to_pandas_series(vector: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, *, index: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None) PandasSeries [source]¶
Construct a pandas series from any Vector.
- metacells.utilities.typing.to_pandas_frame(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | None = None, *, index: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None, columns: ndarray | Collection[int] | Collection[float] | PandasSeries | None = None) PandasFrame [source]¶
Construct a pandas frame from any Matrix.
- metacells.utilities.typing.frozen(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) bool [source]¶
Test whether the shaped data is protected against future modification.
- metacells.utilities.typing.freeze(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) None [source]¶
Protect the shaped data against future modification.
- metacells.utilities.typing.unfreeze(shaped: ndarray | CompressedMatrix | PandasFrame | PandasSeries) None [source]¶
Permit future modification of some shaped data.
- metacells.utilities.typing.unfrozen(proper: ndarray | CompressedMatrix) Iterator[None] [source]¶
Execute some in-place modification, temporarily unfreezing the proper shaped data.
- metacells.utilities.typing.to_numpy_matrix(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, *, default_layout: str = 'row_major', copy: bool = False, only_extract: bool = False) ndarray [source]¶
Convert any Matrix to a dense 2-dimensional NumpyMatrix.
If copy (default: False), a copy of the data is returned even if no conversion needed to be done.
If only_extract (default: False), then assert this only extracts the data inside some pandas data.
If the data is in some strange sparse format, use default_layout (default: row_major) to decide whether to return it in row_major (CSR) or column_major (CSC) layout.
- metacells.utilities.typing.to_numpy_vector(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries, *, copy: bool = False, only_extract: bool = False) ndarray [source]¶
Convert any Vector, or a Matrix where one of the dimensions has size one, to a NumpyVector.
If copy (default: False), a copy of the data is returned even if no conversion needed to be done.
If only_extract (default: False), then assert this only extracts the data inside some pandas data.
- metacells.utilities.typing.DENSE_FAST_FLAG = {'column_major': 'F_CONTIGUOUS', 'row_major': 'C_CONTIGUOUS'}¶
Which flag indicates efficient 2D dense matrix layout.
- metacells.utilities.typing.SPARSE_FAST_FORMAT = {'column_major': 'csc', 'row_major': 'csr'}¶
Which format indicates efficient 2D sparse matrix layout.
- metacells.utilities.typing.SPARSE_SLOW_FORMAT = {'column_major': 'csr', 'row_major': 'csc'}¶
Which format indicates inefficient 2D sparse matrix layout.
- metacells.utilities.typing.LAYOUT_OF_AXIS = ('row_major', 'column_major')¶
The layout by the
axis
parameter.
- metacells.utilities.typing.PER_OF_AXIS = ('row', 'column')¶
When reducing data, get results per row or column (by the axis parameter).
- metacells.utilities.typing.shaped_dtype(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) str [source]¶
Return the data type of the element of shaped data.
- metacells.utilities.typing.matrix_layout(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix) str | None [source]¶
Return which layout the matrix is arranged by (row_major or column_major).
If the data is in some strange sparse format, returns None.
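For dense numpy data, detecting the layout boils down to inspecting the contiguity flags. A sketch under that assumption (the real function also recognizes the sparse CSR/CSC formats):

```python
from typing import Optional

import numpy as np


def matrix_layout(matrix: np.ndarray) -> Optional[str]:
    # A C-contiguous (row-major) matrix stores each row consecutively;
    # an F-contiguous (column-major) matrix stores each column consecutively.
    if matrix.flags.c_contiguous:
        return "row_major"
    if matrix.flags.f_contiguous:
        return "column_major"
    # A strided view (e.g. every other column) has neither layout.
    return None
```
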
- metacells.utilities.typing.is_layout(matrix: ndarray | CompressedMatrix | PandasFrame | SparseMatrix, layout: str | None) bool [source]¶
Test whether the matrix is arranged according to the layout.
This will always succeed if the layout is None.
- metacells.utilities.typing.is_contiguous(vector: ndarray | Collection[int] | Collection[float] | PandasSeries) bool [source]¶
Return whether the vector is contiguous in memory.
This is only True for a dense vector.
- metacells.utilities.typing.to_contiguous(vector: ndarray | Collection[int] | Collection[float] | PandasSeries, *, copy: bool = False) ndarray [source]¶
Return the vector in contiguous (dense) format.
If copy (default: False), a copy of the data is returned even if no conversion needed to be done.
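For the dense case, these two functions can be sketched with numpy alone (np.ascontiguousarray already avoids copying when no conversion is needed; the handling of pandas series is omitted):

```python
import numpy as np


def is_contiguous(vector: np.ndarray) -> bool:
    # A strided view (e.g. every other element) is not contiguous.
    return vector.flags.c_contiguous


def to_contiguous(vector: np.ndarray, *, copy: bool = False) -> np.ndarray:
    if copy:
        # Always return a fresh contiguous copy.
        return np.array(vector)
    # Only copies if the data is not already contiguous.
    return np.ascontiguousarray(vector)
```
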
- metacells.utilities.typing.mustbe_canonical(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) None [source]¶
Assert that some data is in canonical format.
For numpy matrices or vectors, this means the data is contiguous (for matrices, in either row-major or column-major order).
For sparse matrices, it means the data is in COO format, or compressed (CSC or CSR format), with sorted indices and no duplicates.
In general, we’d like all the data stored in
AnnData
to be canonical.
- metacells.utilities.typing.is_canonical(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) bool [source]¶
Return whether the data is in canonical format.
For numpy matrices or vectors, this means the data is contiguous (for matrices, in either row-major or column-major order).
For sparse matrices, it means the data is in COO format, or compressed (CSC or CSR format), with sorted indices and no duplicates.
In general, we’d like all the data stored in
AnnData
to be canonical.
- metacells.utilities.typing.eliminate_zeros(compressed: CompressedMatrix) None [source]¶
Eliminate zeros in a compressed matrix.
- metacells.utilities.typing.sort_indices(compressed: CompressedMatrix) None [source]¶
Ensure the indices are sorted in each row/column.
- metacells.utilities.typing.sum_duplicates(compressed: CompressedMatrix) None [source]¶
Sum duplicate entries in a compressed matrix, eliminating them.
- metacells.utilities.typing.shaped_checksum(shaped: ndarray | CompressedMatrix | PandasFrame | SparseMatrix | Collection[int] | Collection[float] | PandasSeries) float [source]¶
Return a checksum of the contents of
shaped
data (for debugging reproducibility).
Parallel¶
Due to the notorious GIL, using multiple Python threads is essentially useless. This leaves us with two options for using multiple processors, which is mandatory for reasonable performance on the large data sets we work on:
- Use multiple threads in the internal C++ implementation of some Python functions; this is done by both numpy and the C++ extension functions provided by this package, and works even for reasonably small sized work, such as sorting each of the rows of a large matrix.
- Use Python multi-processing. This is costly and works only for large sized work, such as computing metacells for different piles.
Each of these two approaches works tolerably well on its own, even though both are sub-optimal. The
problem starts when we want to combine them. Consider a server with 50 processors. Invoking
corrcoef on a large matrix will use them all. This is great if one computes metacells for a
single pile. Suppose, however, you want to compute metacells for 50 piles, and do so using
multi-processing. Each of the 50 sub-processes will invoke corrcoef, which will spawn
50 internal threads, resulting in the operating system seeing 2500 processes competing for the same
50 hardware processors. “This does not end well.”
You would expect that, two decades after multi-core systems became available, this would have been solved “out of the box” by the parallel frameworks (Python, OpenMP, TBB, etc.) all agreeing to cooperate with each other. However, somehow this isn’t seen as important by the people maintaining these frameworks; in fact, most of them don’t properly handle nested parallelism within their own framework, never mind playing well with others.
So in practice, while languages built for parallelism (such as Julia and Rust) deal well with nested parallel constructs, using a mixture of older serial languages (such as Python and C++) puts us in a swamp, and “you can’t build a castle in a swamp”. In our case, numpy uses some underlying parallel threads framework, our own extensions use OpenMP parallel threads, and we are forced to use the Python multi-processing framework itself on top of both, and each of these frameworks is blind to the others.
As a crude band-aid, we force both whatever-numpy-uses and OpenMP to use a specific number of threads. So, when we use multi-processing, we limit each sub-process to use fewer internal threads, such that the total will be at most 50. This is very sub-optimal, but at least it doesn’t bring the server to its knees trying to deal with a total load of 2500 processes.
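The arithmetic behind this band-aid is just dividing the physical processors evenly between the sub-processes. A sketch of that budget calculation (the actual limiting is applied through the threading frameworks' own thread-count controls, which are not shown here):

```python
def threads_per_subprocess(physical_processors: int, subprocesses: int) -> int:
    # Divide the hardware budget so that the total number of internal threads
    # across all sub-processes does not exceed the physical processors,
    # while always allowing each sub-process at least one thread.
    return max(1, physical_processors // max(1, subprocesses))
```

With 50 processors and 50 piles, each sub-process gets a single internal thread, for a total load of 50 rather than 2500.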
A final twist on all this is that hyper-threading is (worse than) useless for heavy compute threads.
We therefore by default only use one thread per physical core. We get the number of physical cores
using the psutil package.
- metacells.utilities.parallel.is_main_process() bool [source]¶
Return whether this is the main process, as opposed to a sub-process spawned by
parallel_map()
.
- metacells.utilities.parallel.set_processors_count(processors: int) None [source]¶
Set the (maximal) number of processors to use in parallel.
The default value of 0 means using all the available physical processors. Note that if hyper-threading is enabled, this would be less than (typically half of) the number of logical processors in the system. This is intentional, as there’s no value - actually, negative value - in running multiple heavy computations on hyper-threads of the same physical processor.
Otherwise, the value is the actual (positive) number of processors to use. Override this by setting the METACELLS_PROCESSORS_COUNT environment variable or by invoking this function from the main thread.
- metacells.utilities.parallel.get_processors_count() int [source]¶
Return the number of processors we are allowed to use.
- metacells.utilities.parallel.parallel_map(function: Callable[[int], T], invocations: int, *, max_processors: int = 0, hide_from_progress_bar: bool = False) List[T] [source]¶
Execute function, in parallel, invocations times. Each invocation is given the invocation’s index as its single argument.
For our simple pipelines, only the main process is allowed to execute functions in parallel processes; that is, we do not support nested parallel_map calls.
This uses get_processors_count() processes. If max_processors (default: 0) is zero, use all available processors. Otherwise, further reduce the number of processes used to at most the specified value.
If this ends up using a single process, run the function serially. Otherwise, fork new processes to execute the function invocations (using multiprocessing.get_context('fork').Pool.map).
The downside is that forking is slow, and you need to set up mutable shared memory (e.g. for large results) in advance. The upside is that each of these processes starts with a shared memory copy(-on-write) of the full Python state, that is, all the inputs for the function are available “for free”.
If a progress bar is active at the time of invoking parallel_map, and hide_from_progress_bar is not set, then it is assumed the parallel map will cover all of the current (slice of the) progress bar, and progress is reported into it in increments of 1/invocations.
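A stripped-down sketch of the serial-or-fork logic described above (the real parallel_map also manages the processors budget, the timing files and the progress bar; the "fork" start method is Unix-only, and with a forked pool the invoked function must still be picklable):

```python
import multiprocessing
from typing import Callable, List, TypeVar

T = TypeVar("T")


def parallel_map_sketch(
    function: Callable[[int], T], invocations: int, *, processes: int = 1
) -> List[T]:
    processes = min(processes, invocations)
    if processes <= 1:
        # A single process: just run the invocations serially.
        return [function(index) for index in range(invocations)]
    # Fork worker processes; each inherits the parent's full Python state
    # copy-on-write, so the function's inputs are available "for free".
    with multiprocessing.get_context("fork").Pool(processes) as pool:
        return pool.map(function, range(invocations))
```

For example, `parallel_map_sketch(lambda index: index * index, 4)` runs serially and returns the four squares.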
Progress¶
This uses tqdm to provide a progress bar while computing the metacells.
- metacells.utilities.progress.progress_bar(**tqdm_kwargs: Any) Any [source]¶
Run some code with a tqdm progress bar.
Note
When a progress bar is active, logging is restricted to warnings and errors.
- metacells.utilities.progress.progress_bar_slice(fraction: float | None) Any [source]¶
Run some code which will use a slice of the current progress bar.
This can be nested to split the overall progress bar into smaller and smaller parts to represent a tree of computations.
If
fraction
is None, or there is no active progress bar, simply runs the code.
- metacells.utilities.progress.did_progress(fraction: float) Any [source]¶
Report progress of some fraction of the current (slice of) progress bar.
- metacells.utilities.progress.has_progress_bar() bool [source]¶
Return whether there is an active progress bar.
- metacells.utilities.progress.start_progress_bar(**tqdm_kwargs: Any) Any [source]¶
Create a progress bar (but do not show it yet).
- metacells.utilities.progress.start_progress_bar_slice(fraction: float) Tuple[int, int] [source]¶
Start a nested slice of the overall progress bar.
Returns the captured state that needs to be passed to end_progress_bar_slice.
- metacells.utilities.progress.end_progress_bar_slice(old_state: Tuple[int, int]) None [source]¶
End a nested slice of the overall progress bar.
This moves the progress bar position to the end of the slice regardless of reported progress within it.
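The slice bookkeeping of these functions can be modeled as a position plus a current-slice size within a [0, 1] bar. This is a simplified float model of the idea (the actual captured state is a pair of integer tqdm positions):

```python
from typing import Tuple


class ProgressBar:
    """Nested-slice bookkeeping for a conceptual progress bar over [0, 1]."""

    def __init__(self) -> None:
        self.position = 0.0  # Fraction of the overall bar completed.
        self.size = 1.0      # Size of the current slice, in overall-bar units.

    def start_slice(self, fraction: float) -> Tuple[float, float]:
        # Capture the state needed to end the slice later, then shrink
        # the current slice to a fraction of itself.
        old_state = (self.position, self.size)
        self.size *= fraction
        return old_state

    def did_progress(self, fraction: float) -> None:
        # Advance by a fraction of the current slice.
        self.position += self.size * fraction

    def end_slice(self, old_state: Tuple[float, float]) -> None:
        old_position, old_size = old_state
        # Jump to the end of the slice regardless of reported progress.
        self.position = old_position + self.size
        self.size = old_size
```

Nesting start_slice calls splits the bar into smaller and smaller sub-ranges, mirroring the tree of computations.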
Timing¶
The first step in achieving reasonable performance is identifying where most of the time is being spent. The functions in this module make it easy to collect timing information about the relevant functions, or steps within functions, in a controlled way with low overhead, as opposed to collecting information about all functions, which has higher overhead and produces mountains of mostly irrelevant data.
- metacells.utilities.timing.collect_timing(collect: bool, path: str = 'timing.csv', mode: str = 'a', *, buffering: int = 1) None [source]¶
Specify whether, where and how to collect timing information.
By default, we do not. Override this by setting the METACELLS_COLLECT_TIMING environment variable to true, or by invoking this function from the main thread.
By default, the data is written to the path (default: timing.csv), which is opened with the mode (default: a) and using the buffering (default: 1). Override these by setting the METACELL_TIMING_PATH, METACELL_TIMING_MODE and/or METACELL_TIMING_BUFFERING environment variables, or by invoking this function from the main thread.
This will flush and close the previous timing file, if any.
The file is written in CSV format (without headers). The first three fields are:
- The invocation context (a .-separated path of “relevant” function/step names).
- The elapsed time (in nanoseconds) in this context (not counting nested contexts).
- The CPU time (in nanoseconds) in this context (not counting nested contexts).
This may be followed by a series of
name,value
pairs describing parameters of interest for this context, such as data sizes and layouts, to help understand the performance of the code.
- metacells.utilities.timing.flush_timing() None [source]¶
Flush the timing information, if we are collecting it.
- metacells.utilities.timing.in_parallel_map(map_index: int, process_index: int) None [source]¶
Reconfigure timing collection when running in a parallel sub-process via metacells.utilities.parallel.parallel_map().
This will direct the timing information from <timing>.csv to <timing>.<map>.<process>.csv (where <timing> is from the original path, <map> is the serial number of the metacells.utilities.parallel.parallel_map() invocation, and <process> is the serial number of the process in the map).
Collecting the timing of separate sub-processes into separate files allows us to freely write to them without locks or synchronization, which improves performance (reduces the overhead of collecting timing information).
You can just concatenate the files when the run is complete, or use a tool which automatically collects the data from all the files, such as
metacells.scripts.timing
.
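When post-processing these files by hand, each line is a context followed by name,value pairs (e.g. foo,elapsed_ns,123,cpu_ns,456). A parsing sketch, which is an illustration rather than part of the package's API:

```python
from typing import Dict, Tuple


def parse_timing_line(line: str) -> Tuple[str, Dict[str, str]]:
    # A line looks like: foo.bar,elapsed_ns,123,cpu_ns,456[,name,value]...
    fields = line.strip().split(",")
    context = fields[0]
    pairs = fields[1:]
    assert len(pairs) % 2 == 0, f"odd number of name,value fields in: {line!r}"
    return context, {pairs[i]: pairs[i + 1] for i in range(0, len(pairs), 2)}
```

This keeps the values as strings; converting elapsed_ns and cpu_ns to integers is left to the caller.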
- metacells.utilities.timing.log_steps(log: bool) None [source]¶
Whether to log every step invocation.
By default, we do not. Override this by setting the METACELLS_LOG_ALL_STEPS environment variable to true or by invoking this function from the main thread.
Note
This only works if
collect_timing()
was set. It is a crude instrument to hunt for deadlocks, very-long-running numpy functions, and the like. Basically, if the program is taking 100% CPU and you have no idea what it is doing, turning this on and looking at the last logged step name would give you some idea of where it is stuck.
- metacells.utilities.timing.timed_step(name: str) Iterator[None] [source]¶
Collect timing information for a computation step.
Expected usage is:
with ut.timed_step("foo"):
    some_computation()
If we are collecting timing information, then for every invocation, the program will append a line similar to:
foo,elapsed_ns,123,cpu_ns,456
to a timing log file (default: timing.csv). Additional fields can be appended to the line using the metacells.utilities.timing.parameters function.
If the name starts with a . or a _, then it is prefixed with the name of the innermost surrounding step (which must exist). This is commonly used to time sub-steps of a function.
- metacells.utilities.timing.timed_call(name: str | None = None) Callable[[CALLABLE], CALLABLE] [source]¶
Automatically wrap each invocation of the decorated function with metacells.utilities.timing.timed_step() using the name (by default, the function’s __qualname__).
Expected usage is:
@ut.timed_call()
def some_function(...):
    ...
- metacells.utilities.timing.timed_parameters(**kwargs: Any) None [source]¶
Associate relevant timing parameters with the innermost metacells.utilities.timing.timed_step().
The specified arguments are appended at the end of the generated timing.csv line. For example, timed_parameters(foo=2, bar=3) would add foo,2,bar,3 to the line in timing.csv.
This allows tracking parameters which affect invocation time (such as array sizes), to help identify the causes of long-running operations.
- metacells.utilities.timing.context() str [source]¶
Return the full current context (path of
metacells.utilities.timing.timed_step()
-s leading to the current point).Note
The context will be the empty string unless we are actually collecting timing.
- metacells.utilities.timing.current_step() StepTiming | None [source]¶
The timing collector for the innermost (current)
metacells.utilities.timing.timed_step()
, if any.
- class metacells.utilities.timing.StepTiming(name: str, parent: StepTiming | None)[source]¶
Timing information for some named processing step.
- parent¶
The parent step, if any.
- context: str¶
The full context of the processing step.
- parameters: List[str]¶
Parameters of interest of the processing step.
- thread_name¶
The thread the step was invoked in.
- total_nested¶
The amount of CPU used in nested steps in the same thread.
- class metacells.utilities.timing.Counters(*, elapsed_ns: int = 0, cpu_ns: int = 0)[source]¶
The counters for the execution times.
- elapsed_ns¶
Elapsed time counter.
- cpu_ns¶
CPU time counter.
Logging¶
This provides a useful formatter which includes high-resolution time and thread names, and a set of utility functions for effective logging of operations on annotation data.
Collection of log messages is mostly automated by wrapping relevant function calls and tracing the
setting and getting of data via the metacells.utilities.annotation
accessors, with the
occasional explicit logging of a notable intermediate calculated value via log_calc()
.
The tricky part is picking the correct level for each log message. This module provides the following log levels, which hopefully give the end user a reasonable amount of control:
- INFO will log only the setting of the final results as annotations within the top-level AnnData object(s).
- STEP will also log the top-level algorithm step(s), which gives a very basic insight into what was executed.
- PARAM will also log the parameters of these steps, which may be important when tuning the behavior of the system for different data sets.
- CALC will also log notable intermediate calculated results, which again may be important when tuning the behavior of the system for different data sets.
- DEBUG pulls out all the stops and logs all the above, not only for the top-level steps but also for the nested processing steps. This results in a rather large log file (especially for the recursive divide-and-conquer algorithm). You don’t need this except for when you really need this.
To achieve this, we track for each AnnData
whether it is a top-level (user visible) or a
temporary data object, and whether we are inside a top-level (user invoked) or a nested operation.
Accessing top-level data and invoking top-level operations is logged at the coarse logging levels,
anything else is logged at the DEBUG
level.
To improve the log messages, we allow each AnnData
object to have an optional name for logging
(see metacells.utilities.annotation.set_name()
and
metacells.utilities.annotation.get_name()
). Whenever a temporary AnnData
data is
created, its name is extended by some descriptive suffix, so we get names like
full.clean.select
to describe the data selected from the clean data extracted out of
the full data.
- metacells.utilities.logging.setup_logger(*, level: int = 20, to: ~typing.IO = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>, time: bool = False, process: bool | None = None, name: str | None = None, long_level_names: bool | None = None) Logger [source]¶
Setup the global
logger()
.Note
A second call will fail as the logger will already be set up.
If level is not specified, only INFO messages (setting values in the annotated data) will be logged.
If to is not specified, the output is sent to sys.stderr.
If time (default: False), include a millisecond-resolution timestamp in each message.
If name (default: None) is specified, it is added to each message.
If process (default: None), include the (sub-)process index in each message. The name of the main process (thread) is replaced by #0 to make it more compatible with the sub-process names (#<map-index>.<sub-process-index>).
If process is None, and the logging level is higher than INFO, and metacells.utilities.parallel.get_processors_count() is greater than one, then process is set - that is, it will be set if we expect to see log messages from multiple sub-processes.
Logging from multiple sub-processes (e.g., using metacells.utilities.parallel.parallel_map()) will synchronize using a global lock so messages will not get garbled.
If long_level_names (default: None), include the log level in each message. If it is False, the log level names are shortened to three characters, for consistent formatting of indented (nested) log messages. If it is None, no level names are logged at all.
- metacells.utilities.logging.logger() Logger [source]¶
Access the global logger.
If
setup_logger()
has not been called yet, this will call it using the default flags. You should therefore callsetup_logger()
as early as possible to ensure you don’t end up with a misconfigured logger.
- metacells.utilities.logging.CALC = 12¶
The log level for tracing intermediate calculations.
- metacells.utilities.logging.STEP = 17¶
The log level for tracing processing steps.
- metacells.utilities.logging.PARAM = 15¶
The log level for tracing parameters.
- metacells.utilities.logging.logged(**kwargs: Callable[[Any], Any]) Callable[[CALLABLE], CALLABLE] [source]¶
Automatically wrap each invocation of the decorated function with logging. Top-level calls are logged using the STEP log level, with parameters logged at the PARAM log level. Nested calls are logged at the DEBUG log level.
By default, parameters are logged by simply converting them to a string, with special cases for AnnData, callable functions, boolean masks, vectors and matrices. You can override this by specifying parameter_name=convert_value_to_logged_value for the specific parameter.
Expected usage is:
@ut.logged()
def some_function(...):
    ...
- metacells.utilities.logging.top_level(adata: AnnData) None [source]¶
Indicate that the annotated data will be returned to the top-level caller, increasing its logging level.
- metacells.utilities.logging.log_return(name: str, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool [source]¶
Log a value returned from a function with some name.
If formatter is specified, use it to override the default logged value formatting.
- metacells.utilities.logging.logging_calc() bool [source]¶
Whether we are actually logging the intermediate calculations.
- metacells.utilities.logging.log_calc(name: str, value: Any = None, *, formatter: Callable[[Any], Any] | None = None) bool [source]¶
Log an intermediate calculated value computed from a function with some name.
If formatter is specified, use it to override the default logged value formatting.
- metacells.utilities.logging.log_step(name: str, value: Any = None, *, formatter: Callable[[Any], Any] | None = None) Iterator[None] [source]¶
Same as log_calc(), but also further indent all the log messages inside the with statement body.
- metacells.utilities.logging.incremental(adata: AnnData, per: str, name: str, formatter: Callable[[Any], Any] | None = None) None [source]¶
Declare that the named annotation will be built incrementally - set and then repeatedly modified.
- metacells.utilities.logging.done_incrementals(adata: AnnData) None [source]¶
Declare that all the incremental values have been fully computed.
- metacells.utilities.logging.cancel_incrementals(adata: AnnData) None [source]¶
Cancel tracking incremental annotations.
- metacells.utilities.logging.log_set(adata: AnnData, per: str, name: str, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool [source]¶
Log setting some annotated data.
- metacells.utilities.logging.log_get(adata: AnnData, per: str, name: Any, value: Any, *, formatter: Callable[[Any], Any] | None = None) bool [source]¶
Log getting some annotated data.
- metacells.utilities.logging.sizes_description(sizes: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str [source]¶
Return a string for logging an array of sizes.
- metacells.utilities.logging.fractions_description(sizes: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str [source]¶
Return a string for logging an array of fractions (between zero and one).
- metacells.utilities.logging.groups_description(groups: ndarray | Collection[int] | Collection[float] | PandasSeries | str) str [source]¶
Return a string for logging an array of group indices.
Note
This assumes that the indices are consecutive, with negative values indicating “outliers”.
- metacells.utilities.logging.mask_description(mask: str | ndarray | Collection[int] | Collection[float] | PandasSeries | CompressedMatrix | PandasFrame | SparseMatrix) str [source]¶
Return a string for logging a boolean mask.
- metacells.utilities.logging.ratio_description(denominator: float, element: str, numerator: float, condition: str, *, base: bool = True) str [source]¶
Return a string for describing a ratio (including a percent representation).
- metacells.utilities.logging.progress_description(amount: int, index: int, element: str) str [source]¶
Return a string for describing progress in a loop.
- metacells.utilities.logging.fraction_description(fraction: float | None) str [source]¶
Return a string for describing a fraction (including a percent representation).
- metacells.utilities.logging.fold_description(fold: float) str [source]¶
Return a string for describing a fold factor (including a percent representation).
Documentation¶
Utilities for documenting Python functions.
- metacells.utilities.documentation.expand_doc(**kwargs: Any) Callable[[CALLABLE], CALLABLE] [source]¶
Expand the keyword arguments and the annotated function’s default argument values inside the function’s document string.
That is, given something like:
@expand_doc(foo=7)
def bar(baz, vaz=5):
    """
    Bar with {foo} foos and parameter vaz (default: {vaz}).
    """
Then help(bar) will print:
Bar with 7 foos and parameter vaz (default: 5).
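A minimal sketch of how such a decorator can work, formatting the docstring with both the given keywords and the function's default argument values (an illustration, not the package's actual implementation):

```python
import inspect
from typing import Any, Callable


def expand_doc(**kwargs: Any) -> Callable[[Callable], Callable]:
    def wrap(function: Callable) -> Callable:
        # Collect the default values of the function's arguments.
        defaults = {
            name: parameter.default
            for name, parameter in inspect.signature(function).parameters.items()
            if parameter.default is not inspect.Parameter.empty
        }
        # Expand both the keywords and the defaults inside the docstring.
        function.__doc__ = function.__doc__.format(**kwargs, **defaults)
        return function

    return wrap


@expand_doc(foo=7)
def bar(baz: int, vaz: int = 5) -> None:
    """Bar with {foo} foos and parameter vaz (default: {vaz})."""
```
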