tfmindi.pp.calculate_motif_similarity

tfmindi.pp.calculate_motif_similarity(seqlets, known_motifs, chunk_size=None, n_nearest=None, threshold=None, **kwargs)

Calculate TomTom similarity and convert to log-space for clustering.

Parameters:
  • seqlets (list[ndarray]) – List of seqlet contribution matrices, each with shape (4, length)

  • known_motifs (list[ndarray] | dict[tuple[str, str], ndarray]) – Known motif PPMs, given either as a list of matrices or as a dictionary keyed by (str, str) tuples. Each PPM has shape (4, length).

  • chunk_size (int | None (default: None)) – If provided, process seqlets in chunks of this size to manage memory usage. If None, process all seqlets at once (original behavior).

  • n_nearest (int | None (default: None)) – If provided, only keep the n most similar motifs for each seqlet. This creates naturally sparse matrices and reduces memory usage. If None, computes similarities to all motifs (with optional thresholding).

  • threshold (float | None (default: None)) – Similarity threshold used for sparsity when n_nearest is None. Values below the threshold are clipped to zero. When threshold is left as None, the documented default of 0.05 applies. Ignored when n_nearest is specified.

  • **kwargs – Additional arguments for memelite’s TomTom
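The difference between n_nearest and threshold can be illustrated on a small dense score matrix. The sketch below is not tfmindi's implementation: keep_top_n and clip_below are hypothetical helpers, and it assumes a plain NumPy array of scores in which larger values mean more similar.

```python
import numpy as np
from scipy.sparse import csr_array


def keep_top_n(scores, n_nearest):
    """Keep the n_nearest largest values in each row (hypothetical helper)."""
    if n_nearest < 1:
        raise ValueError("n_nearest must be at least 1")
    n_rows, n_cols = scores.shape
    n = min(n_nearest, n_cols)
    # argpartition places the indices of the n largest values in the last n slots
    top_idx = np.argpartition(scores, n_cols - n, axis=1)[:, n_cols - n:]
    rows = np.repeat(np.arange(n_rows), n)
    cols = top_idx.ravel()
    return csr_array((scores[rows, cols], (rows, cols)), shape=scores.shape)


def clip_below(scores, threshold):
    """Set values below threshold to zero, then store sparsely (hypothetical helper)."""
    clipped = np.where(scores < threshold, 0.0, scores)
    return csr_array(clipped)


# Toy data standing in for a (n_seqlets, n_motifs) similarity matrix
rng = np.random.default_rng(0)
scores = rng.random((5, 8))

top = keep_top_n(scores, n_nearest=3)    # exactly 3 stored values per row
thr = clip_below(scores, threshold=0.5)  # number of stored values varies per row
```

With top-n selection the number of stored entries per row is fixed, so memory use is predictable; with thresholding it depends on the score distribution.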

Return type:

csr_array

Returns:

Sparse log-transformed similarity array with shape (n_seqlets, n_motifs). When n_nearest is used, only the n_nearest highest similarities per seqlet are stored. When threshold is used, values below the threshold are clipped to zero.
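Because the result is a SciPy csr_array, its sparsity can be inspected directly from the CSR buffers. The following is a minimal sketch on a hand-built stand-in matrix; sparsity_summary is a hypothetical helper and not part of tfmindi.

```python
import numpy as np
from scipy.sparse import csr_array


def sparsity_summary(sim):
    """Summarize how many entries a CSR array stores (hypothetical helper)."""
    n_rows, n_cols = sim.shape
    # Consecutive differences of indptr give the stored-entry count of each row
    stored_per_row = np.diff(sim.indptr)
    nbytes = sim.data.nbytes + sim.indices.nbytes + sim.indptr.nbytes
    return {
        "shape": sim.shape,
        "stored": int(sim.nnz),
        "density": sim.nnz / (n_rows * n_cols),
        "max_stored_per_row": int(stored_per_row.max()),
        "approx_bytes": int(nbytes),
    }


# Stand-in for a (n_seqlets, n_motifs) result: 3 seqlets, 4 motifs
dense = np.array(
    [
        [0.0, 1.5, 0.0, 0.0],
        [2.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
)
summary = sparsity_summary(csr_array(dense))
print(summary["stored"], summary["density"], summary["max_stored_per_row"])
```

For a matrix produced with n_nearest, max_stored_per_row would be expected to be at most n_nearest.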

Examples

>>> import tfmindi
>>> _, seqlet_matrices = tfmindi.pp.extract_seqlets(contrib, oh)
>>> # Memory-efficient: only keep top 50 similarities per seqlet
>>> similarity_matrix = tfmindi.pp.calculate_motif_similarity(seqlet_matrices, known_motifs, n_nearest=50)
>>> print(similarity_matrix.shape)
(1250, 3989)
>>> # Traditional approach with thresholding
>>> similarity_matrix = tfmindi.pp.calculate_motif_similarity(seqlet_matrices, known_motifs, threshold=0.1)
>>> # For large datasets, use chunking with n_nearest
>>> similarity_matrix = tfmindi.pp.calculate_motif_similarity(seqlet_matrices, known_motifs, chunk_size=10000, n_nearest=50)
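The idea behind chunk_size can be sketched generically: score one slice of the inputs at a time, store each block sparsely, and stack the blocks row-wise. This is an illustration under stated assumptions, not how tfmindi implements chunking; score_in_chunks and toy_score are hypothetical names, and toy_score is a simple dot product rather than TomTom.

```python
import numpy as np
from scipy.sparse import csr_array, vstack


def score_in_chunks(items, chunk_size, score_chunk):
    """Score items chunk by chunk and stack the sparse blocks (hypothetical helper).

    score_chunk is any callable mapping a list of items to a dense
    (len(chunk), n_targets) array.
    """
    if len(items) == 0:
        raise ValueError("items must not be empty")
    blocks = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        # Only one dense block is held in memory at a time
        blocks.append(csr_array(score_chunk(chunk)))
    return csr_array(vstack(blocks, format="csr"))


# Toy stand-in for a scoring function: dot products against 6 reference vectors
rng = np.random.default_rng(1)
reference = rng.random((6, 4))
items = [rng.random(4) for _ in range(10)]


def toy_score(chunk):
    return np.stack(chunk) @ reference.T


chunked = score_in_chunks(items, chunk_size=3, score_chunk=toy_score)
full = toy_score(items)
```

Chunking changes peak memory, not the result: chunked and full hold the same values here.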