tfmindi.pp.calculate_motif_similarity
- tfmindi.pp.calculate_motif_similarity(seqlets, known_motifs, chunk_size=None, n_nearest=None, threshold=None, **kwargs)
Calculate TomTom similarity and convert to log-space for clustering.
- Parameters:
seqlets (list[ndarray]) – List of seqlet contribution matrices, each with shape (4, length).
known_motifs (list[ndarray] | dict[tuple[str, str], ndarray]) – Known motif PPMs, given either as a list of matrices or as a dictionary of matrices. Each matrix has shape (4, length).
chunk_size (int | None, default: None) – If provided, process seqlets in chunks of this size to limit memory usage. If None, process all seqlets at once (original behavior).
n_nearest (int | None, default: None) – If provided, keep only the n most similar motifs for each seqlet. This produces a naturally sparse matrix and reduces memory usage. If None, compute similarities to all motifs (with optional thresholding).
threshold (float | None, default: None) – Similarity threshold for sparsity, used only when n_nearest is None. Values below the threshold are clipped to zero. The documented default is 0.05. Ignored when n_nearest is specified.
**kwargs – Additional arguments passed to memelite’s TomTom.
- Return type:
- Returns:
Sparse log-transformed similarity array with shape (n_seqlets, n_motifs). When n_nearest is used, only the top-k similarities per seqlet are stored. When threshold is used, values below threshold are clipped to zero.
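The two sparsification modes described above can be illustrated on a toy matrix. This is a minimal NumPy sketch, not the library's implementation: the helper names `keep_top_k` and `clip_below` are hypothetical, the input values are made up, and the result is kept as a dense array with zeros for readability, whereas the function itself returns a sparse array.

```python
import numpy as np


def keep_top_k(dense, k):
    """Keep the k largest values in each row and zero out the rest (hypothetical helper)."""
    n_rows, n_cols = dense.shape
    k = min(k, n_cols)
    # argpartition puts the indices of the k largest entries in the last k columns
    top_idx = np.argpartition(dense, n_cols - k, axis=1)[:, n_cols - k:]
    out = np.zeros_like(dense)
    rows = np.arange(n_rows)[:, None]
    out[rows, top_idx] = dense[rows, top_idx]
    return out


def clip_below(dense, threshold):
    """Set every value below the threshold to zero (hypothetical helper)."""
    return np.where(dense < threshold, 0.0, dense)


# Toy similarity scores: 2 seqlets (rows) x 4 motifs (columns), values invented
toy = np.array([
    [0.9, 0.1, 0.4, 0.02],
    [0.03, 0.7, 0.2, 0.6],
])

top2 = keep_top_k(toy, 2)           # analogous to n_nearest=2
thresholded = clip_below(toy, 0.05)  # analogous to threshold=0.05
```

With top-k selection every row holds exactly k stored values, so the memory footprint is predictable. With thresholding the number of surviving values depends on the data.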
Examples
>>> import tfmindi
>>> from tfmindi.pp import calculate_motif_similarity
>>> _, seqlet_matrices = tfmindi.pp.extract_seqlets(contrib, oh)
>>> # Memory-efficient: only keep top 50 similarities per seqlet
>>> similarity_matrix = calculate_motif_similarity(seqlet_matrices, known_motifs, n_nearest=50)
>>> print(similarity_matrix.shape)
(1250, 3989)
>>> # Traditional approach with thresholding
>>> similarity_matrix = calculate_motif_similarity(seqlet_matrices, known_motifs, threshold=0.1)
>>> # For large datasets, use chunking with n_nearest
>>> similarity_matrix = calculate_motif_similarity(seqlet_matrices, known_motifs, chunk_size=10000, n_nearest=50)
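The idea behind `chunk_size` can be sketched generically: score consecutive slices of the input and stack the per-chunk results row-wise, so only one chunk's intermediate data is in memory at a time. The sketch below is an assumption about the general pattern, not the library's code. `process_in_chunks` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the real similarity computation.

```python
import numpy as np


def process_in_chunks(items, chunk_size, score_chunk):
    """Apply score_chunk to consecutive slices of items and stack the results row-wise."""
    if chunk_size is None:
        # No chunking requested: score everything in one call
        return score_chunk(items)
    blocks = []
    for start in range(0, len(items), chunk_size):
        blocks.append(score_chunk(items[start:start + chunk_size]))
    return np.vstack(blocks)


def toy_score(chunk):
    """Hypothetical scorer: one row per item, three made-up 'motif' columns."""
    return np.array([[len(x), len(x) * 2, len(x) * 3] for x in chunk], dtype=float)


items = ["ACGT", "AC", "ACGTAC", "A", "ACG"]
chunked = process_in_chunks(items, 2, toy_score)
unchunked = process_in_chunks(items, None, toy_score)
```

Because each row depends only on its own item, the chunked and unchunked results are identical. Chunking changes peak memory use, not the output.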