tfmindi.pp.create_seqlet_adata

tfmindi.pp.create_seqlet_adata(similarity_matrix, seqlet_metadata, seqlet_matrices=None, oh_sequences=None, contrib_scores=None, motif_names=None, motif_collection=None, motif_annotations=None, motif_to_dbd=None, dtype=<class 'numpy.float32'>)

Create a comprehensive AnnData object that stores all seqlet data for the analysis pipeline.

Parameters:
  • similarity_matrix (csr_array) – Sparse log-transformed similarity array with shape (n_seqlets, n_motifs)

  • seqlet_metadata (DataFrame) – DataFrame with seqlet coordinates and metadata

  • seqlet_matrices (list[ndarray[Any, dtype[floating]]] | None (default: None)) – List of seqlet contribution matrices, each with shape (4, length)

  • oh_sequences (ndarray[Any, dtype[floating]] | None (default: None)) – One-hot sequences for each seqlet region with shape (n_examples, 4, total_length)

  • contrib_scores (ndarray[Any, dtype[floating]] | None (default: None)) – Raw contribution scores for each seqlet region with shape (n_examples, 4, total_length)

  • motif_names (list[str] | list[tuple[str, str]] | None (default: None)) – List of motif names corresponding to similarity matrix columns

  • motif_collection (dict[tuple[str, str], ndarray[Any, dtype[floating]]] | list[ndarray[Any, dtype[floating]]] | None (default: None)) – Dictionary or list of motif PPM matrices, each with shape (4, length)

  • motif_annotations (DataFrame | None (default: None)) – DataFrame with motif annotations containing TF names and other metadata

  • motif_to_dbd (dict[str, str] | None (default: None)) – Dictionary mapping motif names to DNA-binding domain annotations

  • dtype (type[floating] (default: <class 'numpy.float32'>)) – Data type for numerical arrays, chosen to reduce memory usage
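To give a sense of what the dtype choice means for memory, the following is a minimal sketch that compares the dense footprint of an array shaped like oh_sequences or contrib_scores under float32 and float64. The sizes and the helper dense_nbytes are hypothetical and chosen only for illustration; nothing here calls tfmindi itself.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_examples, total_length = 1_000, 500
shape = (n_examples, 4, total_length)


def dense_nbytes(shape, dtype):
    """Bytes needed for a dense array of the given shape and dtype."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize


bytes32 = dense_nbytes(shape, np.float32)
bytes64 = dense_nbytes(shape, np.float64)
print(f"float32: {bytes32 / 1e6:.1f} MB, float64: {bytes64 / 1e6:.1f} MB")
```

Under these assumed sizes, float32 halves the footprint of each dense array compared with float64.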

Return type:

AnnData

Returns:

AnnData object with all data needed for downstream analysis

Data Storage:

  • .X: Sparse log-transformed motif similarity array (n_seqlets × n_motifs)

  • .obs: Seqlet metadata and variable-length arrays stored per seqlet

    • Standard metadata: coordinates, attribution, p-values

    • .obs["seqlet_matrix"]: Individual seqlet contribution matrices

    • .obs["seqlet_oh"]: Individual seqlet one-hot sequences

  • .obs: Additional seqlet mapping indices

    • .obs["example_oh_idx"]: Index into unique examples for one-hot sequences

    • .obs["example_contrib_idx"]: Index into unique examples for contribution scores

  • .uns: Memory-efficient storage for unique examples

    • .uns["unique_examples"]["oh"]: Unique example one-hot sequences (n_unique_examples × 4 × length)

    • .uns["unique_examples"]["contrib"]: Unique example contribution scores (n_unique_examples × 4 × length)

  • .var: Motif names and annotations

    • .var["motif_ppm"]: Individual motif PPM matrices

    • .var["dbd"]: DNA-binding domain annotations

    • .var["direct_annot"]: Direct TF annotations

    • Other annotation columns from motif_annotations DataFrame
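The unique-example layout described above can be illustrated with plain NumPy. This is only a sketch of the general idea (store each referenced example once, keep a per-seqlet index into that store); it is not taken from the tfmindi source, and the array names, sizes, and the seqlet-to-example assignment are hypothetical.

```python
import numpy as np

# Hypothetical input: one-hot sequences with shape (n_examples, 4, length).
rng = np.random.default_rng(0)
n_examples, length = 5, 12
bases = rng.integers(0, 4, size=(n_examples, length))
oh = np.eye(4, dtype=np.float32)[bases].transpose(0, 2, 1)

# Hypothetical example index for each of six seqlets; several seqlets
# come from the same example.
seqlet_example_idx = np.array([3, 0, 3, 3, 0, 4])

# Keep every referenced example once. `inverse` gives, for each seqlet,
# the row of its example in the deduplicated array.
unique_idx, inverse = np.unique(seqlet_example_idx, return_inverse=True)
unique_oh = oh[unique_idx]   # plays the role of .uns["unique_examples"]["oh"]
example_oh_idx = inverse     # plays the role of .obs["example_oh_idx"]

# Look up the full example for every seqlet through the index.
recovered = unique_oh[example_oh_idx]
print(unique_oh.shape, recovered.shape)
```

Six seqlets here reference only three distinct examples, so three arrays are stored instead of six, and indexing with example_oh_idx recovers the example for any seqlet.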

Examples

>>> import tfmindi as tm
>>> seqlets_df, seqlet_matrices = tm.pp.extract_seqlets(contrib, oh)
>>> similarity_matrix = tm.pp.calculate_motif_similarity(seqlet_matrices, motifs)
>>> adata = tm.pp.create_seqlet_adata(
...     similarity_matrix,
...     seqlets_df,
...     seqlet_matrices=seqlet_matrices,
...     oh_sequences=oh,
...     contrib_scores=contrib,
...     motif_collection=motifs,
...     motif_annotations=annotations,
...     motif_to_dbd=motif_to_dbd_dict,
... )
>>> print(adata.shape)
(295, 17995)