tfmindi.pp.create_seqlet_adata
- tfmindi.pp.create_seqlet_adata(similarity_matrix, seqlet_metadata, seqlet_matrices=None, oh_sequences=None, contrib_scores=None, motif_names=None, motif_collection=None, motif_annotations=None, motif_to_dbd=None, dtype=<class 'numpy.float32'>)
Create a comprehensive AnnData object that stores all seqlet data for the analysis pipeline.
- Parameters:
similarity_matrix (csr_array) – Sparse log-transformed similarity array with shape (n_seqlets, n_motifs)
seqlet_metadata (DataFrame) – DataFrame with seqlet coordinates and metadata
seqlet_matrices (list[ndarray[Any, dtype[floating]]] | None (default: None)) – List of seqlet contribution matrices, each with shape (4, length)
oh_sequences (ndarray[Any, dtype[floating]] | None (default: None)) – One-hot sequences for each seqlet region with shape (n_examples, 4, total_length)
contrib_scores (ndarray[Any, dtype[floating]] | None (default: None)) – Raw contribution scores for each seqlet region with shape (n_examples, 4, total_length)
motif_names (list[str] | list[tuple[str, str]] | None (default: None)) – List of motif names corresponding to similarity matrix columns
motif_collection (dict[tuple[str, str], ndarray[Any, dtype[floating]]] | list[ndarray[Any, dtype[floating]]] | None (default: None)) – Dictionary or list of motif PPM matrices, each with shape (4, length)
motif_annotations (DataFrame | None (default: None)) – DataFrame with motif annotations containing TF names and other metadata
motif_to_dbd (dict[str, str] | None (default: None)) – Dictionary mapping motif names to DNA-binding domain annotations
dtype (type[floating] (default: <class 'numpy.float32'>)) – Data type for numerical arrays, used to reduce memory usage
- Return type:
AnnData
- Returns:
AnnData object with all data needed for downstream analysis
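The shapes listed under Parameters can be easier to check against a small synthetic example. The following is a minimal sketch that builds toy inputs with those shapes using NumPy, pandas, and SciPy only; it does not call tfmindi. The metadata column names (`example_idx`, `start`, `end`) and the motif names are assumptions made for illustration, not a documented schema.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_array

rng = np.random.default_rng(0)
n_seqlets, n_motifs, n_examples, total_length = 5, 3, 2, 20

# Stand-in for the sparse similarity array, shape (n_seqlets, n_motifs).
dense = rng.random((n_seqlets, n_motifs)).astype(np.float32)
dense[dense < 0.5] = 0.0
similarity_matrix = csr_array(dense)

# Seqlet coordinates; these column names are hypothetical.
seqlet_metadata = pd.DataFrame(
    {
        "example_idx": [0, 0, 1, 1, 1],
        "start": [2, 10, 0, 5, 12],
        "end": [8, 16, 6, 11, 18],
    }
)

# One contribution matrix per seqlet, each with shape (4, length).
seqlet_matrices = [
    rng.normal(size=(4, end - start)).astype(np.float32)
    for start, end in zip(seqlet_metadata["start"], seqlet_metadata["end"])
]

# One-hot sequences, shape (n_examples, 4, total_length): one base per position.
bases = rng.integers(0, 4, size=(n_examples, total_length))
oh_sequences = np.zeros((n_examples, 4, total_length), dtype=np.float32)
oh_sequences[np.arange(n_examples)[:, None], bases, np.arange(total_length)] = 1.0

# Raw contribution scores share the shape of the one-hot array.
contrib_scores = rng.normal(size=oh_sequences.shape).astype(np.float32)

# One name per similarity-matrix column (placeholder names).
motif_names = [f"motif_{i}" for i in range(n_motifs)]
```

Arrays built this way line up with the positional and keyword arguments in the signature above: one row of `similarity_matrix` and one entry of `seqlet_matrices` per row of `seqlet_metadata`, and one motif name per column.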
Data Storage:
.X: Sparse log-transformed motif similarity array (n_seqlets × n_motifs)
.obs: Seqlet metadata and variable-length arrays stored per seqlet
- Standard metadata: coordinates, attribution, p-values
- .obs["seqlet_matrix"]: Individual seqlet contribution matrices
- .obs["seqlet_oh"]: Individual seqlet one-hot sequences
.obs: Additional seqlet mapping indices
- .obs["example_oh_idx"]: Index into unique examples for one-hot sequences
- .obs["example_contrib_idx"]: Index into unique examples for contribution scores
.uns: Memory-efficient storage for unique examples
- .uns["unique_examples"]["oh"]: Unique example one-hot sequences (n_unique_examples × 4 × length)
- .uns["unique_examples"]["contrib"]: Unique example contribution scores (n_unique_examples × 4 × length)
.var: Motif names and annotations
- .var["motif_ppm"]: Individual motif PPM matrices
- .var["dbd"]: DNA-binding domain annotations
- .var["direct_annot"]: Direct TF annotations
- Other annotation columns from motif_annotations DataFrame
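The pairing of per-seqlet indices in .obs with unique example arrays in .uns can be illustrated with plain NumPy. This is a sketch of the general index-based deduplication idea, not necessarily how tfmindi implements it; the `seqlet_example_idx` array and all values are made up for the example.

```python
import numpy as np

# Hypothetical: the input example each of six seqlets was extracted from.
seqlet_example_idx = np.array([7, 7, 2, 7, 2, 9])

# Toy contribution scores for 10 examples, shape (n_examples, 4, length).
rng = np.random.default_rng(0)
contrib_scores = rng.normal(size=(10, 4, 6)).astype(np.float32)

# Keep only examples referenced by at least one seqlet. The inverse array
# maps each seqlet to its row in the compact array, playing the role of
# an index such as .obs["example_contrib_idx"].
unique_ids, example_contrib_idx = np.unique(seqlet_example_idx, return_inverse=True)
unique_contrib = contrib_scores[unique_ids]  # (n_unique_examples, 4, length)

# Look up the full example for seqlet 3 through its index.
recovered = unique_contrib[example_contrib_idx[3]]
```

Storing three unique examples plus six small integers, instead of six full copies, is what makes this layout memory-efficient when many seqlets come from the same example.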
Examples
>>> import tfmindi as tm
>>> seqlets_df, seqlet_matrices = tm.pp.extract_seqlets(contrib, oh)
>>> similarity_matrix = tm.pp.calculate_motif_similarity(seqlet_matrices, motifs)
>>> adata = tm.pp.create_seqlet_adata(
...     similarity_matrix,
...     seqlets_df,
...     seqlet_matrices=seqlet_matrices,
...     oh_sequences=oh,
...     contrib_scores=contrib,
...     motif_collection=motifs,
...     motif_annotations=annotations,
...     motif_to_dbd=motif_to_dbd_dict,
... )
>>> print(adata.shape)
(295, 17995)