Preprocessing#
This tutorial covers the preprocessing steps of TF-MInDi:
Loading motif collections and annotations
Extracting seqlets from contribution scores
Calculating motif similarities
Creating and saving tfmindi’s custom {class}anndata.Anndata objects
Data Requirements#
To get started with tfmindi, we need three types of data:
a motif database of known transcription factor motifs
one hot encoded sequences for genomic regions of interest
attribution scores representing nucleotide importances for these same regions from a trained neural network.
tfmindi provides functionality to fetch SCENIC+ motif collections if you don’t have your own available.
Getting attribution scores and one-hot encoded regions is outside the scope of tfmindi. We recommend using CREsted for this (i.e. with this function), yet any tool that can do this works fine (such as tangermeme). So long as you can load these into python as numpy arrays.
Download tutorial data#
Contribution scores#
Contribution score data to reproduce this tutorial is available on Zenodo using doi: 10.5281/zenodo.18757793. These are contribution scores generated using the deepBICCN model, see CREsted tutorial for more info.
!wget https://zenodo.org/records/18757794/files/tutorial_data.tar.gz
--2026-02-24 18:21:41-- https://zenodo.org/records/18757794/files/tutorial_data.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.48.75, 188.184.98.114, 188.184.103.118, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1654120666 (1.5G) [application/octet-stream]
Saving to: ‘tutorial_data.tar.gz’
tutorial_data.tar.g 100%[===================>] 1.54G 103MB/s in 17s
2026-02-24 18:21:58 (95.2 MB/s) - ‘tutorial_data.tar.gz’ saved [1654120666/1654120666]
!tar -xvf tutorial_data.tar.gz
tutorial_data/
tutorial_data/modisco_results_ft_2000/
tutorial_data/modisco_results_ft_2000/Astro_oh.npz
tutorial_data/modisco_results_ft_2000/SstChodl_contrib.npz
tutorial_data/modisco_results_ft_2000/Sncg_oh.npz
tutorial_data/modisco_results_ft_2000/Micro_PVM_contrib.npz
tutorial_data/modisco_results_ft_2000/L6b_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Micro_PVM_oh.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_contrib.npz
tutorial_data/modisco_results_ft_2000/Lamp5_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_contrib.npz
tutorial_data/modisco_results_ft_2000/OPC_modisco_results.h5
tutorial_data/modisco_results_ft_2000/VLMC_oh.npz
tutorial_data/modisco_results_ft_2000/SstChodl_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sncg_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_oh.npz
tutorial_data/modisco_results_ft_2000/L6IT_oh.npz
tutorial_data/modisco_results_ft_2000/Oligo_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/OPC_contrib.npz
tutorial_data/modisco_results_ft_2000/SstChodl_oh.npz
tutorial_data/modisco_results_ft_2000/Sst_oh.npz
tutorial_data/modisco_results_ft_2000/Endo_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L2_3IT_contrib.npz
tutorial_data/modisco_results_ft_2000/Astro_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_oh.npz
tutorial_data/modisco_results_ft_2000/L2_3IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sncg_contrib.npz
tutorial_data/modisco_results_ft_2000/L6b_contrib.npz
tutorial_data/modisco_results_ft_2000/Pvalb_contrib.npz
tutorial_data/modisco_results_ft_2000/Oligo_oh.npz
tutorial_data/modisco_results_ft_2000/Endo_oh.npz
tutorial_data/modisco_results_ft_2000/Sst_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sst_contrib.npz
tutorial_data/modisco_results_ft_2000/OPC_oh.npz
tutorial_data/modisco_results_ft_2000/L5ET_contrib.npz
tutorial_data/modisco_results_ft_2000/Micro_PVM_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Lamp5_oh.npz
tutorial_data/modisco_results_ft_2000/VLMC_contrib.npz
tutorial_data/modisco_results_ft_2000/L5IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Pvalb_oh.npz
tutorial_data/modisco_results_ft_2000/L5ET_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_contrib.npz
tutorial_data/modisco_results_ft_2000/Endo_contrib.npz
tutorial_data/modisco_results_ft_2000/L6IT_contrib.npz
tutorial_data/modisco_results_ft_2000/L5ET_oh.npz
tutorial_data/modisco_results_ft_2000/Astro_contrib.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_oh.npz
tutorial_data/modisco_results_ft_2000/L5IT_contrib.npz
tutorial_data/modisco_results_ft_2000/L2_3IT_oh.npz
tutorial_data/modisco_results_ft_2000/Pvalb_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6b_oh.npz
tutorial_data/modisco_results_ft_2000/L5IT_oh.npz
tutorial_data/modisco_results_ft_2000/Oligo_contrib.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/VLMC_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Lamp5_contrib.npz
Sampled motifs#
To reduce the number of TF-MINDI features (decreasing run time and memory requirements) we clustered our motif collection and from each cluster sampled 40 motifs. Let’s download a text file with those sampled motifs. Note, the use of this sampled collection is optional.
!wget https://raw.githubusercontent.com/aertslab/TF-MINDI/refs/heads/main/paper/sampled_motifs.txt
--2026-02-24 18:26:41-- https://raw.githubusercontent.com/aertslab/TF-MINDI/refs/heads/main/paper/sampled_motifs.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105797 (103K) [text/plain]
Saving to: ‘sampled_motifs.txt’
sampled_motifs.txt 100%[===================>] 103.32K --.-KB/s in 0.01s
2026-02-24 18:26:42 (9.57 MB/s) - ‘sampled_motifs.txt’ saved [105797/105797]
Fetching motif collections and annotations#
On top of the contribution scores and sequences, we’ll need motif collections to calculate similarity scores using calculate_motif_similarity() with the extracted seqlets.
TF-MInDi provides functionality to fetch and load some of the default SCENIC+ motif databases.
We can also fetch some motif annotations and motif-to-dbd mappings to visualize the seqlets later.
import tfmindi as tm
import os
import re
import numpy as np
from tqdm import tqdm
# fetch collection and annotations
motif_collection_dir = tm.fetch_motif_collection()
motif_annotations_file = tm.fetch_motif_annotations()
# to speed up all analyses, we can choose to only load a sampled subset of motifs. We have selected some and stored those names in a .txt file.
motif_samples_path = "sampled_motifs.txt"
with open(motif_samples_path) as f:
motif_names = [line.strip() for line in f.readlines()]
# load them as dictionary of PPM matrices
motif_collection = tm.load_motif_collection(motif_collection_dir, motif_names=motif_names)
motif_annotations = tm.load_motif_annotations(motif_annotations_file)
# load motif to dna-binding domain (DBD) mapping
motif_to_db = tm.load_motif_to_dbd(motif_annotations)
Extracting seqlets using tangermeme#
We use tangermeme to extract seqlets (spans of nucleotides with high importance scores) from our nucleotide level contribution scores per cell type coming from a CREsted model.
We calculated these scores for cell-type specific enhancer regions only so we will be able to link each region back to a specific cell type, but this is not required to run tfmindi.
To extract seqlets, we use the extract_seqlets() function, which wraps tangermemes recursive_seqlets functionality.
The resulting seqlets will be scaled to a range of [-1,1] and sign corrected so the average contribution values are always positive.
The expected input shape of the contribution scores and one hot-encoded regions is (N, 4, W).
# extract_seqlets expects the inputs to be in shape (n, 4, region_width), so we concatenate the cell type specific contributions
CONTRIB_FOLDER = "tutorial_data/modisco_results_ft_2000/"
contrib_list = []
oh_list = []
classes_list = []
# getting cell type names from file names (optional, not required to run tfmindi)
class_names = [
re.match(r"(.+?)_oh\.npz$", f).group(1) # type: ignore
for f in os.listdir(CONTRIB_FOLDER)
if f.endswith("_oh.npz")
]
for i, c in enumerate(tqdm(class_names)):
contrib_list.append(np.load(os.path.join(CONTRIB_FOLDER, f"{c}_contrib.npz"))["arr_0"])
oh_list.append(np.load(os.path.join(CONTRIB_FOLDER, f"{c}_oh.npz"))["arr_0"])
classes_list.append(np.repeat(c, oh_list[i].shape[0]))
oh = np.concatenate(oh_list)
contrib = np.concatenate(contrib_list)
classes = np.concatenate(classes_list) # not required to run tfmindi, but useful for interpretation later
region_id_to_ct_map = {
i: str(c) for i, c in enumerate(classes)
} # not required to run tfmindi, but useful for interpretation later
# extract seqlets from the contributions and one-hot encoded sequences
seqlets_df, seqlets_matrices = tm.pp.extract_seqlets(
contrib=contrib,
oh=oh,
threshold=0.05, # importance threshold for seqlet extraction. Lower = fewer seqlets.
additional_flanks=3, # flanking bases to include around seqlet
)
len(seqlets_matrices)
679653
seqlets_df.head(3) # information on seqlet position and significance
| example_idx | start | end | attribution | p-value | |
|---|---|---|---|---|---|
| 0 | 18770 | 1052 | 1065 | 22.867273 | 2.963802e-15 |
| 1 | 19895 | 1103 | 1134 | 42.865262 | 2.971223e-15 |
| 2 | 19813 | 1045 | 1059 | 13.644972 | 4.303625e-15 |
seqlets_matrices[0].shape # seqlets_matrices is a list with len(seqlets), that contains the scaled matrix per seqlet
(4, 13)
Calculating motif similarity#
Next, we calculate similarity scores using calculate_motif_similarity() between the extracted seqlets and the SCENIC+ motif collection by using memelite’s TomTom implementation.
The resulting similarity matrix will be log-transformed and negated before performing clustering.
sim_matrix = tm.pp.calculate_motif_similarity(
seqlets_matrices,
motif_collection,
chunk_size=50000, # you can supply a seqlet chunk_size if you have memory constraints
n_nearest=100, # you can limit the number of nearest motif similarities to store per seqlet to reduce memory usage
)
print(sim_matrix.shape) # (n_seqlets, n_motifs)
(679653, 3989)
We can store the data objects we have processed so far together in a single Anndata object.
This anndata object will become the input for all our tfmindi.tl tooling and tfmindi.pl plotting functions.
The motif annotations etc are not mandatory but can be used to create visualisations later.
adata = tm.pp.create_seqlet_adata(
sim_matrix, # mandatory
seqlets_df, # mandatory
seqlet_matrices=seqlets_matrices,
oh_sequences=oh,
contrib_scores=contrib,
motif_collection=motif_collection,
motif_annotations=motif_annotations,
motif_to_dbd=motif_to_db,
)
adata
AnnData object with n_obs × n_vars = 679653 × 3989
obs: 'example_idx', 'start', 'end', 'attribution', 'p-value', 'seqlet_matrix', 'seqlet_oh', 'example_oh_idx', 'example_contrib_idx'
var: 'motif_ppm', 'Direct_annot', 'Motif_similarity_annot', 'Orthology_annot', 'Motif_similarity_and_Orthology_annot', 'dbd'
uns: 'unique_examples'
# optional: let's also add a "cell_type" column to the adata.obs based on our classes and example_idx (the region's idx)
adata.obs["cell_type"] = adata.obs["example_idx"].map(region_id_to_ct_map).astype("category")
adata.obs.head(3)
| example_idx | start | end | attribution | p-value | seqlet_matrix | seqlet_oh | example_oh_idx | example_contrib_idx | cell_type | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18770 | 1052 | 1065 | 22.867273 | 2.963802e-15 | [[-0.38142416, -0.21325767, 0.56363165, -0.447... | [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,... | 0 | 0 | Oligo |
| 1 | 19895 | 1103 | 1134 | 42.865262 | 2.971223e-15 | [[0.9126998, 0.572017, 0.33544934, 0.30004725,... | [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,... | 1 | 1 | Oligo |
| 2 | 19813 | 1045 | 1059 | 13.644972 | 4.303625e-15 | [[0.18059267, -0.007956118, -0.34745765, -0.09... | [[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0,... | 2 | 2 | Oligo |
adata.var.head(3)
| motif_ppm | Direct_annot | Motif_similarity_annot | Orthology_annot | Motif_similarity_and_Orthology_annot | dbd | |
|---|---|---|---|---|---|---|
| hocomoco__CENPB_HUMAN.H11MO.0.D | [[0.04821992, 0.1383506, 0.012167643, 0.000450... | CENPB | NaN | NaN | NaN | CENPB |
| hocomoco__CPEB1_HUMAN.H11MO.0.D | [[0.1636162, 0.0005546312, 0.0049916804, 0.000... | CPEB1 | ZNF287, ZNF362 | NaN | NaN | C2H2 ZF |
| neph__UW.Motif.0049 | [[0.870051, 0.059264, 0.038039, 0.069257, 0.94... | NaN | POU2F1, E2F1, YY1, PAX4, POU2F2 | NaN | NANOG, NANOGP8, SOX2 | Homeodomain |
Saving our preprocesssed data#
Let’s save our anndata so we don’t have to run these analyses again.
We can’t use standard Anndata functionality for this, since we’re storing numpy arrays in both our .obs and .var. This is a logical way to structure our data, but is not allowed by Anndata when saving and loading.
Therefore, we have our own I/O functions that are simple wrappers around Anndata’s functions (wherein the numpy arrays are moved to and back from .uns, which does allow for arrays).
You can save and load anndatas with save_h5ad() and load_h5ad().
Be aware that these files can become large.
tm.save_h5ad(adata, "seqlets.h5ad")