Preprocessing#

This tutorial covers the preprocessing steps of TF-MInDi:

Loading motif collections and annotations
Extracting seqlets from contribution scores
Calculating motif similarities
Creating and saving tfmindi’s custom {class}anndata.Anndata objects

Data Requirements#

To get started with tfmindi, we need three types of data:

a motif database of known transcription factor motifs
one hot encoded sequences for genomic regions of interest
attribution scores representing nucleotide importances for these same regions from a trained neural network.

tfmindi provides functionality to fetch SCENIC+ motif collections if you don’t have your own available.
Getting attribution scores and one-hot encoded regions is outside the scope of tfmindi. We recommend using CREsted for this (i.e. with this function), yet any tool that can do this works fine (such as tangermeme). So long as you can load these into python as numpy arrays.

Download tutorial data#

Contribution scores#

Contribution score data to reproduce this tutorial is available on Zenodo using doi: 10.5281/zenodo.18757793. These are contribution scores generated using the deepBICCN model, see CREsted tutorial for more info.

!wget https://zenodo.org/records/18757794/files/tutorial_data.tar.gz

--2026-02-24 18:21:41--  https://zenodo.org/records/18757794/files/tutorial_data.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.48.75, 188.184.98.114, 188.184.103.118, ...
Connecting to zenodo.org (zenodo.org)|188.185.48.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1654120666 (1.5G) [application/octet-stream]
Saving to: ‘tutorial_data.tar.gz’

tutorial_data.tar.g 100%[===================>]   1.54G   103MB/s    in 17s     

2026-02-24 18:21:58 (95.2 MB/s) - ‘tutorial_data.tar.gz’ saved [1654120666/1654120666]

!tar -xvf tutorial_data.tar.gz

tutorial_data/
tutorial_data/modisco_results_ft_2000/
tutorial_data/modisco_results_ft_2000/Astro_oh.npz
tutorial_data/modisco_results_ft_2000/SstChodl_contrib.npz
tutorial_data/modisco_results_ft_2000/Sncg_oh.npz
tutorial_data/modisco_results_ft_2000/Micro_PVM_contrib.npz
tutorial_data/modisco_results_ft_2000/L6b_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Micro_PVM_oh.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_contrib.npz
tutorial_data/modisco_results_ft_2000/Lamp5_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_contrib.npz
tutorial_data/modisco_results_ft_2000/OPC_modisco_results.h5
tutorial_data/modisco_results_ft_2000/VLMC_oh.npz
tutorial_data/modisco_results_ft_2000/SstChodl_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sncg_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_oh.npz
tutorial_data/modisco_results_ft_2000/L6IT_oh.npz
tutorial_data/modisco_results_ft_2000/Oligo_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/OPC_contrib.npz
tutorial_data/modisco_results_ft_2000/SstChodl_oh.npz
tutorial_data/modisco_results_ft_2000/Sst_oh.npz
tutorial_data/modisco_results_ft_2000/Endo_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L2_3IT_contrib.npz
tutorial_data/modisco_results_ft_2000/Astro_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_oh.npz
tutorial_data/modisco_results_ft_2000/L2_3IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sncg_contrib.npz
tutorial_data/modisco_results_ft_2000/L6b_contrib.npz
tutorial_data/modisco_results_ft_2000/Pvalb_contrib.npz
tutorial_data/modisco_results_ft_2000/Oligo_oh.npz
tutorial_data/modisco_results_ft_2000/Endo_oh.npz
tutorial_data/modisco_results_ft_2000/Sst_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Vip_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Sst_contrib.npz
tutorial_data/modisco_results_ft_2000/OPC_oh.npz
tutorial_data/modisco_results_ft_2000/L5ET_contrib.npz
tutorial_data/modisco_results_ft_2000/Micro_PVM_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Lamp5_oh.npz
tutorial_data/modisco_results_ft_2000/VLMC_contrib.npz
tutorial_data/modisco_results_ft_2000/L5IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Pvalb_oh.npz
tutorial_data/modisco_results_ft_2000/L5ET_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6CT_contrib.npz
tutorial_data/modisco_results_ft_2000/Endo_contrib.npz
tutorial_data/modisco_results_ft_2000/L6IT_contrib.npz
tutorial_data/modisco_results_ft_2000/L5ET_oh.npz
tutorial_data/modisco_results_ft_2000/Astro_contrib.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_oh.npz
tutorial_data/modisco_results_ft_2000/L5IT_contrib.npz
tutorial_data/modisco_results_ft_2000/L2_3IT_oh.npz
tutorial_data/modisco_results_ft_2000/Pvalb_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6b_oh.npz
tutorial_data/modisco_results_ft_2000/L5IT_oh.npz
tutorial_data/modisco_results_ft_2000/Oligo_contrib.npz
tutorial_data/modisco_results_ft_2000/L5_6NP_modisco_results.h5
tutorial_data/modisco_results_ft_2000/L6IT_modisco_results.h5
tutorial_data/modisco_results_ft_2000/VLMC_modisco_results.h5
tutorial_data/modisco_results_ft_2000/Lamp5_contrib.npz

Sampled motifs#

To reduce the number of TF-MINDI features (decreasing run time and memory requirements) we clustered our motif collection and from each cluster sampled 40 motifs. Let’s download a text file with those sampled motifs. Note, the use of this sampled collection is optional.

!wget https://raw.githubusercontent.com/aertslab/TF-MINDI/refs/heads/main/paper/sampled_motifs.txt

--2026-02-24 18:26:41--  https://raw.githubusercontent.com/aertslab/TF-MINDI/refs/heads/main/paper/sampled_motifs.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105797 (103K) [text/plain]
Saving to: ‘sampled_motifs.txt’

sampled_motifs.txt  100%[===================>] 103.32K  --.-KB/s    in 0.01s   

2026-02-24 18:26:42 (9.57 MB/s) - ‘sampled_motifs.txt’ saved [105797/105797]

Fetching motif collections and annotations#

On top of the contribution scores and sequences, we’ll need motif collections to calculate similarity scores using calculate_motif_similarity() with the extracted seqlets.
TF-MInDi provides functionality to fetch and load some of the default SCENIC+ motif databases.
We can also fetch some motif annotations and motif-to-dbd mappings to visualize the seqlets later.

import tfmindi as tm

import os
import re
import numpy as np
from tqdm import tqdm

# fetch collection and annotations
motif_collection_dir = tm.fetch_motif_collection()
motif_annotations_file = tm.fetch_motif_annotations()

# to speed up all analyses, we can choose to only load a sampled subset of motifs. We have selected some and stored those names in a .txt file.
motif_samples_path = "sampled_motifs.txt"
with open(motif_samples_path) as f:
    motif_names = [line.strip() for line in f.readlines()]

# load them as dictionary of PPM matrices
motif_collection = tm.load_motif_collection(motif_collection_dir, motif_names=motif_names)
motif_annotations = tm.load_motif_annotations(motif_annotations_file)

# load motif to dna-binding domain (DBD) mapping
motif_to_db = tm.load_motif_to_dbd(motif_annotations)

Extracting seqlets#

Seqlets are spans of nucleotides with high importance scores. We extract them from our nucleotide-level contribution scores (per cell type, coming from a CREsted model) with extract_seqlets(). We calculated these scores for cell-type-specific enhancer regions only, so we can link each region back to a specific cell type, but this is not required to run tfmindi.

extract_seqlets projects the contributions to a 1D importance track ((contrib * oh).sum(1)) and then applies a seqlet-calling algorithm, selectable through the method argument:

"recursive_q99_abs_smooth" (default) — smooths the track, normalizes each region by its 99th percentile, and runs the recursive caller on the absolute importance, so both activating (positive) and repressing (negative) seqlets are captured.
"recursive_raw" — the recursive caller on the raw signed track. This reproduces the behaviour of earlier TF-MInDi versions, based on tangermeme’s recursive_seqlets (which only calls positive seqlets).
"hysteresis" — a two-threshold local caller (high-importance seed, lower-importance growth).
"local_contrast" — a multi-scale sliding-window contrast caller.
"wavelet_otsu" — wavelet denoising followed by Otsu thresholding (requires PyWavelets).

Method-specific hyperparameters are passed as keyword arguments, e.g. threshold=0.05 and additional_flanks=3 for the recursive callers, seed_z=3.0 for "hysteresis", or threshold_scale=2.0 for "wavelet_otsu".

The resulting seqlets are scaled to a range of [-1, 1] and sign-corrected so the average contribution values are always positive.

The expected input shape of the contribution scores and one-hot-encoded regions is (N, 4, W).

# extract_seqlets expects the inputs to be in shape (n, 4, region_width), so we concatenate the cell type specific contributions
CONTRIB_FOLDER = "tutorial_data/modisco_results_ft_2000/"

contrib_list = []
oh_list = []
classes_list = []

# getting cell type names from file names (optional, not required to run tfmindi)
class_names = [
    re.match(r"(.+?)_oh\.npz$", f).group(1)  # type: ignore
    for f in os.listdir(CONTRIB_FOLDER)
    if f.endswith("_oh.npz")
]

for i, c in enumerate(tqdm(class_names)):
    contrib_list.append(np.load(os.path.join(CONTRIB_FOLDER, f"{c}_contrib.npz"))["arr_0"])
    oh_list.append(np.load(os.path.join(CONTRIB_FOLDER, f"{c}_oh.npz"))["arr_0"])
    classes_list.append(np.repeat(c, oh_list[i].shape[0]))

oh = np.concatenate(oh_list)
contrib = np.concatenate(contrib_list)
classes = np.concatenate(classes_list)  # not required to run tfmindi, but useful for interpretation later
region_id_to_ct_map = {
    i: str(c) for i, c in enumerate(classes)
}  # not required to run tfmindi, but useful for interpretation later

# Extract seqlets from the contributions and one-hot-encoded sequences.
# The default caller is "recursive_q99_abs_smooth"; other options are
# "recursive_raw", "hysteresis", "local_contrast", and "wavelet_otsu".
seqlets_df, seqlets_matrices = tm.pp.extract_seqlets(
    contrib=contrib,
    oh=oh,
    method="recursive_q99_abs_smooth",  # default seqlet caller
    threshold=0.05,  # importance threshold for the recursive callers. Lower = fewer seqlets.
    additional_flanks=3,  # flanking bases to include around each seqlet
)

# To switch caller and tune it in the same call, pass method + its keyword args, e.g.:
# seqlets_df, seqlets_matrices = tm.pp.extract_seqlets(contrib, oh, method="hysteresis", seed_z=3.0)

len(seqlets_matrices)

seqlets_df.head(3)  # information on seqlet position and significance

seqlets_matrices[0].shape  # seqlets_matrices is a list with len(seqlets), that contains the scaled matrix per seqlet

(4, 13)

Calculating motif similarity#

Next, we calculate similarity scores using calculate_motif_similarity() between the extracted seqlets and the SCENIC+ motif collection by using memelite’s TomTom implementation.
The resulting similarity matrix will be log-transformed and negated before performing clustering.

sim_matrix = tm.pp.calculate_motif_similarity(
    seqlets_matrices,
    motif_collection,
    chunk_size=50000,  # you can supply a seqlet chunk_size if you have memory constraints
    n_nearest=100,  # you can limit the number of nearest motif similarities to store per seqlet to reduce memory usage
)

print(sim_matrix.shape)  # (n_seqlets, n_motifs)

(679653, 3989)

We can store the data objects we have processed so far together in a single Anndata object.
This anndata object will become the input for all our tfmindi.tl tooling and tfmindi.pl plotting functions.
The motif annotations etc are not mandatory but can be used to create visualisations later.

adata = tm.pp.create_seqlet_adata(
    sim_matrix,  # mandatory
    seqlets_df,  # mandatory
    seqlet_matrices=seqlets_matrices,
    oh_sequences=oh,
    contrib_scores=contrib,
    motif_collection=motif_collection,
    motif_annotations=motif_annotations,
    motif_to_dbd=motif_to_db,
)
adata

# optional: let's also add a "cell_type" column to the adata.obs based on our classes and example_idx (the region's idx)
adata.obs["cell_type"] = adata.obs["example_idx"].map(region_id_to_ct_map).astype("category")

adata.obs.head(3)

adata.var.head(3)

	motif_ppm	Direct_annot	Motif_similarity_annot	Orthology_annot	Motif_similarity_and_Orthology_annot	dbd
hocomoco__CENPB_HUMAN.H11MO.0.D	[[0.04821992, 0.1383506, 0.012167643, 0.000450...	CENPB	NaN	NaN	NaN	CENPB
hocomoco__CPEB1_HUMAN.H11MO.0.D	[[0.1636162, 0.0005546312, 0.0049916804, 0.000...	CPEB1	ZNF287, ZNF362	NaN	NaN	C2H2 ZF
neph__UW.Motif.0049	[[0.870051, 0.059264, 0.038039, 0.069257, 0.94...	NaN	POU2F1, E2F1, YY1, PAX4, POU2F2	NaN	NANOG, NANOGP8, SOX2	Homeodomain

Saving our preprocesssed data#

Let’s save our anndata so we don’t have to run these analyses again.
We can’t use standard Anndata functionality for this, since we’re storing numpy arrays in both our .obs and .var. This is a logical way to structure our data, but is not allowed by Anndata when saving and loading.
Therefore, we have our own I/O functions that are simple wrappers around Anndata’s functions (wherein the numpy arrays are moved to and back from .uns, which does allow for arrays).
You can save and load anndatas with save_h5ad() and load_h5ad(). Be aware that these files can become large.

tm.save_h5ad(adata, "seqlets.h5ad")

Preprocessing

Contents