Region Topic Analysis#

This tutorial covers topic modeling analysis with TF-MInDi:

  • Running topic modeling on regions using run_topic_modeling()

  • Evaluating different topic numbers with evaluate_topic_models()

  • Analyzing topic-cluster and topic-DBD relationships

  • Visualizing topics in t-SNE space and heatmaps

import tfmindi as tm

DATA_DIR = "../../../../data/tfmindi/"
# Load clustered data from previous tutorial
adata = tm.load_h5ad(DATA_DIR + "seqlets_clustered.h5ad")

Running Topic Modeling#

Topic modeling helps discover co-occurring TF binding patterns across genomic regions. We use run_topic_modeling() which applies Latent Dirichlet Allocation (LDA) to find topics based on seqlet cluster compositions.

Each region gets a topic probability distribution, and each topic represents a pattern of co-occurring TF binding clusters.

The results of the topic modeling will be stored in our anndata.uns["topic_modeling"].

# Run topic modeling with 15 topics
tm.tl.run_topic_modeling(adata, n_topics=15, random_state=42)
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 15 topics...
Stored topic modeling results in adata.uns['topic_modeling']

Evaluating Topic Numbers#

To find the optimal number of topics, use evaluate_topic_models(). This tests different topic numbers and provides log likelihood scores to help choose the best model.
The best topic model will again be stored in the anndata.uns (will overwrite any existing models).

# Evaluate topic models with different numbers of topics
topic_range = [5, 20, 30, 50, 80, 100]
results = tm.tl.evaluate_topic_models(adata, topic_range, random_state=42)
Evaluating 6 different topic models...
Training model with 5 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 5 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 5 topics: log-likelihood = 1581135.96
Training model with 20 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 20 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 20 topics: log-likelihood = 4880502.14
Training model with 30 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 30 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 30 topics: log-likelihood = 5343808.11
Training model with 50 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 50 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 50 topics: log-likelihood = 5290414.07
Training model with 80 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 80 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 80 topics: log-likelihood = 4242165.21
Training model with 100 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 100 topics...
Stored topic modeling results in adata.uns['topic_modeling']
Model with 100 topics: log-likelihood = 3225860.75
Storing best model with 30 topics...
Filtered 0 seqlets with unknown DBD annotations
Using 679653 deduplicated seqlets across 37265 regions
Count matrix shape: (37265, 63) (regions × clusters)
Fitting LDA model with 30 topics...
Stored topic modeling results in adata.uns['topic_modeling']
# find n topics with highest log likelihood
best_n = max(results, key=results.get)
print(f"Best number of topics: {best_n} with log likelihood {results[best_n]}")
Best number of topics: 30 with log likelihood 5343808.110845597

Topic-Cluster Analysis#

Our topic modeling results include probabilities for which region belong to which topic.

adata.uns["topic_modeling"]["region_topic_matrix"].head()
Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6 Topic_7 Topic_8 Topic_9 Topic_10 ... Topic_21 Topic_22 Topic_23 Topic_24 Topic_25 Topic_26 Topic_27 Topic_28 Topic_29 Topic_30
0 0.023148 0.023148 0.023148 0.078704 0.023148 0.023148 0.023148 0.037037 0.023148 0.023148 ... 0.023148 0.023148 0.023148 0.037037 0.037037 0.023148 0.092593 0.023148 0.023148 0.037037
1 0.021930 0.035088 0.021930 0.021930 0.035088 0.021930 0.035088 0.021930 0.021930 0.021930 ... 0.021930 0.087719 0.021930 0.048246 0.074561 0.035088 0.035088 0.021930 0.021930 0.021930
2 0.026882 0.026882 0.043011 0.091398 0.026882 0.026882 0.026882 0.026882 0.026882 0.043011 ... 0.026882 0.026882 0.026882 0.059140 0.026882 0.026882 0.043011 0.026882 0.026882 0.026882
3 0.018519 0.018519 0.018519 0.096296 0.029630 0.018519 0.018519 0.074074 0.018519 0.029630 ... 0.018519 0.018519 0.018519 0.051852 0.074074 0.040741 0.107407 0.018519 0.018519 0.018519
4 0.023810 0.038095 0.038095 0.023810 0.038095 0.023810 0.023810 0.066667 0.023810 0.038095 ... 0.023810 0.066667 0.023810 0.023810 0.023810 0.038095 0.066667 0.038095 0.023810 0.023810

5 rows × 30 columns

Visualizing Topics#

DBD-Topic Heatmap#

Visualize the relationship between topics and DNA-binding domains using dbd_topic_heatmap(). This shows us which DBDs often co-occur.

%matplotlib inline
tm.pl.dbd_topic_heatmap(
    adata,
    width=12,
    height=8,
    x_label_rotation=90,
)
../_images/193c5bb8e05e0f2627079ae565231958a2d9ddbdc55ae604d6cc25136994cef9.png

Topic Visualization in t-SNE Space#

Plot t-SNE visualization of regions colored by topic probabilities using region_topic_tsne().

# Plot regions colored by dominant topic
tm.pl.region_topic_tsne(adata, topics_to_show=["Topic_1", "Topic_2", "Topic_3"], ncols=3, width=15)
Computing t-SNE coordinates from region-topic matrix...
../_images/85ce67afcda1f5dd1fa71aecda9998f6513077aecd22ca46f62e678e2cfca53a.png