Interpretable Patches - Mascharak#

  • This tutorial attempts to find wounding related genes using only the data originating from v3 sequencing runs contained in the Vu (2022) dataset.

  • The data is a combination of aged & young cells collected at time points 0DPW (uninjured), 4DPW and 7DPW.

  • Throughout this tutorial, we will use the workflow package with Patches to find genes that are associated with the 4DPW timepoint.

# Minimal mports
from ladder.data import get_data
from ladder.scripts import InterpretableWorkflow # Our workflow object to run the interpretable model
import umap, torch, pyro, os # To set seeds + umaps
import torch.optim as opt # For defining out optimizer


# For plotting
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# For data loading
import anndata as ad
import scanpy as sc

Formatting The Data#

The workflows use anndata objects as input. Specifically, the data we provide can also be manually downloaded here.

# Download the data object
get_data("Mascharak") 
Object saved at ./data/mascharak_2022_tn_wh.h5ad
# Load the anndata object
anndata = ad.read_h5ad("data/mascharak_2022_tn_wh.h5ad")
anndata.layers["normalized"] = anndata.X

# Find/subset HVGs & swap to raw counts
sc.pp.highly_variable_genes(anndata, n_top_genes=3000)
sc.pl.highly_variable_genes(anndata)

anndata = anndata[:, anndata.var["highly_variable"]]
anndata.X = anndata.layers["counts"]
../_images/a600a7d280fd9ca9be7c2867cb27ac8b8920f3706d16301496bcbd04548e21e9.png
anndata
View of AnnData object with n_obs × n_vars = 2284 × 3000
    obs: 'time', 'treatment', 'sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'n_counts', 'leiden', 'broad_type'
    var: 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'broad_type_colors', 'dendrogram_broad_type', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'sample_colors', 'time_colors', 'treatment_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'normalized'
    obsp: 'connectivities', 'distances'

Initializing the Interpretable Workflow#

  • Models can be run through two workflow interfaces: InterpretableWorkflow and CrossConditionWorkflow (see the corresponding tutorial)

  • Once initialized, you can train and evaluate the model through the workflow interface.

# Initialize workflow object
workflow = InterpretableWorkflow(anndata, verbose=True, random_seed=42)

# Define the condition classes & batch key to prepare the data
factors = ["time", "treatment", "broad_type"]
workflow.prep_model(factors, model_type='Patches', model_args={'ld_normalize' : True})

workflow.run_model(max_epochs=10000, convergence_threshold=1e-5, convergence_window=10000) # Lower the convergence threshold if you need a more accurate model, will increase training time
workflow.save_model("params/mascharak_ld")

Observing the Losses#

workflow.plot_loss()
../_images/8ea127d3512eeb874f1a5eee68f56d5a2c51e10b50d4e007636ebc2c85c2934a.png

Obtaining Latent Embeddings#

workflow.write_embeddings()
workflow.anndata.obsm
Written embeddings to object 'anndata.obsm' under workflow.
AxisArrays with keys: X_pca, X_umap, patches_w_latent, patches_z_latent

Evaluation of Model Reconstruction#

workflow.evaluate_reconstruction()
Calculating RMSE ...
Calculating Profile Correlation ...
Calculating 2-Sliced Wasserstein ...
Calculating Chamfer Discrepancy ...
Results
===================
RMSE : 10.877 +- 0.757
Profile Correlation : 0.92 +- 0.005
2-Sliced Wasserstein : 21.077 +- 1.42
Chamfer Discrepancy : 159.829 +- 86.174
{'RMSE': [10.877, 0.757],
 'Profile Correlation': [0.92, 0.005],
 '2-Sliced Wasserstein': [21.077, 1.42],
 'Chamfer Discrepancy': [159.829, 86.174]}

Interpretability - Common and Conditional Gene Loadings#

  • Patches can generate condition-specific scores for each gene, describing the association of the gene and the condition, even if the conditions were always encountered combinatorially.

  • Positive/Negative values mean that the given gene tends to be expressed relatively more/less for that condition compared to the basal state.

workflow.get_conditional_loadings()
workflow.get_common_loadings()
workflow.anndata.var
Written condition specific loadings to 'self.anndata.var'.
Written common loadings to 'self.anndata.var'.
mt ribo hb n_cells_by_counts mean_counts log1p_mean_counts pct_dropout_by_counts total_counts log1p_total_counts highly_variable ... Fibroblast_score_Patches Gamma-delta T Cell_score_Patches Granulocyte_score_Patches Macrophage_score_Patches Melanocyte_score_Patches Neutrophil_score_Patches Red Blood Cell_score_Patches Skeletal Muscle_score_Patches T Cell_score_Patches common_score_Patches
Sox17 False False False 14 0.018039 0.017878 99.540833 55.0 4.025352 True ... 0.126722 0.005121 0.073156 -0.224584 -0.135115 0.113830 -0.104980 0.100207 0.134213 0.092391
Adhfe1 False False False 27 0.010495 0.010441 99.114464 32.0 3.496508 True ... 0.040870 0.103973 -0.257579 -0.281439 -0.132003 0.034305 0.099367 0.087370 0.023955 0.536997
Cpa6 False False False 20 0.007871 0.007841 99.344047 24.0 3.218876 True ... 0.115090 0.094494 -0.050590 0.071744 0.157219 0.069963 0.097167 -0.020337 -0.088484 0.356553
Prex2 False False False 133 0.088882 0.085151 95.637914 271.0 5.605802 True ... 0.177728 -0.063112 0.151966 0.002831 -0.114980 -0.188440 0.072596 0.014520 -0.039014 -0.161527
Sulf1 False False False 168 0.126927 0.119494 94.489997 387.0 5.961005 True ... 0.168314 -0.109854 0.161246 -0.019156 0.215137 0.149358 -0.025992 0.086490 0.068549 -0.280242
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Gfra1 False False False 80 0.065595 0.063534 97.376189 200.0 5.303305 True ... 0.198698 0.024065 -0.002350 -0.106101 -0.037805 -0.134026 0.049773 -0.050908 -0.183474 -0.232669
Slc18a2 False False False 33 0.036733 0.036075 98.917678 112.0 4.727388 True ... -0.173744 0.156180 0.134977 -0.064383 0.069190 0.065861 -0.296098 -0.083482 0.225230 -0.013733
Rps12-ps3 False True False 368 0.145622 0.135947 87.930469 444.0 6.098074 True ... -0.075951 -0.060358 0.086086 0.011824 -0.006700 0.076035 0.034543 0.037943 -0.018578 -0.259352
Emx2 False False False 69 0.030830 0.030364 97.736963 94.0 4.553877 True ... 0.201313 0.149887 0.189365 -0.197546 0.185826 -0.089447 -0.022594 -0.010877 -0.275604 0.682746
CR974586.5 False False False 74 0.074779 0.072115 97.572975 228.0 5.433722 True ... 0.304274 -0.162427 0.168100 -0.238681 -0.177201 -0.240768 -0.105501 -0.045831 -0.150040 -0.046855

3000 rows × 28 columns

Bonus#

  1. Run the cell below to get the top 200 genes

  2. Copy & Paste the output to the Gene Ontology Resource

  3. Select ‘biological process’ and ‘Mus Musculus’ from the list below and click launch.

for gene in (workflow.anndata.var["Macrophage_score_Patches"]).sort_values(ascending=False)[:200].index:
    print(gene)
Lyz2
Lpl
Cd74
Cd14
Fn1
H2-Ab1
Ms4a1
H2-Aa
Psap
Thbs1
H2-Eb1
Map3k12
Ms4a6d
Mrc1
Msr1
Fscn1
Ccl8
Mmp14
Pla2g7
Mmp19
Apol7c
Hmox1
Fcgr1
Folr2
Fcrls
C1qc
Mgl2
Spic
Adora2b
Cd79a
Lgmn
Tlr2
Mmp12
Lifr
Gm26782
Retnla
Gsn
C1qb
Mmp13
Mafb
Gab1
Fxyd2
Lipa
Clec4e
Naaa
Lyz1
F13a1
H2-Ob
Chil3
Timd4
C1qa
Lrrc32
Txndc16
Il1rn
Lhfpl2
Sdc3
Cd207
Ctsl
Slc11a1
Aif1
H2-DMb1
Cd209f
Tbc1d4
Sh2b2
Rab32
Slc7a11
Lgals3
Ifi207
Gsr
G0s2
Mx1
Ifitm6
Cd38
Klk1
Pdpn
Cacnb3
Hbb-bt
Atr
Tnfaip2
9130015A21Rik
Adgre1
Thbd
Apoe
Rhov
Ms4a4c
Gla
C3
AI839979
Tppp3
Fnip2
F10
Mcoln2
Polr2l
Il1b
H2-M2
Cyp7b1
Wfdc17
Arg2
Ly86
Usp18
Gatm
Yae1d1
Il10
Fgd2
Zmynd15
Il27
Cox6a2
2010013B24Rik
Clec4b1
Cmtr1
Mzb1
Cdc42ep2
Cebpb
Ccl24
Ftl1
Oasl1
Anxa3
Ear2
Ctsk
Ccl12
Zfp36
Akr1b8
Ecm1
Tnfsf9
Rnase4
Bcl2l14
Igf1
Pf4
Ifi204
Pltp
Cbr3
Pmp22
Ccl2
Il1a
Lepr
Clec4d
A330069E16Rik
Arl11
Hpgd
Ace
Cd209d
Flrt3
Adam23
Clec10a
Daglb
Ccl22
Mrgpra2a
Tbc1d8
Npy
Cd300e
Fcrla
Gas6
Cpne2
Ly6i
Olfm1
Dennd4b
Cxcl16
Cfh
Cstb
Asprv1
Procr
Fmnl2
Fam102b
Dapk1
Cfb
Apod
Rogdi
Spint1
Pdxk
Spib
Rnf128
Emp1
Mmp8
Ly6d
Slc43a2
Klra2
Nlrp3
Nrg1
Ms4a7
Slc16a3
Cd33
Camk1
Rnasel
Ralgds
Ggct
Dntt
Plac8
Selenop
Arrdc4
Gas2l3
Fem1b
Il23a
Slco4a1
Sulf2
Cited4
Blnk
Arg1
9130230L23Rik
Mreg
Cxcl1