Interpretable Patches - Vu#

This tutorial attempts to find wounding related genes using only the data originating from v3 sequencing runs contained in the Vu (2022) dataset.
The data is a combination of aged & young cells collected at time points 0DPW (uninjured), 4DPW and 7DPW.
Throughout this tutorial, we will use the workflow package with Patches to find genes that are associated with the 4DPW timepoint.

# Minimal mports
from ladder.data import get_data
from ladder.scripts import InterpretableWorkflow # Our workflow object to run the interpretable model
import umap, torch, pyro, os # To set seeds + umaps
import torch.optim as opt # For defining out optimizer


# For plotting
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# For data loading
import anndata as ad
import scanpy as sc

Formatting The Data#

The workflows use anndata objects as input. Specifically, the data we provide can also be manually downloaded here.

# Download the data object
get_data("Vu") 

Object saved at ./data/vu_2022_ay_wh.h5ad

# Load the anndata object
anndata = ad.read_h5ad("data/vu_2022_ay_wh.h5ad")
anndata.layers["normalized"] = anndata.X

# Find/subset HVGs & swap to raw counts
sc.pp.highly_variable_genes(anndata, n_top_genes=3000, batch_key="sample")
sc.pl.highly_variable_genes(anndata)

anndata = anndata[:, anndata.var["highly_variable"]]
anndata.X = anndata.layers["counts"]

../_images/310e900a5d2fb2334bf89fcfc76400c672d16df13aebd5017e4c7949c7362a2b.png

anndata

View of AnnData object with n_obs × n_vars = 26966 × 3000
    obs: 'age', 'time', 'sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'n_genes', 'doublet_score', 'predicted_doublet', 'leiden', 'broad_type'
    var: 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'age_colors', 'broad_type_colors', 'dendrogram_broad_type', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'sample_colors', 'scrublet', 'time_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'normalized'
    obsp: 'connectivities', 'distances'

Initializing the Interpretable Workflow#

Models can be run through two workflow interfaces: InterpretableWorkflow and CrossConditionWorkflow (see the corresponding tutorial)
Once initialized, you can train and evaluate the model through the workflow interface.

# Initialize workflow object
workflow = InterpretableWorkflow(anndata, verbose=True, random_seed=42)

# Define the condition classes & batch key to prepare the data
factors = ["time", "age", "broad_type"]
workflow.prep_model(factors, batch_key="sample", model_type='Patches', model_args={'ld_normalize' : True})

workflow.run_model(max_epochs=1000, convergence_threshold=1e-5, convergence_window=1000) # Lower the convergence threshold if you need a more accurate model, will increase training time
workflow.save_model("params/vu_ld")

Observing the Losses#

workflow.plot_loss()

../_images/3b609c48a31fefd0499dfbc0392ace4aa15f5d39c9359e09d330cf32caec682f.png

Obtaining Latent Embeddings#

workflow.write_embeddings()
workflow.anndata.obsm

Written embeddings to object 'anndata.obsm' under workflow.

AxisArrays with keys: X_pca, X_umap, patches_w_latent, patches_z_latent

Evaluation of Model Reconstruction#

workflow.evaluate_reconstruction()

Calculating RMSE ...
Calculating Profile Correlation ...
Calculating 2-Sliced Wasserstein ...
Calculating Chamfer Discrepancy ...
Results
===================
RMSE : 8.685 +- 0.311
Profile Correlation : 0.962 +- 0.0
2-Sliced Wasserstein : 9.368 +- 0.701
Chamfer Discrepancy : 121.983 +- 38.043

{'RMSE': [8.685, 0.311],
 'Profile Correlation': [0.962, 0.0],
 '2-Sliced Wasserstein': [9.368, 0.701],
 'Chamfer Discrepancy': [121.983, 38.043]}

Interpretability - Common and Conditional Gene Loadings#

Patches can generate condition-specific scores for each gene, describing the association of the gene and the condition, even if the conditions were always encountered combinatorially.
Positive/Negative values mean that the given gene tends to be expressed relatively more/less for that condition compared to the basal state.

workflow.get_conditional_loadings()
workflow.get_common_loadings()
workflow.anndata.var

Written condition specific loadings to 'self.anndata.var'.
Written common loadings to 'self.anndata.var'.

	mt	ribo	hb	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts	highly_variable	...	Fibroblast_score_Patches	Granulocyte_score_Patches	Keratinocyte_score_Patches	Macrophage_score_Patches	Melanocyte_score_Patches	Neutrophil_score_Patches	Pericyte_score_Patches	Skeletal Muscle_score_Patches	T Cell_score_Patches	common_score_Patches
Sox17	False	False	False	447	0.025372	0.025055	99.026504	1165.0	7.061335	True	...	-0.172691	-0.213733	-0.064396	-0.231657	0.060180	-0.209632	-0.115450	0.081388	-0.108086	0.096553
Prex2	False	False	False	4967	0.223730	0.201903	89.182656	10273.0	9.237371	True	...	0.317856	-0.011027	-0.074715	-0.212266	0.028524	-0.037938	-0.004642	-0.077855	0.037797	0.032534
Slco5a1	False	False	False	1513	0.055491	0.054006	96.704924	2548.0	7.843456	True	...	0.222914	0.036253	0.057877	0.100015	-0.198793	0.147040	-0.023676	-0.251914	0.016166	0.692136
Jph1	False	False	False	852	0.020994	0.020777	98.144478	964.0	6.872128	True	...	0.172165	-0.114392	0.103020	-0.072656	0.072942	0.184129	-0.068458	0.068210	-0.286386	0.446361
Pi15	False	False	False	3381	0.195069	0.178204	92.636714	8957.0	9.100303	True	...	0.373721	-0.020328	0.234529	0.174611	0.064675	-0.166265	0.210245	-0.003035	-0.071886	0.256239
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Nrap	False	False	False	455	0.023325	0.023057	99.009082	1071.0	6.977282	True	...	-0.071591	0.169128	-0.172085	-0.043409	-0.007615	0.065033	0.030125	0.064808	-0.013635	-0.142112
Slc18a2	False	False	False	268	0.010279	0.010227	99.416338	472.0	6.159095	True	...	-0.054472	0.004801	0.046696	-0.108449	0.107959	-0.108925	0.136082	0.107105	-0.030343	-0.011968
Csf2ra	False	False	False	8080	0.353791	0.302908	82.403032	16245.0	9.695602	True	...	-0.031811	-0.172225	-0.039741	0.132805	0.180531	0.028858	-0.052445	0.145707	-0.058978	0.067027
AC133103.1	False	False	False	114	0.002962	0.002957	99.751726	136.0	4.919981	True	...	-0.181475	0.000465	-0.020807	0.096036	-0.083226	0.044714	-0.031599	-0.170607	0.073240	0.330647
AC168977.1	False	False	False	212	0.007100	0.007075	99.538297	326.0	5.789960	True	...	0.149703	0.001548	-0.214790	-0.002358	-0.134699	0.137109	0.022976	0.021951	0.250528	-0.161249

3000 rows × 31 columns

Bonus#

Run the cell below to get the top 200 genes
Copy & Paste the output to the Gene Ontology Resource
Select ‘biological process’ and ‘Mus Musculus’ from the list below and click launch.

for gene in (workflow.anndata.var["Fibroblast_score_Patches"]).sort_values(ascending=False)[:200].index:
    print(gene)

Dcn
Col1a2
Col1a1
Col3a1
Igfbp2
Sparc
Mfap4
Cpxm1
Gsn
Tnfaip6
Cilp
Serpina3n
Cpz
Efemp1
Tnc
Ogn
Sfrp2
Cxcl13
Scara5
Eln
Mmp3
Gpx3
Tnmd
Pi16
Ptx3
Postn
Cthrc1
Cyp26b1
Cyp2f2
Nov
Igfbp6
Saa3
Fmod
Fibin
Has1
Thbs4
Prg4
Lox
C1qtnf3
Peg3
Sfrp4
Col14a1
Gfpt2
Lbp
Stc1
Timp1
Rxfp1
Mme
Ugdh
C4b
Hmcn2
Igfbp4
Mfap5
Adamtsl2
C3
Cyr61
Itm2a
Pcsk5
Prss35
Tnn
Slc26a7
Col12a1
Plod2
Gas1
Pla1a
Adamts5
Aspn
Fbln7
Gdf10
Smpd3
Meg3
Col11a1
Chodl
Gldn
Coch
Grem1
Tgfb2
Gm16685
Dkk2
Mmp11
Apod
Myoc
Gpc3
Sned1
Cxcl12
Igf1
Ace
Cdkn2a
Scube3
Tctn1
Col8a1
Stmn2
Ptgis
Sfrp1
Cxcl5
Smoc2
Ndnf
Cdkn1c
Cp
Inmt
Col7a1
F13a1
Itih5
Wfdc12
Abca8a
Ccl19
Hgf
Itga2b
Col5a3
Ackr3
Ccl11
Ccl2
Timp3
Serpine1
Fgl2
Atp1a2
Efhd1
Il1r2
Csrp2
Ednrb
Slco2b1
Matn4
Lrrc15
Dnm3
Lgr5
Gbp2
Iigp1
Tmem158
Bmp4
Lepr
Notum
Gm12840
Cxcl9
Cygb
Pmp22
Dnm3os
Vwa5a
Hp
Rorb
F2r
Adm
Gpha2
Nol3
Rtn2
F830016B08Rik
Serpinb2
Sox11
Fam198b
Vamp5
Mapk12
Gatm
Nrep
Igtp
Nos2
1500015O10Rik
Plac8
Acan
Ccl7
Ifi205
Ptch1
Spp1
Depdc5
Pfn2
Col23a1
Tac1
Crabp1
Crispld2
H19
Irf1
Apoe
Malat1
Gm26532
Anxa3
Lmcd1
Mdk
Hspb2
Corin
Sncb
Kcnj8
Nexn
Gm6093
Snta1
Acyp2
Rarres1
Flnc
Cers5
Sgk1
Hist1h4i
Lsmem1
Edn3
Spon1
Pappa2
Gm42047
Ccl8
Pde4dip
Crlf1
Sox18
Ptn
Unc45b
Adamts1

Interpretable Patches - Vu

Contents

Interpretable Patches - Vu#

Formatting The Data#

Initializing the Interpretable Workflow#

Observing the Losses#

Obtaining Latent Embeddings#

Evaluation of Model Reconstruction#

Interpretability - Common and Conditional Gene Loadings#

Bonus#