Interpretable Patches - Mascharak#

This tutorial attempts to find wounding related genes using only the data originating from v3 sequencing runs contained in the Vu (2022) dataset.
The data is a combination of aged & young cells collected at time points 0DPW (uninjured), 4DPW and 7DPW.
Throughout this tutorial, we will use the workflow package with Patches to find genes that are associated with the 4DPW timepoint.

# Minimal mports
from ladder.data import get_data
from ladder.scripts import InterpretableWorkflow # Our workflow object to run the interpretable model
import umap, torch, pyro, os # To set seeds + umaps
import torch.optim as opt # For defining out optimizer


# For plotting
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# For data loading
import anndata as ad
import scanpy as sc

Formatting The Data#

The workflows use anndata objects as input. Specifically, the data we provide can also be manually downloaded here.

# Download the data object
get_data("Mascharak") 

Object saved at ./data/mascharak_2022_tn_wh.h5ad

# Load the anndata object
anndata = ad.read_h5ad("data/mascharak_2022_tn_wh.h5ad")
anndata.layers["normalized"] = anndata.X

# Find/subset HVGs & swap to raw counts
sc.pp.highly_variable_genes(anndata, n_top_genes=3000)
sc.pl.highly_variable_genes(anndata)

anndata = anndata[:, anndata.var["highly_variable"]]
anndata.X = anndata.layers["counts"]

../_images/a600a7d280fd9ca9be7c2867cb27ac8b8920f3706d16301496bcbd04548e21e9.png

anndata

View of AnnData object with n_obs × n_vars = 2284 × 3000
    obs: 'time', 'treatment', 'sample', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'n_counts', 'leiden', 'broad_type'
    var: 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'broad_type_colors', 'dendrogram_broad_type', 'dendrogram_leiden', 'hvg', 'leiden', 'leiden_colors', 'log1p', 'neighbors', 'pca', 'rank_genes_groups', 'sample_colors', 'time_colors', 'treatment_colors', 'umap'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    layers: 'counts', 'normalized'
    obsp: 'connectivities', 'distances'

Initializing the Interpretable Workflow#

Models can be run through two workflow interfaces: InterpretableWorkflow and CrossConditionWorkflow (see the corresponding tutorial)
Once initialized, you can train and evaluate the model through the workflow interface.

# Initialize workflow object
workflow = InterpretableWorkflow(anndata, verbose=True, random_seed=42)

# Define the condition classes & batch key to prepare the data
factors = ["time", "treatment", "broad_type"]
workflow.prep_model(factors, model_type='Patches', model_args={'ld_normalize' : True})

workflow.run_model(max_epochs=10000, convergence_threshold=1e-5, convergence_window=10000) # Lower the convergence threshold if you need a more accurate model, will increase training time
workflow.save_model("params/mascharak_ld")

Observing the Losses#

workflow.plot_loss()

../_images/8ea127d3512eeb874f1a5eee68f56d5a2c51e10b50d4e007636ebc2c85c2934a.png

Obtaining Latent Embeddings#

workflow.write_embeddings()
workflow.anndata.obsm

Written embeddings to object 'anndata.obsm' under workflow.

AxisArrays with keys: X_pca, X_umap, patches_w_latent, patches_z_latent

Evaluation of Model Reconstruction#

workflow.evaluate_reconstruction()

Calculating RMSE ...
Calculating Profile Correlation ...
Calculating 2-Sliced Wasserstein ...
Calculating Chamfer Discrepancy ...
Results
===================
RMSE : 10.877 +- 0.757
Profile Correlation : 0.92 +- 0.005
2-Sliced Wasserstein : 21.077 +- 1.42
Chamfer Discrepancy : 159.829 +- 86.174

{'RMSE': [10.877, 0.757],
 'Profile Correlation': [0.92, 0.005],
 '2-Sliced Wasserstein': [21.077, 1.42],
 'Chamfer Discrepancy': [159.829, 86.174]}

Interpretability - Common and Conditional Gene Loadings#

Patches can generate condition-specific scores for each gene, describing the association of the gene and the condition, even if the conditions were always encountered combinatorially.
Positive/Negative values mean that the given gene tends to be expressed relatively more/less for that condition compared to the basal state.

workflow.get_conditional_loadings()
workflow.get_common_loadings()
workflow.anndata.var

Written condition specific loadings to 'self.anndata.var'.
Written common loadings to 'self.anndata.var'.

	mt	ribo	hb	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts	highly_variable	...	Fibroblast_score_Patches	Gamma-delta T Cell_score_Patches	Granulocyte_score_Patches	Macrophage_score_Patches	Melanocyte_score_Patches	Neutrophil_score_Patches	Red Blood Cell_score_Patches	Skeletal Muscle_score_Patches	T Cell_score_Patches	common_score_Patches
Sox17	False	False	False	14	0.018039	0.017878	99.540833	55.0	4.025352	True	...	0.126722	0.005121	0.073156	-0.224584	-0.135115	0.113830	-0.104980	0.100207	0.134213	0.092391
Adhfe1	False	False	False	27	0.010495	0.010441	99.114464	32.0	3.496508	True	...	0.040870	0.103973	-0.257579	-0.281439	-0.132003	0.034305	0.099367	0.087370	0.023955	0.536997
Cpa6	False	False	False	20	0.007871	0.007841	99.344047	24.0	3.218876	True	...	0.115090	0.094494	-0.050590	0.071744	0.157219	0.069963	0.097167	-0.020337	-0.088484	0.356553
Prex2	False	False	False	133	0.088882	0.085151	95.637914	271.0	5.605802	True	...	0.177728	-0.063112	0.151966	0.002831	-0.114980	-0.188440	0.072596	0.014520	-0.039014	-0.161527
Sulf1	False	False	False	168	0.126927	0.119494	94.489997	387.0	5.961005	True	...	0.168314	-0.109854	0.161246	-0.019156	0.215137	0.149358	-0.025992	0.086490	0.068549	-0.280242
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Gfra1	False	False	False	80	0.065595	0.063534	97.376189	200.0	5.303305	True	...	0.198698	0.024065	-0.002350	-0.106101	-0.037805	-0.134026	0.049773	-0.050908	-0.183474	-0.232669
Slc18a2	False	False	False	33	0.036733	0.036075	98.917678	112.0	4.727388	True	...	-0.173744	0.156180	0.134977	-0.064383	0.069190	0.065861	-0.296098	-0.083482	0.225230	-0.013733
Rps12-ps3	False	True	False	368	0.145622	0.135947	87.930469	444.0	6.098074	True	...	-0.075951	-0.060358	0.086086	0.011824	-0.006700	0.076035	0.034543	0.037943	-0.018578	-0.259352
Emx2	False	False	False	69	0.030830	0.030364	97.736963	94.0	4.553877	True	...	0.201313	0.149887	0.189365	-0.197546	0.185826	-0.089447	-0.022594	-0.010877	-0.275604	0.682746
CR974586.5	False	False	False	74	0.074779	0.072115	97.572975	228.0	5.433722	True	...	0.304274	-0.162427	0.168100	-0.238681	-0.177201	-0.240768	-0.105501	-0.045831	-0.150040	-0.046855

3000 rows × 28 columns

Bonus#

Run the cell below to get the top 200 genes
Copy & Paste the output to the Gene Ontology Resource
Select ‘biological process’ and ‘Mus Musculus’ from the list below and click launch.

for gene in (workflow.anndata.var["Macrophage_score_Patches"]).sort_values(ascending=False)[:200].index:
    print(gene)

Lyz2
Lpl
Cd74
Cd14
Fn1
H2-Ab1
Ms4a1
H2-Aa
Psap
Thbs1
H2-Eb1
Map3k12
Ms4a6d
Mrc1
Msr1
Fscn1
Ccl8
Mmp14
Pla2g7
Mmp19
Apol7c
Hmox1
Fcgr1
Folr2
Fcrls
C1qc
Mgl2
Spic
Adora2b
Cd79a
Lgmn
Tlr2
Mmp12
Lifr
Gm26782
Retnla
Gsn
C1qb
Mmp13
Mafb
Gab1
Fxyd2
Lipa
Clec4e
Naaa
Lyz1
F13a1
H2-Ob
Chil3
Timd4
C1qa
Lrrc32
Txndc16
Il1rn
Lhfpl2
Sdc3
Cd207
Ctsl
Slc11a1
Aif1
H2-DMb1
Cd209f
Tbc1d4
Sh2b2
Rab32
Slc7a11
Lgals3
Ifi207
Gsr
G0s2
Mx1
Ifitm6
Cd38
Klk1
Pdpn
Cacnb3
Hbb-bt
Atr
Tnfaip2
9130015A21Rik
Adgre1
Thbd
Apoe
Rhov
Ms4a4c
Gla
C3
AI839979
Tppp3
Fnip2
F10
Mcoln2
Polr2l
Il1b
H2-M2
Cyp7b1
Wfdc17
Arg2
Ly86
Usp18
Gatm
Yae1d1
Il10
Fgd2
Zmynd15
Il27
Cox6a2
2010013B24Rik
Clec4b1
Cmtr1
Mzb1
Cdc42ep2
Cebpb
Ccl24
Ftl1
Oasl1
Anxa3
Ear2
Ctsk
Ccl12
Zfp36
Akr1b8
Ecm1
Tnfsf9
Rnase4
Bcl2l14
Igf1
Pf4
Ifi204
Pltp
Cbr3
Pmp22
Ccl2
Il1a
Lepr
Clec4d
A330069E16Rik
Arl11
Hpgd
Ace
Cd209d
Flrt3
Adam23
Clec10a
Daglb
Ccl22
Mrgpra2a
Tbc1d8
Npy
Cd300e
Fcrla
Gas6
Cpne2
Ly6i
Olfm1
Dennd4b
Cxcl16
Cfh
Cstb
Asprv1
Procr
Fmnl2
Fam102b
Dapk1
Cfb
Apod
Rogdi
Spint1
Pdxk
Spib
Rnf128
Emp1
Mmp8
Ly6d
Slc43a2
Klra2
Nlrp3
Nrg1
Ms4a7
Slc16a3
Cd33
Camk1
Rnasel
Ralgds
Ggct
Dntt
Plac8
Selenop
Arrdc4
Gas2l3
Fem1b
Il23a
Slco4a1
Sulf2
Cited4
Blnk
Arg1
9130230L23Rik
Mreg
Cxcl1

Interpretable Patches - Mascharak

Contents

Interpretable Patches - Mascharak#

Formatting The Data#

Initializing the Interpretable Workflow#

Observing the Losses#

Obtaining Latent Embeddings#

Evaluation of Model Reconstruction#

Interpretability - Common and Conditional Gene Loadings#

Bonus#