Note

This page was generated from Lineage_Tutorial.ipynb. Interactive online version: Colab badge.

Demo: Lineage-recorded data from Chan et al, (Nature 2019)#

This demo uses data from:

Molecular recording of mammalian embryogenesis

For this demo, you need the expression data (count matrix) for Embryo3 which can be found in the NCBI GEO database under accession GSM3302829

Specifically these three files:

  • GSM3302829_embryo3_SeuratBarcodes.tsv.gz

  • GSM3302829_embryo3_SeuratGenes.tsv.gz

  • GSM3302829_embryo3_SeuratNorm.mtx.gz

The cell-type kernels (for interpreting lineage groups) are located at GSE122187

  • GSE122187_CellStateKernels.xls.gz

Additionally, a reconstructed lineage tree is provided with this notebook (same directory) as made by Cassiopeia

  • 0726_E2-2_tree_greedy_priors.processed.txt

Covered Here#

  • Data preprocessing and filtering

  • Computing Autocorrelation in Hotspot to identify lineage-related genes

  • Computing local correlations between lineage genes to identify heritable modules

  • Plotting modules, correlations, and module scores

[1]:
import sys

#if branch is stable, will install via pypi, else will install from source
branch = "anndata"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install --quiet --upgrade numba
    !pip install --quiet git+https://github.com/yoseflab/hotspot@$branch#egg=hotspot
    !pip install --quiet scanpy
    !pip install --quiet mplscience
    !pip install --quiet ete3
    !pip3 install --upgrade xlrd==1.2.0
Requirement already satisfied: xlrd==1.2.0 in /usr/local/lib/python3.7/dist-packages (1.2.0)
[2]:
import warnings; warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import hotspot
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns
import mplscience

from scipy.io import mmread
from scipy.sparse import csr_matrix

import scanpy as sc
[3]:
%%capture
if IN_COLAB:
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratBarcodes.tsv.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratGenes.tsv.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratNorm.mtx.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE122nnn/GSE122187/suppl/GSE122187_CellStateKernels.xls.gz
[4]:
counts_raw = mmread("GSM3302829_embryo3_SeuratNorm.mtx.gz")
counts_raw = csr_matrix(counts_raw)
barcodes = pd.read_table("GSM3302829_embryo3_SeuratBarcodes.tsv.gz", header=None)[0]
barcodes = [x+'-1' for x in barcodes] # to match the newick file
genes = pd.read_table("GSM3302829_embryo3_SeuratGenes.tsv.gz", header=None)[0]
[5]:
%%capture
import gdown
url = "https://drive.google.com/uc?id=1FlrryhNv9Khwo_BstUTPCBOVaMgZLZ9N"
output = 'tree.txt'
gdown.download(url, output, quiet=False)
[6]:
# Load the tree and enumerate the leaves
from ete3 import Tree
tree = Tree("tree.txt", format=1)
leaves = set()
for tn in tree.traverse('postorder'):
    if tn.is_leaf():
        leaves.add(tn.name)
len(leaves)
[6]:
1756
[7]:
# Subset the count matrix to only the cells where the lineage was recoverable
is_valid = [x in leaves for x in barcodes]
is_valid_indices = np.nonzero(is_valid)[0]
valid_barcodes = [barcodes[i] for i in is_valid_indices]
[8]:
adata = sc.AnnData(counts_raw.transpose())
adata.obs_names = barcodes
adata.var_names = genes

adata = adata[valid_barcodes]
sc.pp.filter_genes(adata, min_cells=10)
Trying to set attribute `.var` of view, copying.

Creating the Hotspot object#

To start an analysis, first create the hotspot object When creating the object, you need to specify:

  • The gene counts matrix

  • Which background model to use

  • The latent space we are using to compute our cell metric

    • Here we use the inferred cell tree

  • The per-cell scaling factor

    • Here we use the number of umi per barcode

In this case, only the log-normalized counts are made available. In the Hotspot publication, we used the raw counts and the negative binomial (‘danb’) model. However, to make this example easier to run, here we just supply the log-normalized counts and select the ‘normal’ model.

Once the object is created, the neighborhood is then computed with create_knn_graph

The two options that are specificied are n_neighbors which determines the size of the neighborhood, and weighted_graph.

Here we set weighted_graph=False to just use binary, 0-1 weights (only binary weights are supported when using a lineage tree) and n_neighbors=30 to create a local neighborhood size of the nearest 30 cells. Larger neighborhood sizes can result in more robust detection of correlations and autocorrelations at a cost of missing more fine-grained, smaller-scale patterns.

[9]:
# Create the Hotspot object and the neighborhood graph

hs = hotspot.Hotspot(adata, model='normal', tree=tree)

hs.create_knn_graph(
    weighted_graph=False, n_neighbors=30,
)
100%|██████████| 1756/1756 [00:21<00:00, 82.89it/s]

Determining genes with heritable variation#

Now we compute autocorrelations for each gene, using the lineage metric, to determine which genes have the most heritable variation.

[10]:
hs_results = hs.compute_autocorrelations(jobs=1)

hs_results.head(15)
100%|██████████| 12440/12440 [00:13<00:00, 948.13it/s]
[10]:
C Z Pval FDR
Gene
Rhox9 0.396389 67.273645 0.000000e+00 0.000000e+00
Gjb3 0.332650 59.970016 0.000000e+00 0.000000e+00
Ldoc1 0.300900 52.505884 0.000000e+00 0.000000e+00
Utf1 0.302944 51.278431 0.000000e+00 0.000000e+00
Gm9 0.293280 51.029026 0.000000e+00 0.000000e+00
Rhox6 0.277690 48.984987 0.000000e+00 0.000000e+00
Tfap2c 0.273967 44.759350 0.000000e+00 0.000000e+00
Pramef12 0.239486 43.962746 0.000000e+00 0.000000e+00
S100g 0.226116 40.532593 0.000000e+00 0.000000e+00
Apom 0.218111 39.612019 0.000000e+00 0.000000e+00
Fthl17a 0.225255 39.017489 0.000000e+00 0.000000e+00
Rbp4 0.217138 38.926022 0.000000e+00 0.000000e+00
Apoa1 0.214174 38.872021 0.000000e+00 0.000000e+00
Ttr 0.210956 38.213292 0.000000e+00 0.000000e+00
Rhox5 0.210522 37.540306 1.013971e-308 8.409198e-306

Grouping genes into lineage-based modules#

To get a better idea of what heritable expression patterns exist, it is helpful to group the genes into modules.

Hotspot does this using the concept of “local correlations” - that is, correlations that are computed between genes between cells in the same neighborhood.

Here we avoid running the calculation for all Genes x Genes pairs and instead only run this on genes that have significant lineage autocorrelation to begin with.

The method compute_local_correlations returns a Genes x Genes matrix of Z-scores for the significance of the correlation between genes. This object is also retained in the hs object and is used in the subsequent steps.

[11]:
# Select the genes with significant lineage autocorrelation
hs_genes = hs_results.index[hs_results.FDR < 0.05]

# Compute pair-wise local correlations between these genes
lcz = hs.compute_local_correlations(hs_genes, jobs=2)

Computing pair-wise local correlation on 2094 features...
100%|██████████| 2094/2094 [00:00<00:00, 2804.45it/s]
100%|██████████| 2191371/2191371 [09:51<00:00, 3703.71it/s]

Now that pair-wise local correlations are calculated, we can group genes into modules.

To do this, a convenience method is included create_modules which performs agglomerative clustering with two caveats:

  • If the FDR-adjusted p-value of the correlation between two branches exceeds fdr_threshold, then the branches are not merged.

  • If two branches are two be merged and they are both have at least min_gene_threshold genes, then the branches are not merged. Further genes that would join to the resulting merged module (and are therefore ambiguous) either remain unassigned (if core_only=True) or are assigned to the module with the smaller average correlations between genes, i.e. the least-dense module (if core_only=False)

The output is a Series that maps gene to module number. Unassigned genes are indicated with a module number of -1

This method was used to preserved substructure (nested modules) while still giving the analyst some control. However, since there are a lot of ways to do hierarchical clustering, you can also manually cluster using the gene-distances in hs.local_correlation_z

[12]:
modules = hs.create_modules(
    min_gene_threshold=50, core_only=True, fdr_threshold=0.05
)

modules.value_counts()
[12]:
 1    772
-1    486
 3    476
 2    209
 5     89
 4     62
Name: Module, dtype: int64

Plotting module correlations#

A convenience method is supplied to plot the results of hs.create_modules

[13]:
hs.plot_local_correlations()
_images/Lineage_Tutorial_19_0.png

To explore individual genes, we can look at the genes with the top autocorrelation in a given module as these are likely the most informative.

[14]:
# Show the top genes for a module

module = 3

results = hs.results.join(hs.modules)
results = results.loc[results.Module == module]
results.sort_values('Z', ascending=False).head(10)
[14]:
C Z Pval FDR Module
Gene
Marcksl1 0.140338 25.935066 1.340097e-148 2.976930e-146 3.0
Tuba1a 0.137344 25.162903 5.104844e-140 1.058404e-137 3.0
Tmsb10 0.118217 21.505837 6.864326e-103 1.054225e-100 3.0
Ybx1 0.122225 20.959820 7.633227e-98 1.130445e-95 3.0
H3f3a 0.116649 20.680113 2.615923e-95 3.828481e-93 3.0
Fabp5 0.112799 20.296989 6.835413e-92 9.046015e-90 3.0
2810417H13Rik 0.112131 20.175330 8.064243e-91 1.055991e-88 3.0
Ifitm2 0.105543 19.810383 1.211183e-87 1.537460e-85 3.0
Mest 0.101794 19.179827 2.728093e-82 3.201649e-80 3.0
Stmn1 0.107170 18.909426 4.769431e-80 5.493677e-78 3.0

To get an idea of what lineage patterns the module is referencing, we can plot the kernel values

[16]:
# Load Kernel values
import gzip
with gzip.open("GSE122187_CellStateKernels.xls.gz") as fin:
    kernels = pd.read_excel(fin, skiprows=1).set_index('Gene')

kernels = kernels.loc[
    :, kernels.columns.sort_values()
]

kernel_name_map = {
    0: "trophoblast stem cells",
    1: "neural ectoderm anterior",
    2: "primitive streak late",
    3: "anterior primitive streak",
    4: "primitive/definitive endoderm",
    5: "allantois",
    6: "secondary heart field/splanchnic lateral plate",
    7: "gut endoderm",
    8: "ectoderm early 1",
    9: "primitive blood early",
    10: "preplacodal ectoderm",
    11: "neural ectoderm posterior",
    12: "posterior lateral plate mesoderm",
    13: "hematopoietic/endothelial progenitors",
    14: "parietal endoderm",
    15: "amnion mesoderm early",
    16: "surface ectoderm",
    17: "epiblast",
    18: "somites",
    19: "ectoderm early 2",
    20: "splanchnic lateral plate/anterior paraxial mesoderm",
    21: "primitive heart tube",
    22: "primitive blood late",
    23: "notochord",
    24: "fore/midbrain",
    25: "distal extraembryonic ectoderm",
    26: "neuromesodermal progenitor early",
    27: "primordial germ cells",
    28: "differentiated trophoblasts",
    29: "visceral endoderm early",
    30: "presomitic mesoderm",
    31: "neuromesodermal progenitor late",
    32: "angioblasts",
    33: "neural crest",
    34: "pharyngeal arch mesoderm",
    35: "similar to neural crest",
    36: "primitive blood progenitors",
    37: "primitive streak early",
    38: "node",
    39: "future spinal cord",
    40: "visceral endoderm late",
    41: "amnion mesoderm late",
}

kernel_name_map = {k: v.capitalize() for k, v in kernel_name_map.items()}

kernels.columns = [kernel_name_map[x] for x in kernels.columns]

# Standardize each gene
kernels_z = kernels \
    .subtract(kernels.mean(axis=1), axis=0) \
    .divide(kernels.std(axis=1), axis=0)


# Plot the kernel values for the genes in each module
mods = hs.modules.unique()
mods = [x for x in mods if x != -1 and not np.isnan(x)]
mods = sorted(mods)

sizes = [(hs.modules == x).sum() for x in mods]
sizes = [np.log10(x) for x in sizes]

fig, axs = plt.subplots(
    len(mods), 1, gridspec_kw=dict(height_ratios=sizes, right=.8),
    figsize=(9, 12.5)
)

cbax = fig.add_axes([.85, .15, .015, .1])

for ax, mod in zip(axs.ravel(), mods):

    plt.sca(ax)
    mod_genes = hs.modules.index[hs.modules == mod]
    mod_genes = pd.Index(mod_genes) & kernels.index

    last_plot = mod == max(mods)

    if last_plot:
        cbar=True
        cbar_ax=cbax
    else:
        cbar=False
        cbar_ax=None

    sns.heatmap(kernels_z.loc[mod_genes], vmin=-2, vmax=2,
                cmap="RdBu_r", ax=ax, xticklabels=True, rasterized=True,
                cbar=cbar, cbar_ax=cbar_ax)
    plt.ylabel(int(mod))

    plt.yticks([])

    if last_plot:
        ax.tick_params(labelsize=9)
    else:
        plt.xticks([])

plt.tight_layout()

_images/Lineage_Tutorial_23_0.png

Summary Module Scores#

To aid in the recognition of the general behavior of a module, Hotspot can compute aggregate module scores.

[17]:
module_scores = hs.calculate_module_scores()

module_scores.head()
Computing scores for 5 modules...
100%|██████████| 5/5 [00:06<00:00,  1.24s/it]
[17]:
1 2 3 4 5
AAACCTGAGAATAGGG-1 -6.104505 -0.340111 5.592632 1.621748 -1.928659
AAACCTGAGACCCACC-1 -7.555633 -0.340970 7.487871 1.312326 -0.550077
AAACCTGAGCAGCGTA-1 3.669905 -0.920127 -2.848117 -0.541240 -0.420063
AAACCTGAGCCCAATT-1 2.893500 -0.474523 -2.123254 -0.329773 -0.943885
AAACCTGAGTACGACG-1 5.467003 -0.505570 -2.780876 -0.646889 -1.358990

Here we can visualize these module scores by plotting them over a UMAP of the cells

First we’ll compute the UMAP

[18]:
adata.raw = adata
[19]:
sc.pp.scale(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Next, let’s use the Kernels to assign every cell to its closest profile

[20]:
from scipy.spatial.distance import cdist

common = adata.var_names & kernels.index
dd = cdist(adata.raw[:, common].X.A, kernels.loc[common].T, metric='cosine')
dd = pd.DataFrame(dd, index=adata.obs_names, columns=kernels.columns)
assignment = dd.idxmin(axis=1).loc[adata.obs_names]
[22]:
adata.obs["assignment"] = assignment
fig, ax = plt.subplots(figsize=(6, 6))
sc.pl.umap(adata, color="assignment", frameon=False, ax=ax)
... storing 'assignment' as categorical
_images/Lineage_Tutorial_31_1.png
[23]:
module_cols = []
for c in module_scores.columns:
    key = f"Module {c}"
    adata.obs[key] = module_scores[c]
    module_cols.append(key)

Finally, we plot the module scores on top of the UMAP for comparison

[24]:
with mplscience.style_context():
    sc.pl.umap(adata, color=module_cols, frameon=False, vmin=-1, vmax=1)
_images/Lineage_Tutorial_34_0.png