Note

This page was generated from Lineage_Tutorial.ipynb. Interactive online version: .

Demo: Lineage-recorded data from Chan et al, (Nature 2019)#

This demo uses data from:

Molecular recording of mammalian embryogenesis

Chan et al (Nature 2019)
https://doi.org/10.1038/s41586-019-1184-5

For this demo, you need the expression data (count matrix) for Embryo3 which can be found in the NCBI GEO database under accession GSM3302829

Specifically these three files:

GSM3302829_embryo3_SeuratBarcodes.tsv.gz
GSM3302829_embryo3_SeuratGenes.tsv.gz
GSM3302829_embryo3_SeuratNorm.mtx.gz

The cell-type kernels (for interpreting lineage groups) are located at GSE122187

GSE122187_CellStateKernels.xls.gz

Additionally, a reconstructed lineage tree is provided with this notebook (same directory) as made by Cassiopeia

0726_E2-2_tree_greedy_priors.processed.txt

Covered Here#

Data preprocessing and filtering
Computing Autocorrelation in Hotspot to identify lineage-related genes
Computing local correlations between lineage genes to identify heritable modules
Plotting modules, correlations, and module scores

[1]:

import sys

#if branch is stable, will install via pypi, else will install from source
branch = "anndata"
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install --quiet --upgrade numba
    !pip install --quiet git+https://github.com/yoseflab/hotspot@$branch#egg=hotspot
    !pip install --quiet scanpy
    !pip install --quiet mplscience
    !pip install --quiet ete3
    !pip3 install --upgrade xlrd==1.2.0

Requirement already satisfied: xlrd==1.2.0 in /usr/local/lib/python3.7/dist-packages (1.2.0)

[2]:

import warnings; warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import hotspot
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns
import mplscience

from scipy.io import mmread
from scipy.sparse import csr_matrix

import scanpy as sc

[3]:

%%capture
if IN_COLAB:
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratBarcodes.tsv.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratGenes.tsv.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3302nnn/GSM3302829/suppl/GSM3302829_embryo3_SeuratNorm.mtx.gz
    !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE122nnn/GSE122187/suppl/GSE122187_CellStateKernels.xls.gz

[4]:

counts_raw = mmread("GSM3302829_embryo3_SeuratNorm.mtx.gz")
counts_raw = csr_matrix(counts_raw)
barcodes = pd.read_table("GSM3302829_embryo3_SeuratBarcodes.tsv.gz", header=None)[0]
barcodes = [x+'-1' for x in barcodes] # to match the newick file
genes = pd.read_table("GSM3302829_embryo3_SeuratGenes.tsv.gz", header=None)[0]

[5]:

%%capture
import gdown
url = "https://drive.google.com/uc?id=1FlrryhNv9Khwo_BstUTPCBOVaMgZLZ9N"
output = 'tree.txt'
gdown.download(url, output, quiet=False)

[6]:

# Load the tree and enumerate the leaves
from ete3 import Tree
tree = Tree("tree.txt", format=1)
leaves = set()
for tn in tree.traverse('postorder'):
    if tn.is_leaf():
        leaves.add(tn.name)
len(leaves)

[6]:

[7]:

# Subset the count matrix to only the cells where the lineage was recoverable
is_valid = [x in leaves for x in barcodes]
is_valid_indices = np.nonzero(is_valid)[0]
valid_barcodes = [barcodes[i] for i in is_valid_indices]

[8]:

adata = sc.AnnData(counts_raw.transpose())
adata.obs_names = barcodes
adata.var_names = genes

adata = adata[valid_barcodes]
sc.pp.filter_genes(adata, min_cells=10)

Trying to set attribute `.var` of view, copying.

Creating the Hotspot object#

To start an analysis, first create the hotspot object When creating the object, you need to specify:

The gene counts matrix
Which background model to use
The latent space we are using to compute our cell metric
- Here we use the inferred cell tree
The per-cell scaling factor
- Here we use the number of umi per barcode

In this case, only the log-normalized counts are made available. In the Hotspot publication, we used the raw counts and the negative binomial (‘danb’) model. However, to make this example easier to run, here we just supply the log-normalized counts and select the ‘normal’ model.

Once the object is created, the neighborhood is then computed with create_knn_graph

The two options that are specificied are n_neighbors which determines the size of the neighborhood, and weighted_graph.

Here we set weighted_graph=False to just use binary, 0-1 weights (only binary weights are supported when using a lineage tree) and n_neighbors=30 to create a local neighborhood size of the nearest 30 cells. Larger neighborhood sizes can result in more robust detection of correlations and autocorrelations at a cost of missing more fine-grained, smaller-scale patterns.

[9]:

# Create the Hotspot object and the neighborhood graph

hs = hotspot.Hotspot(adata, model='normal', tree=tree)

hs.create_knn_graph(
    weighted_graph=False, n_neighbors=30,
)

100%|██████████| 1756/1756 [00:21<00:00, 82.89it/s]

Determining genes with heritable variation#

Now we compute autocorrelations for each gene, using the lineage metric, to determine which genes have the most heritable variation.

[10]:

hs_results = hs.compute_autocorrelations(jobs=1)

hs_results.head(15)

100%|██████████| 12440/12440 [00:13<00:00, 948.13it/s]

[10]:

	C	Z	Pval	FDR
Gene
Rhox9	0.396389	67.273645	0.000000e+00	0.000000e+00
Gjb3	0.332650	59.970016	0.000000e+00	0.000000e+00
Ldoc1	0.300900	52.505884	0.000000e+00	0.000000e+00
Utf1	0.302944	51.278431	0.000000e+00	0.000000e+00
Gm9	0.293280	51.029026	0.000000e+00	0.000000e+00
Rhox6	0.277690	48.984987	0.000000e+00	0.000000e+00
Tfap2c	0.273967	44.759350	0.000000e+00	0.000000e+00
Pramef12	0.239486	43.962746	0.000000e+00	0.000000e+00
S100g	0.226116	40.532593	0.000000e+00	0.000000e+00
Apom	0.218111	39.612019	0.000000e+00	0.000000e+00
Fthl17a	0.225255	39.017489	0.000000e+00	0.000000e+00
Rbp4	0.217138	38.926022	0.000000e+00	0.000000e+00
Apoa1	0.214174	38.872021	0.000000e+00	0.000000e+00
Ttr	0.210956	38.213292	0.000000e+00	0.000000e+00
Rhox5	0.210522	37.540306	1.013971e-308	8.409198e-306

Grouping genes into lineage-based modules#

To get a better idea of what heritable expression patterns exist, it is helpful to group the genes into modules.

Hotspot does this using the concept of “local correlations” - that is, correlations that are computed between genes between cells in the same neighborhood.

Here we avoid running the calculation for all Genes x Genes pairs and instead only run this on genes that have significant lineage autocorrelation to begin with.

The method compute_local_correlations returns a Genes x Genes matrix of Z-scores for the significance of the correlation between genes. This object is also retained in the hs object and is used in the subsequent steps.

[11]:

# Select the genes with significant lineage autocorrelation
hs_genes = hs_results.index[hs_results.FDR < 0.05]

# Compute pair-wise local correlations between these genes
lcz = hs.compute_local_correlations(hs_genes, jobs=2)

Computing pair-wise local correlation on 2094 features...

100%|██████████| 2094/2094 [00:00<00:00, 2804.45it/s]
100%|██████████| 2191371/2191371 [09:51<00:00, 3703.71it/s]

Now that pair-wise local correlations are calculated, we can group genes into modules.

To do this, a convenience method is included create_modules which performs agglomerative clustering with two caveats:

If the FDR-adjusted p-value of the correlation between two branches exceeds fdr_threshold, then the branches are not merged.
If two branches are two be merged and they are both have at least min_gene_threshold genes, then the branches are not merged. Further genes that would join to the resulting merged module (and are therefore ambiguous) either remain unassigned (if core_only=True) or are assigned to the module with the smaller average correlations between genes, i.e. the least-dense module (if core_only=False)

The output is a Series that maps gene to module number. Unassigned genes are indicated with a module number of -1

This method was used to preserved substructure (nested modules) while still giving the analyst some control. However, since there are a lot of ways to do hierarchical clustering, you can also manually cluster using the gene-distances in hs.local_correlation_z

[12]:

modules = hs.create_modules(
    min_gene_threshold=50, core_only=True, fdr_threshold=0.05
)

modules.value_counts()

[12]:

 1    772
-1    486
 3    476
 2    209
 5     89
 4     62
Name: Module, dtype: int64

Plotting module correlations#

A convenience method is supplied to plot the results of hs.create_modules

[13]:

hs.plot_local_correlations()

To explore individual genes, we can look at the genes with the top autocorrelation in a given module as these are likely the most informative.

[14]:

# Show the top genes for a module

module = 3

results = hs.results.join(hs.modules)
results = results.loc[results.Module == module]
results.sort_values('Z', ascending=False).head(10)

[14]:

	C	Z	Pval	FDR	Module
Gene
Marcksl1	0.140338	25.935066	1.340097e-148	2.976930e-146	3.0
Tuba1a	0.137344	25.162903	5.104844e-140	1.058404e-137	3.0
Tmsb10	0.118217	21.505837	6.864326e-103	1.054225e-100	3.0
Ybx1	0.122225	20.959820	7.633227e-98	1.130445e-95	3.0
H3f3a	0.116649	20.680113	2.615923e-95	3.828481e-93	3.0
Fabp5	0.112799	20.296989	6.835413e-92	9.046015e-90	3.0
2810417H13Rik	0.112131	20.175330	8.064243e-91	1.055991e-88	3.0
Ifitm2	0.105543	19.810383	1.211183e-87	1.537460e-85	3.0
Mest	0.101794	19.179827	2.728093e-82	3.201649e-80	3.0
Stmn1	0.107170	18.909426	4.769431e-80	5.493677e-78	3.0

To get an idea of what lineage patterns the module is referencing, we can plot the kernel values

[16]:

# Load Kernel values
import gzip
with gzip.open("GSE122187_CellStateKernels.xls.gz") as fin:
    kernels = pd.read_excel(fin, skiprows=1).set_index('Gene')

kernels = kernels.loc[
    :, kernels.columns.sort_values()
]

kernel_name_map = {
    0: "trophoblast stem cells",
    1: "neural ectoderm anterior",
    2: "primitive streak late",
    3: "anterior primitive streak",
    4: "primitive/definitive endoderm",
    5: "allantois",
    6: "secondary heart field/splanchnic lateral plate",
    7: "gut endoderm",
    8: "ectoderm early 1",
    9: "primitive blood early",
    10: "preplacodal ectoderm",
    11: "neural ectoderm posterior",
    12: "posterior lateral plate mesoderm",
    13: "hematopoietic/endothelial progenitors",
    14: "parietal endoderm",
    15: "amnion mesoderm early",
    16: "surface ectoderm",
    17: "epiblast",
    18: "somites",
    19: "ectoderm early 2",
    20: "splanchnic lateral plate/anterior paraxial mesoderm",
    21: "primitive heart tube",
    22: "primitive blood late",
    23: "notochord",
    24: "fore/midbrain",
    25: "distal extraembryonic ectoderm",
    26: "neuromesodermal progenitor early",
    27: "primordial germ cells",
    28: "differentiated trophoblasts",
    29: "visceral endoderm early",
    30: "presomitic mesoderm",
    31: "neuromesodermal progenitor late",
    32: "angioblasts",
    33: "neural crest",
    34: "pharyngeal arch mesoderm",
    35: "similar to neural crest",
    36: "primitive blood progenitors",
    37: "primitive streak early",
    38: "node",
    39: "future spinal cord",
    40: "visceral endoderm late",
    41: "amnion mesoderm late",
}

kernel_name_map = {k: v.capitalize() for k, v in kernel_name_map.items()}

kernels.columns = [kernel_name_map[x] for x in kernels.columns]

# Standardize each gene
kernels_z = kernels \
    .subtract(kernels.mean(axis=1), axis=0) \
    .divide(kernels.std(axis=1), axis=0)


# Plot the kernel values for the genes in each module
mods = hs.modules.unique()
mods = [x for x in mods if x != -1 and not np.isnan(x)]
mods = sorted(mods)

sizes = [(hs.modules == x).sum() for x in mods]
sizes = [np.log10(x) for x in sizes]

fig, axs = plt.subplots(
    len(mods), 1, gridspec_kw=dict(height_ratios=sizes, right=.8),
    figsize=(9, 12.5)
)

cbax = fig.add_axes([.85, .15, .015, .1])

for ax, mod in zip(axs.ravel(), mods):

    plt.sca(ax)
    mod_genes = hs.modules.index[hs.modules == mod]
    mod_genes = pd.Index(mod_genes) & kernels.index

    last_plot = mod == max(mods)

    if last_plot:
        cbar=True
        cbar_ax=cbax
    else:
        cbar=False
        cbar_ax=None

    sns.heatmap(kernels_z.loc[mod_genes], vmin=-2, vmax=2,
                cmap="RdBu_r", ax=ax, xticklabels=True, rasterized=True,
                cbar=cbar, cbar_ax=cbar_ax)
    plt.ylabel(int(mod))

    plt.yticks([])

    if last_plot:
        ax.tick_params(labelsize=9)
    else:
        plt.xticks([])

plt.tight_layout()

Summary Module Scores#

To aid in the recognition of the general behavior of a module, Hotspot can compute aggregate module scores.

[17]:

module_scores = hs.calculate_module_scores()

module_scores.head()

Computing scores for 5 modules...

100%|██████████| 5/5 [00:06<00:00,  1.24s/it]

[17]:

	1	2	3	4	5
AAACCTGAGAATAGGG-1	-6.104505	-0.340111	5.592632	1.621748	-1.928659
AAACCTGAGACCCACC-1	-7.555633	-0.340970	7.487871	1.312326	-0.550077
AAACCTGAGCAGCGTA-1	3.669905	-0.920127	-2.848117	-0.541240	-0.420063
AAACCTGAGCCCAATT-1	2.893500	-0.474523	-2.123254	-0.329773	-0.943885
AAACCTGAGTACGACG-1	5.467003	-0.505570	-2.780876	-0.646889	-1.358990

Here we can visualize these module scores by plotting them over a UMAP of the cells

First we’ll compute the UMAP

[18]:

adata.raw = adata

[19]:

sc.pp.scale(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Next, let’s use the Kernels to assign every cell to its closest profile

[20]:

from scipy.spatial.distance import cdist

common = adata.var_names & kernels.index
dd = cdist(adata.raw[:, common].X.A, kernels.loc[common].T, metric='cosine')
dd = pd.DataFrame(dd, index=adata.obs_names, columns=kernels.columns)
assignment = dd.idxmin(axis=1).loc[adata.obs_names]

[22]:

adata.obs["assignment"] = assignment
fig, ax = plt.subplots(figsize=(6, 6))
sc.pl.umap(adata, color="assignment", frameon=False, ax=ax)

... storing 'assignment' as categorical

[23]:

module_cols = []
for c in module_scores.columns:
    key = f"Module {c}"
    adata.obs[key] = module_scores[c]
    module_cols.append(key)

Finally, we plot the module scores on top of the UMAP for comparison

[24]:

with mplscience.style_context():
    sc.pl.umap(adata, color=module_cols, frameon=False, vmin=-1, vmax=1)

Demo: Lineage-recorded data from Chan et al, (Nature 2019)

Contents

Demo: Lineage-recorded data from Chan et al, (Nature 2019)#

Covered Here#

Creating the Hotspot object#

Determining genes with heritable variation#

Grouping genes into lineage-based modules#

Plotting module correlations#

Summary Module Scores#