Abstract
Cancer is a highly heterogeneous disease, where phenotypically distinct
subpopulations coexist and can be primed to different fates. Both
genetic and epigenetic factors may drive cancer evolution, however
little is known about whether and how such a process is pre-encoded in
cancer clones. Using single-cell multi-omic lineage tracing and
phenotypic assays, we investigate the predictive features of either
tumour initiation or drug tolerance within the same cancer population.
Clones primed to tumour initiation in vivo display two distinct
transcriptional states at baseline. Remarkably, these states share a
distinctive DNA accessibility profile, highlighting an epigenetic basis
for tumour initiation. The drug tolerant niche is also largely
pre-encoded, but only partially overlaps the tumour-initiating one and
evolves following two genetically and transcriptionally distinct
trajectories. Our study highlights coexisting genetic, epigenetic and
transcriptional determinants of cancer evolution, unravelling the
molecular complexity of pre-encoded tumour phenotypes.
Subject terms: Cancer, Computational biology and bioinformatics, Gene
expression, Epigenomics
__________________________________________________________________
Here, using single-cell omics and genomic barcoding, the authors
identify transcriptional states and DNA accessibility profiles linked
to tumour initiation and drug tolerance, highlighting the complexity of
cancer. This study suggests that cancer evolution is driven by
pre-encoded factors.
Introduction
Cancer adopts evolutionary pathways that sustain the disease.
Aggressive tumour behaviours, such as the dissemination to distant
organs, diminished susceptibility to treatment, and disease relapse,
result from either selection or adaptation processes, possibly
intertwined^[46]1. When a selective process occurs, the fate of a
cancer clone is determined at the root of the evolutionary process. In
this case, the heterogeneity of tumour phenotypes can, at least in
principle, be identified ahead of selection^[47]2. The pre-existence of
aggressive phenotypes has been linked to the so-called cancer stem cell
(CSC) theory^[48]3 and observed in leukaemia^[49]4,[50]5 and solid
tumours, such as colon^[51]6 and breast cancer^[52]7,[53]8, as well as
glioma^[54]9,[55]10. According to such a model, tumour cells are not
all equal, instead a stem-like cancer niche exists that is primed to
sustain most of the aggressive phenotypes, such as tumour
re-initiation, metastatic dissemination potential, and capacity to
survive cytotoxic treatments^[56]11.
Predicting cancer phenotypes requires linking the molecular state of a
clone to its fate with high precision. Without a priori information,
tumour phylogeny can be inferred from somatic mutations^[57]12–[58]15;
however, this approach is limited by the high sparsity of single-cell
data. Single-cell lineage tracing consists in inserting barcodes in the
genome of the cells with the aim of tracing their
progeny^[59]16–[60]19. In cancer, this approach has been used to
investigate clonality in metastases^[61]20, survival upon cytotoxic
treatment^[62]21,[63]22, as well as to dissect the clonal origin of the
primary tumour and metastasis growth^[64]23–[65]26, possibly in
vivo^[66]27. However, these studies mainly focus on the evolutionary
trajectories, rather than on the driving molecular features of
pre-existing phenotypes.
Tumour evolutionary diversity can have either a genetic or non-genetic
origin^[67]28,[68]29. Single-cell multi-omics has recently emerged as a
promising tool to study cancer evolution^[69]30. Here, we combine
single-cell multi-omics with lineage tracing in a unique framework,
which allows simultaneous clonal, gene expression, and chromatin
accessibility profiling at single-cell resolution. Using phenotypic
assays on barcoded cells, we identify the clones endowed with
aggressive cancer behaviours typical of the stem-like cancer niche,
specifically tumour-initiating capacity and drug tolerance.
Subsequently, we extract robust transcriptional, epigenetic, and
genetic features of naïve cells and associate them to clonal
subpopulations. By integrating these multiple layers of information, we
identify the regulatory elements that predict cancer evolution in
response to adverse environmental conditions. Finally, by tracing the
transcriptional changes of clones across time, we unravel the role of
pre-existing molecular features in shaping the differentiation breadth
of stem-like subpopulations.
Results
SUM159PT exhibits high transcriptional plasticity and comprises three
distinct subpopulations: S1, S2, and S3
To investigate the potential of cancer cells to promote tumour
initiation and escape cytotoxic treatment, we combined single-cell
sequencing with phenotypic assays. We selected SUM159PT, a
triple-negative breast cancer cell line (TNBC), as a model system.
SUM159PT belongs to the claudin-low mesenchymal subtype^[70]31 and is
characterised by (i) a nearly diploid genotype, but bearing a specific
set of mutations typically associated with TNBC (HRAS, PIK3CA, TP53 and
MYC amplification^[71]32); (ii) an intrinsic heterogeneity, with an
underlying variability in the expression of epithelial and mesenchymal
genes and a small proportion of cells in a CSC state^[72]33,[73]34; and
(iii) an aggressive phenotype driven by the CSC component, which is
highly tumourigenic and invasive in vivo^[74]33–[75]35. A single-cell
lineage tracing approach was used to link the molecular state of a cell
to its fate (Fig. [76]1a and Supplementary Fig. [77]1a). To obtain
∼10,000 distinct genetic barcodes (GBC), 100,000 SUM159PT cells were
infected with a lentiviral pool at a multiplicity of infection
(MOI) = 0.1 and subsequently FAC-sorted to retain only the transduced
fraction^[78]18. Endogenous as well as GBC-carrying transcripts were
then captured by single-cell RNA-seq (scRNA-seq). The parental
population was sampled and processed at two time points, T0 and T1,
separated by 13–15 days (Fig. [79]1b). At basal state, between 5017 and
5996 unique clones were found in the two replicates and between 83% and
88% of high-quality cells were assigned a clone identity (Supplementary
Fig. [80]1b and Supplementary Data [81]1), making the lineage of even
rare cell subpopulations accessible to analysis. The distribution of
clones at the two time points was similar, highlighting that no
spontaneous clone selection occurs in the timeframe (Supplementary
Fig. [82]1c). Moreover, 68% and 57% of the clones respectively detected
in T0 and T1 were shared between the two time points, with >50% of
clones constituted by a single cell in each sample (Supplementary
Fig. [83]1d and Supplementary Data [84]1). When evaluating the
relationship between clonality and gene expression profile at basal
state, cells stemming from a common clone at the moment of infection,
hereafter sister cells, were on average only slightly more similar to
one another compared to non-sisters (Fig. [85]1c). We next asked
whether the transcriptional similarity between sister cells is
clone-specific—in other words, whether some clones show a distinctive
gene expression profile and other clones are more plastic [a similar
approach is proposed in ref. ^[86]17]. We detected seven distinct gene
expression clusters in T0 and T1, respectively (Fig. [87]1d,
Supplementary Fig. [88]1e, f and Supplementary Data [89]1) and compared
the clone content of every cluster pair across the two time points
using a clone sharedness score (see section “Methods” and Fig. [90]1e).
Most clones that clustered together in T0 were mapped to multiple
distinct clusters in T1, and vice versa, suggesting a high
transcriptional plasticity already at baseline. In contrast, three
cluster pairs in T0 and T1, respectively, comprising 28% and 23% of the
cells at the two time points, showed mutually high clone sharedness
(see section “Methods” and Fig. [91]1f); we conclude that these
subpopulations are transcriptionally stable and we will refer to them
as S1, S2, and S3 hereafter. They respectively comprise 3.6%, 14.7%,
and 7.4% cells on average. We obtained a gene expression signature for
each of them (Fig. [92]1g and Supplementary Data [93]2) that is
independent of cell-culture effect. Of note, S1 was enriched in genes
involved in collagen processing and matrix remodelling (Supplementary
Fig. [94]1g and Supplementary Data [95]3). S1 cells showed upregulation
of S100A4, a gene associated with metastatic
behaviours^[96]20,[97]36,[98]37, and TM4SF1, whose role in promoting
cell proliferation and invasion in epithelial tumours has been
assessed^[99]38–[100]41. The microRNA-205 host gene (MIR205HG) was
found as S2-specific and has been associated to basal cells^[101]42,
epithelial-to-mesenchymal (EMT) transition, and multiple cancer
diseases^[102]43,[103]44. The oncogene HMGA1 is part of the S2
signature and has been associated to the TNBC subtype^[104]45. S3 was
distinguished by the expression of FEZ1, a microtubule adaptor^[105]46,
and RPS25, a gene acting on cellular response to stress by
downregulating p53^[106]47. In conclusion, single-cell lineage tracing
revealed that SUM159PT exhibits high transcriptional plasticity, but
comprises three distinct, transcriptionally stable subpopulations.
Fig. 1. Lineage tracing identifies transcriptionally stable TNBC cell
subpopulations.
[107]Fig. 1
[108]Open in a new tab
a SUM159PT cells were infected with a lentiviral library of unique
barcodes (Perturb-seq GBC library) at 0.1 multiplicity of infection
(MOI). The readout of each cell is its lineage (genetic barcodes) and
gene expression profile (3′-end cDNA sequencing). Clone information is
overlaid on single-cell gene expression space. b Experimental design,
passive propagation. Top: barcoded SUM159PT cells from the same
infection experiment were processed by scRNA-seq at two passages (T0
and T1, n = 2 replicates each). Bottom: number of detected clones and
cells assigned to clones. c Gaussian kernel density of Euclidean
distances between sister cells (solid line) and non-sister cells
(dashed line) computed on a joint T0 and T1 space (see section
“Methods”). d UMAP representation of T0 (9395 cells) and T1 (13,562
cells) coloured by cluster; only cells assigned to clones are shown. e
Definition of clone sharedness score between clusters i and j. f Left:
rows are clusters in T0 (as in d), columns are clusters in T1, and
entries are clone sharedness scores for each pair. Rows and columns are
ordered by higher to lower scores. The three top-scoring pairs,
referred to as subpopulations (S1, S2, S3), are shown on UMAP (right).
g Subpopulation gene signatures. Rows are the 25 genes in the
subpopulation signature showing the highest log[2](FC) in T0, columns
are subpopulations split by time point, and entries are log[2](FC)
values between a subpopulation and its complement at the same time
point. The top 15 (S1) and top 4 (S2, S3) genes are labelled. The
surface marker TM4SF1, highlighted in red, is used for sorting the S1
subpopulation. h Stratification of breast cancer samples into molecular
subtypes by subpopulation signature activity in TCGA (top) and METABRIC
(bottom) datasets (NC not classified). i Association with tumour
meta-programmes from the Curated Cancer Cell Atlas. The columns are the
cells at T0 ordered by non-decreasing module score, computed on the
union of the top 50 significant genes of the meta-programme (in rows);
genes in S1 and S2 signatures are labelled. The bar plots show the
binned cell count for each subpopulation [a, b created with
Biorender.com released under a Creative Commons
Attribution-NonCommercial-NoDerivs 4.0 International license].
SUM159PT transcriptional heterogeneity is recapitulated in primary tumours
To assess the relevance of stable SUM159PT transcriptional programmes
in primary TNBC tumours, we leveraged the Molecular Taxonomy of Breast
Cancer International Consortium (METABRIC^[109]48) and The Cancer
Genome Atlas (TCGA^[110]49). Figure [111]1h reports the stratification
of breast cancer tumours according to high, medium and low gene
expression classes (see section “Methods”) in TCGA and METABRIC
datasets. S1 and S2 signatures were associated with the basal tumour
subtype (including claudin-low, which accounts for 9.8% of all tumours
in METABRIC) in both datasets (adj. p value < 0.001), whereas S1 and S3
with the claudin-low subtype (adj. p value < 0.001; Benjamini–Hochberg
correction), suggesting that the stable transcriptional programmes
identified in SUM159PT capture broad basal tumour features. We noted
that the S1 signature genes were organised into few distinct
co-expression blocks (Supplementary Fig. [112]2a), hinting that they
may be part of a network also in tumours. Therefore, we next evaluated
whether S1, S2, and S3 recapitulate intra-tumour heterogeneity in
scRNA-seq datasets from primary samples. In primary TNBC
tumours^[113]50, we could detect both S1 and S3 programmes and identify
S1^+ and S3^+ cell subsets accordingly (Supplementary Fig. [114]3a–c);
in particular, we detected a strong upregulation of S100A4 in the S1^+
subset (see Supplementary Fig. [115]3d and Supplementary Data [116]4).
Then, we leveraged the Curated Cancer Cell Atlas (3CA,
[117]https://www.weizmann.ac.il/sites/3CA/), containing scRNA-seq data
for over 1000 primary tumour samples from over 70 studies, along with
the associated gene meta-programmes^[118]51. We reasoned that a high
association of SUM159PT subpopulations with key tumour meta-programmes
would be a strong indication for the generalisability of the signatures
we defined. We noted that the aggregate meta-programme expression
across SUM159PT clusters was non-stochastic (Supplementary
Fig. [119]2b), suggesting a common pattern of gene expression
heterogeneity; specifically, the 3CA meta-programme “EMT-III” was
enriched in S1 cells, “Interferon/MHC-II (II)” both in S1 and S3 cells,
and “Translation initiation” in S2 cells (Fig. [120]1i and
Supplementary Fig. [121]2c), in agreement with pathway enrichment
analysis (see Supplementary Fig. [122]1f). “EMT-III” contains genes
involved in the maintenance of a hybrid EMT state and belonging to the
S1 signature (e.g., S100A4, TM4SF1, and LGALS3); this meta-programme is
recurrent across donors and cancer types, notably in breast^[123]51.
Finally, we performed scRNA-seq on the TNBC cell line MDA-MB-231 TGL.
Two of the seven clusters we detected in MDA-MB-231 TGL showed high
expression of top S1 signature genes (CDA, S100A4, LGALS3, COL6A1,
COL6A2; Supplementary Fig. [124]2d, e and Supplementary Data [125]5,
[126]6); importantly, these clusters also showed high “EMT-III”
meta-programme expression (Supplementary Fig. [127]2f, g). Taken
together, these results suggest that the transcriptionally stable
signatures of SUM159PT, notably S1, are recurrent in other TNBC models
and primary breast tumours, and are also shared across other cancer
types.
Cancer clones promote tumour initiation in a non-stochastic manner
To investigate the tumour-initiating capacity of SUM159PT, we
transplanted barcode-labelled cells into the mammary fat pads of nine
NSG (NOD/SCID/IL2Rγ[c]^−/−) immunodeficient mice and then evaluated the
barcode composition in each primary tumour. We isolated tumour cells
and extracted the genomic DNA (gDNA), which was then amplified and
sequenced (Fig. [128]2a). Noteworthy, the GBC count measured in bulk
(parental cells) recapitulates the actual clone abundance, measured as
the relative number of cells per clone in single-cell samples
(Supplementary Fig. [129]4a, b). Clone selection was heterogeneous
across tumours—a stochastic effect that we observed also in other
cohorts of tumours transplanted in mice (data not shown). Only between
3% and 33% of SUM159PT clones contributed to tumour formation
(Fig. [130]2c and Supplementary Data [131]7), showing a deep clone
selection. The size of clonal subpopulations greatly varied within a
tumour; on average, the top 1% abundant clones covered more than 50% of
the entire tumour mass (Supplementary Data [132]7), and this was not
merely a consequence of higher initial abundance (see below). This
picture suggests a variable tumour-initiation potential among surviving
clones in vivo. In epithelial cancers, the tumour-initiation potential
has been regarded as an intrinsic feature of cells, rather than a
feature acquired during tumour formation^[133]8. Consistently, we
observed that a limited set of clones was recurrent and covered a high
proportion of the tumour mass compared to sporadic ones (Fig. [134]2c).
To exclude bias in the detection of expanded clones, we considered GBC
abundance relative to pre-implantation (Fisher’s exact test, see
section “Methods”). In total, 138 clones were significantly more
abundant in at least 6 out of 9 tumours compared to their average
abundance at basal state. We refer to these as tumour-initiating clones
(TICs) hereafter (Fig. [135]2d). We conclude that the tumour-initiating
capacity of SUM159PT cells is largely pre-encoded.
Fig. 2. Tumour-initiating clones are recurrent and originate from S1 and S3
subpopulations.
[136]Fig. 2
[137]Open in a new tab
a Tumour-initiation assay. Barcoded SUM159PT cells were injected
orthotopically in NSG mice; gDNA from parental (n = 3) and tumours
(n = 9) were sequenced. b Left: clone count as number of distinct GBCs
(bounds of box: upper (q75) and lower (q25) quartiles; centre: median;
upper whisker: min{max(x), q75 + 1.5·IQR}; lower whisker: max{min(x),
q25 − 1.5·IQR}). Right: cumulative clone frequency, where GBCs are
ordered by non-increasing abundance. c Each graph refers to a tumour
and each dot is a clone; clones are grouped by the number of times they
are observed across tumours (x = k) and their frequency over the total
tumour size is shown on y. d Left: detection of tumour-initiating
clones (TIC) by comparison between clone abundance in tumour t in the
parental population. Right: fraction of clones (top) and relative clone
abundance (bottom), in parental and tumour samples, grouped by clone
class. e Mapping of TICs at parental state (T0). Top: UMAP
representation of T0 cells in gene expression space (TICs in blue).
Bottom: log-odds ratio of cluster assignment vs TIC labelling at T0. f
Association between T0 clusters and clone expansion in vivo. Top:
normalised cluster abundance in each tumour (unassigned clones in
grey). Bottom: TIC odds ratio in subpopulations. g Prospective
isolation of S1 cells by FAC-sorting with TM4SF1 antibody (see
Fig. [138]1g). Top: gate used for TM4SF1^high sorting. Bottom:
differentially expressed genes (RNA-seq) between TM4SF1^high and bulk
at days 0, 9, and 43 (n = 2 each). Entries are expression log[2](FC)
between conditions at the same time point. scRNA-seq and Multiome gene
signatures are highlighted in colour (shared genes in black) and the 20
top upregulated genes in TM4SF1^high cells labelled. h TM4SF1^high
cells are enriched for TICs. Top: TM4SF1^high or bulk cells are
injected orthotopically at different dilutions. Bottom: response and
average latency. i Tumour growth and disease-free survival. Top: growth
dynamics (in days) of each primary tumour derived from transplantation
of 100 cells (n = 7). Data are mean ± SEM. Asterisks mark the
significance two-sided, unpaired t-test (*p < 0.05, **p < 0.01,
***p < 0.01). Bottom: Kaplan–Meier curve reporting the time-dependent
appearance of primary tumours derived from injection of 100 cells
(Log-rank Mantel–Cox test) [a, h Created with Biorender.com released
under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0
International license].
The baseline programmes S1 and S3 predict tumour initiation
To determine which transcriptional states are primed to tumour
initiation, we traced TICs back to their parental population. TICs were
strongly associated with S1 and S3 transcriptomes at baseline, with the
two subpopulations showing a similarly strong enrichment (average odds
ratio 4.4 and 4.1 for S1 and S3, respectively; Fig. [139]2e,
Supplementary Fig. [140]4c and Supplementary Data [141]7). Both S1 and
S3 were transcriptionally stable in culture in a timeframe of 2 weeks,
as shown in Fig. [142]1f, suggesting that the gene expression profile
of TICs at baseline may be predictive of the phenotype. S1 and S3 gene
signatures partially overlap: 9 out of the 29 gene expression markers
of S3 are shared with S1, including some of the top ranked in S1
(COL1A1, NPTX2, NREP; see Fig. [143]1g). However, the two
subpopulations showed different gene expression patterns, with S1 being
clearly separated from all the other clusters in gene expression space
(average silhouette width 0.29 and 0.30, respectively; Supplementary
Fig. [144]1e), suggesting that TICs stem from two distinct
transcriptional programmes. We then asked whether the clones in S1 and
S3 differ in terms of their expansion potential in vivo. When the
subpopulation identity was mapped on tumours, clones in S1 and S3
highly contributed to the tumour mass, relative to their initial
abundance at baseline, compared to the other clones (Fig. [145]2f and
Supplementary Fig. [146]4d); upon transplant, the expansion rate of S1
was high compared to S3 (12.7-fold for S1 and 6.9-fold for S3, on
average; Supplementary Data [147]7). We profiled the transcriptome of
SUM159PT tumours by bulk RNA-Seq. Neither the S1 nor the S3 signatures
as a whole were upregulated in SUM159PT tumours with respect to the
parental population, including some among the top significant genes
(Supplementary Fig. [148]4e, f and Supplementary Data [149]8), hinting
that clones undergo transcriptional reprogramming upon transplant. This
change in gene expression profile is in line with the cancer stem-cell
hypothesis, where a small, stem-like, cell subpopulation exhibits both
tumourigenic and differentiation potential. Of note, S100A4, TM4SF1,
and LGALS3, three among the top significant genes in the S1 signature
belonging to the EMT-III meta-programme (see Fig. [150]1g and
Supplementary Fig. [151]2c), were upregulated in SUM159PT tumours
(log[2](FC) = 2.24, 3.40, 3.76 and adj. p value = 1.09e − 27,
2.85e − 34, 2.09e − 30, respectively). TIC signature gene expression
was persistent in metastatic pancreatic cancer mouse models^[152]20 and
in the pre-metastatic niche of lung adenocarcinoma^[153]52
(Supplementary Fig. [154]4g, h). Notably, the metastatic potential in
these tumour models has been associated with the adoption of late
hybrid EMT states and the activation RUNX2, a transcription factor
mediating extracellular matrix remodelling, in agreement with our
findings (see Fig. [155]1i and Supplementary Figs. [156]1g, [157]2g).
We conclude that the tumour-initiating niche of SUM159PT shares markers
across different cancer diseases and, although plastic, could be
partially reminiscent of its molecular state at baseline. To directly
verify the tumour-initiating potential of TICs, we searched for surface
markers for prospective isolation and identified transmembrane 4 L6
family member 1 (TM4SF1), 1 of the top 20 significant genes of the S1
signature and highly upregulated in SUM159PT tumours; we could not
identify any S3-specific surface marker. High TM4SF1 protein expression
has been linked to CSCs and previously employed for prospective
isolation of cancer subpopulations in human and murine breast
models^[158]53,[159]54. Therefore, we set up a strategy for isolating
TM4SF1^high cells by FAC-sorting (gated on top 5%; Fig. [160]2g and
Supplementary Fig. [161]5a–d); the TM4SF1^high population showed
extensive upregulation of several genes in the S1 signature compared to
the bulk population, and this was not the case for genes in S2 and S3
signatures (RT-qPCR and RNA-seq; Fig. [162]2g and Supplementary
Fig. [163]6a–c). Of note, the expression of the S1 signature was
maintained even after several passages in culture (Supplementary
Fig. [164]6c and Supplementary Data [165]9). TM4SF1^high-associated
genes were mainly related to invasion and metastasis pathways and
suggestive of TWIST1, STAT3, and HIF1A activation (Supplementary
Fig. [166]6d). Limiting dilution transplantation is a well-established
approach to quantify the tumour-initiating content of a cell
population. We injected orthotopically serial dilutions from bulk and
TM4SF1^high populations (Fig. [167]2h and Supplementary Fig. [168]5a)
into NSG (NOD/SCID/IL2Rγ[c]^−/−) immunodeficient mice. At the lowest
dilution (100 cells), TIC number is a limiting factor and TM4SF1^high
cells developed tumours with higher efficiency than mice transplanted
with the same number of bulk cells (26% average latency reduction;
Fig. [169]2i and Supplementary Fig. [170]6e, g), suggesting that the
TIC content of the TM4SF1^high subpopulation is higher. We conclude
that S1 holds an increased tumour-initiating capacity compared to the
whole SUM159PT population.
The S3 programme confers a selective growth advantage upon chemotherapy
We next investigated the response of cancer clones upon drug response
in vitro on cultured cells and in vivo on transplanted tumours, using
paclitaxel, an anti-mitotic chemotherapy agent used to treat many
cancer types^[171]55. We treated barcode-labelled SUM159PT cells at
50 nM (which corresponds to ~IC95; Supplementary Fig. [172]7a) for 3
days in culture, with the untreated condition as a control
(Fig. [173]3a, top). To evaluate the drug response in vivo, we
transplanted barcode-labelled SUM159PT cells into the mammary fat pads
of six NSG immunodeficient mice; once tumour was formed, mice were
treated with paclitaxel every 5 days (Fig. [174]3a, bottom). In vitro,
treatment induced a deep clonal selection: between 9% and 22% of the
initial clone pool survived ≥10 days post-paclitaxel removal, while
cells cultured for a comparable time span in the absence of treatment
did not undergo clonal selection (Fig. [175]3b, top, and Supplementary
Fig. [176]7c and Supplementary Data [177]7). A comparable effect was
observed in an independent barcoding experiment (Supplementary
Fig. [178]7c). In vivo, paclitaxel treatment delayed tumour growth, but
did not trigger remission of the disease (Supplementary Fig. [179]7d);
the major driver of clone selection was the tumour-initiation capacity
(Fig. [180]3b, bottom). Of note, the clones able to survive and expand
were not randomly selected, but recurrent upon independent treatments,
both in vitro and in vivo (Supplementary Fig. [181]7e, f); therefore,
we reasoned that both survival and proliferation potential were
pre-encoded. We defined the drug-tolerant clone (DTC) pool as the set
of clones that were significantly more abundant after treatment in at
least four out of six samples compared to their average abundance at
basal state (Fisher’s exact test; see section “Methods” and
Fig. [182]3c). We detected 171 and 164 DTCs in vitro and in vivo,
respectively. When traced back to their baseline transcriptional state,
clones surviving drug insult in vitro were depleted in S1, but were
more abundant in S3 than expected by chance (Fig. [183]3d, left, and
Supplementary Fig. [184]8a, c and Supplementary Data [185]1). In
contrast, clones surviving drug treatment in vivo were belonging to
either S1 or S3 (Fig. [186]3d, right, and Supplementary Fig. [187]8b
and Supplementary Data [188]1). Note that 71% of TICs were also
drug-tolerant in vivo (Supplementary Fig. [189]8d and Supplementary
Data [190]7), confirming that the effect of paclitaxel in vivo was
modest. When assessing the relative abundance of S1 and S3 clones in
tumours treated with paclitaxel, S3 showed a higher fitness over S1
(Fig. [191]3e, and Supplementary Fig. [192]8e and Supplementary
Data [193]7), in agreement with in vitro results. We deduced that the
drug tolerance phenotype is different from the tumour-initiating
capacity in SUM159PT and is primarily associated with the S3 baseline
programme.
Fig. 3. Drug-tolerant clones are recurrent and are enriched in S3 cells.
[194]Fig. 3
[195]Open in a new tab
a Experimental design, drug tolerance assay. Top: in vitro assay.
Barcoded SUM159PT cells were treated with paclitaxel in vitro and
harvested when single-cell colonies were grown (n = 6). GBC loci were
PCR amplified and sequenced. The untreated parental population at T0
(n = 3) and T1 (n = 3) was also sequenced as a control. Bottom: in vivo
assay. Barcoded SUM159PT cells were injected orthotopically in NSG
mice; after tumour formation, mice were treated or not (see
Fig. [196]2a) with paclitaxel. Parental samples (n = 3) were also
sequenced as a control. b Clone selection upon treatment. Top:
comparison of total clone count (left) and cumulative clone
distribution (right) in parental, untreated, and treated in vitro
samples (bounds of box: upper (q75) and lower (q25) quartiles; centre:
median; upper whisker: min{max(x), q75 + 1.5·IQR}; lower whisker:
max{min(x), q25 − 1.5·IQR}). Bottom: same as above, for parental,
untreated tumour (as in Fig. [197]2a), and treated tumour samples (see
also Fig. [198]2b legend). c Detection of drug-tolerant clones (DTC).
Top: cartoon showing the comparison between clone abundance in each
sample compared to the (average) abundance in the parental population;
clones significantly more abundant in k = 4 out of 6 samples are
defined as drug tolerant. Bottom: bar plot showing the relative clone
abundance in parental and treated samples in vitro and in vivo,
respectively, and grouped by class (drug tolerant or not). d Mapping of
the drug-tolerant clones at parental state (T0). Top: UMAP
representation of T0 cells on gene expression space, with cells
classified as DTC in vitro (left) or in vivo (right) coloured in
orange. Bottom: log-odds ratios obtained from the contingency table
comparing cluster assignment and DTC labelling in vitro (left) or in
vivo (right) across cells at T0. e Association between parental state
(T0) and clone expansion in vivo, without and with treatment. Top: bar
plot showing the relative normalised abundance of T0 clusters in every
untreated (left, as reported in Fig. [199]2f) or treated tumour
(right), respectively (unassigned clones shown in grey). Bottom:
cartoon highlighting the subpopulations enriched in DTCs (odds ratio
values reported below) [a Created with Biorender.com released under a
Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International
license].
GALILEO links cancer clones with transcriptional programmes and DNA
accessibility states at single-cell resolution
To investigate the epigenetic state of cancer clones and relate it to
the transcriptional readout, we developed Genomic bArcoding pLus
sIngLE-cell multi-Omics (GALILEO). Specifically, we performed
single-cell Multiome ATAC plus gene expression sequencing on
barcode-labelled SUM159PT nuclei in two biological replicates
(Fig. [200]4a). This assay enables access to gene expression, DNA
accessibility, and clone information simultaneously at single-nucleus
resolution. At baseline, we identified 2023–2024 unique clones and
assigned a clonal identity to between 73% and 86% of high-quality
nuclei (Supplementary Fig. [201]9a and Supplementary Data [202]1). We
obtained seven and six clusters for the two replicates, respectively,
and retrieved the subpopulations S1, S2, and S3 previously identified
by scRNA-seq (Fig. [203]4b and Supplementary Fig. [204]9b, c), with
largely overlapping gene signatures (Supplementary Fig. [205]9d and
Supplementary Data [206]10). When evaluating their DNA accessibility
state, nuclei from S1, S2, and S3 showed distinct profiles in the
scATAC-seq space, comparable to the scRNA-seq results (Supplementary
Fig. [207]9e). To identify patterns of co-accessibility in the set of
~10^5 regions detected from scATAC-seq, we used cisTopic^[208]56, a
tool for scATAC-seq data analysis based on a topic modelling framework.
Briefly, topics^[209]56 are hidden variables represented as probability
values across all ATAC regions in the dataset, and, conversely, cells
are represented as probability values over topics; the benefit of this
approach is that the number of topics is typically much smaller than
the number of regions. Groups of regions and groups of cells where
these regions are co-accessible are thus simultaneously captured via
their association with topics. Next, we compared the region probability
across every topic pair between the two replicates using the
irreproducible discovery rate (IDR; see section “Methods” and
Supplementary Fig. [210]10a). We defined a subset of reproducible
regions (i.e., satisfying IDR < 0.05) for each topic pair, referred to
as ATAC module hereafter, and assigned a reproducibility score to them
(Fig. [211]4c, left, and Supplementary Fig. [212]10b and Supplementary
Data [213]11). Note that our approach based on topic modelling discards
ubiquitously accessible regions; combined with IDR filtering, this
results in a substantial reduction of the size of the dataset (see pie
chart in Fig. [214]4c). Most reproducible regions were found in few,
large ATAC modules containing more than 400 regions, the largest one
containing 1511 regions (Fig. [215]4d and Supplementary Fig. [216]10c).
This few-to-few mapping across replicates suggests that the grouping of
the regions into ATAC modules is non-stochastic. Therefore, each of
these modules is expected to identify a pool of genomic elements that
jointly participate in the regulation of gene expression. More than 90%
of the regions could be assigned a regulatory element, according to the
ENCODE cCRE registry: 9% were annotated as promoter-like signatures
(PLS), within 200 bp of transcription start sites (TSS) of genes, 7%
and 75% as proximal and distal enhancer-like signatures (pELS, dELS),
respectively (Fig. [217]4c).
Fig. 4. A major DNA accessibility programme predicts tumour initiation and is
associated with S1 and S3 subpopulations.
[218]Fig. 4
[219]Open in a new tab
a Genomic bArcoding pLus sIngLE-cell multi-Omics (GALILEO) strategy.
Barcoded SUM159PT cells are processed with Multiome. The readout of
each cell is lineage, gene expression profile and DNA accessibility
state. b UMAP representation in gene expression space coloured by
cluster (replicate 1, 2446 nuclei). c Topic modelling on ATAC-seq
regions. Left: comparison of the output of topic modelling on the two
replicates; and entries are reproducibility scores (see section
“Methods”); topics highly correlated with coverage are not shown (see
Supplementary Fig. [220]10b). Centre: topic pairs ordered by
non-increasing score (yellow: non-empty modules). Right: reproducible
region fraction and their annotation by ENCODE registry (bottom) of
candidate cis-regulatory elements (cCREs) as PLS (promoter), pELS
(proximal enhancer), or dELS (distal enhancer) (replicate 1). d
Comparison between ATAC modules and gene expression in single nuclei
(replicate 1). Rows are the top 20 scoring modules, columns are nuclei
and entries are module AUC scores representing the overall
accessibility of a module. The association (AUC) of a module to
subpopulations (S1, S2, and S3) and cancer fates (TIC and DTC) in vitro
is shown on the right; high associations (AUC > 0.75) with any
subpopulation are highlighted in bold. Columns are clustered on
Euclidean distances using a complete method from hclust package. e
Module 1 AUC as a predictor of S1 (red), S3 (black), or TICs (grey) on
replicate 1. f Association between module AUC scores and gene
expression (replicate 1). Each dot is a gene and its value represents
the (positive) Spearman’s ρ correlation coefficient between its
expression and module AUC score. Genes are coloured according to
scRNA-seq and Multiome gene signatures. Transcription factors with
ρ ≥ 0.2 in either module are labelled. g TF enrichment on the genes
whose expression is positively correlated with Module 1 AUC. Left: TFs
sorted by rank, with top 10 ranked TFs labelled. Right: fraction of
genes (coloured in pink) whose locus is ≤100 kbp away from any region
in Module 1, for the 10 top-ranked TFs [a Created with Biorender.com
released under a Creative Commons Attribution-NonCommercial-NoDerivs
4.0 International license].
ATAC modules recapitulate the multiple DNA accessibility profiles of gene
expression clusters
The regions found in the same ATAC module are accessible in the same
cells, and, by definition, are reproducible across replicates. We
investigated the relationship between ATAC modules and transcriptional
or clonal subpopulations. We computed a score for each ATAC module M
and for each cell c (AUC; see section “Methods”), representing the
overall accessibility of the regions of M in c, and referred to as
module AUC hereafter. Using module AUC, we associated several large
ATAC modules with either S1, S2, or S3 (Fig. [221]4d and Supplementary
Fig. [222]10c). Among the 20 highly reproducible modules, 8
(highlighted in bold in Fig. [223]4d) could be associated to any of S1,
S2, or S3, with Module 1 (1511 regions) being the top predictor of both
S1 (AUC = 0.98) and S3 (AUC = 0.84) (Fig. [224]4e). Module 20 (50
regions) showed an equally strong association with S3 (AUC = 0.84). To
further assess the relationship between ATAC modules and
subpopulations, we correlated genome accessibility (using module AUC,
see above) with gene expression across cells (Spearman’s ρ;
Fig. [225]4f and Supplementary Data [226]12). This approach ensures
that mechanisms involving either cis or trans genomic elements can be
captured, as no constraint on region-gene proximity was used. Overall,
module AUC was positively correlated with the expression of genes of
the associated subpopulation signature (Fig. [227]4f and Supplementary
Fig. [228]10d). These results support the hypothesis that ATAC modules
contain regulatory elements jointly involved in the control of specific
transcriptional programmes.
Tumour-initiating clones share a common chromatin priming state
We then sought to investigate the role of ATAC modules in cancer
phenotypes, namely, tumour-initiation and drug-tolerance capacity.
Importantly, Module 1, the top reproducible accessibility state,
predicted TICs with high specificity and sensitivity (Fig. [229]4e and
Supplementary Fig. [230]10e), consistently with it being associated
with S1 and S3, which, in turn, are enriched in TICs (see also
Fig. [231]2e, f). This suggests that the tumour-initiating capacity may
be linked to a specific and pre-existing epigenetic state and may
explain the phenotypic relationship between the cells of S1 and S3.
Subsequently, we used transcriptional and epigenetic information
jointly and at single-cell resolution to highlight the gene regulatory
networks and the epigenetic determinants involved in the
tumour-initiating capacity. For each module m, we used the set G[m] of
positively correlated genes (see above) as input to measure the
transcription factor activity in m given a set of putative targets for
each TF^[232]57 (Fig. [233]4g and Supplementary Data [234]13). Among
the top-ranked TFs for Module 1, we detected several TFs that have been
previously linked to tumour-initiation capacity, including TWIST2,
PRRX1, and RUNX2; note that the S1 gene signature is enriched in RUNX2
targets (see Supplementary Fig. [235]4h). TWIST2 is a member of the
TWIST family of TFs, which has been extensively associated with poor
tumour prognosis, EMT, and stem-cell activity in breast
cancer^[236]58–[237]60. Similarly, RUNX2 activity has also been linked
to the regulation of EMT, matrix remodelling, and invasive
phenotypes^[238]52, which lead to metastasis, notably in breast
cancer^[239]35,[240]61; finally, PRRX1 has been recently shown to
sustain metastatic dissemination and induce a switch to a
mesenchymal-like state in a melanoma cancer model^[241]26. RUNX2,
TWIST2, and PRRX1 were top ranked in Module 1 also using the set of
genes proximal to Module 1 ATAC regions as input (Supplementary
Fig. [242]10f). For each TF, we identified its regulon by the set of
positively correlated target genes whose locus is proximal (<100 kbp)
to any region of Module 1. Several genes in the S1 signature, including
procollagen C-endopeptidase enhancer 1 (PCOLCE) and collagen-encoding
genes (COL6A1, COL6A2, COL5A1, COL5A3), were found as part of PRRX1 and
TWIST1 regulons (Fig. [243]5a and Supplementary Fig. [244]10g). To
directly verify the role of the genomic elements of Module 1 in gene
regulation, we selected two regions located in the proximity of COL6A1
and COL6A2, highly accessible in S1 cells (Supplementary Fig. [245]10h)
and classified as dELS by ENCODE cCRE (Fig. [246]5b, top).
Subsequently, we targeted the two regions by means of an inducible
CRISPR interference strategy^[247]62,[248]63 (Fig. [249]5c). We
observed that repression of either region led to a consistent and
reproducible reduction in COL6A1 (up to 36% and 46%) and COL6A2
expression (up to 43% and 62%; Fig. [250]5d), both at early (3 days)
and late time points (10 days) post-dCas9-KRAB induction.
Fig. 5. Silencing of selected Module 1 regions triggers a reduction in
transcription of proximal genes.
[251]Fig. 5
[252]Open in a new tab
a The “PRRX1 regulon” includes the set of genes whose expression is
positively correlated with Module 1 AUC and that (i) are predicted as
PRRX1 targets by ChEA3 and (ii) lie at ≤100 kbp from any region in
Module 1. b COL6A1 and COL6A2 loci; shown are the scATAC-seq peaks
(aggregate signal, replicate 1). Red dots label Module 1-specific
regions. In the magnification is shown the region containing the two
enhancers, ENH1 and ENH2, together with ENCODE regulatory tracks for
H3K4Me1, H3K27Ac, and DNase clusters and the position of sgRNAs for
CRISPRi (see also d). c Scheme showing the CRISPRi approach and the
model of Enhancer-Gene pair regulation. d RT-qPCR data showing the
impact in COL6A1 and COL6A2 expression in TM4SF1^high SUM159PT_KRAB
cells observed upon expressing sgRNAs targeting ENH1 (in pink) or ENH2
(in blue) (see also b). Data are measured at day 3 (top) or day 10
(bottom) after dCas9-KRAB induction by doxycycline (n = 6 and 7 sgRNAs,
respectively) and compared to a group of non-targeting sgRNAs (n = 4)
[c Created with Biorender.com released under a Creative Commons
Attribution-NonCommercial-NoDerivs 4.0 International license].
A subset of the drug-tolerant subpopulation exhibits a pre-existing genomic
amplification
Module 20 predicts S3 with high specificity and sensitivity (AUC = 0.84
and 0.75 in the two replicates; Fig. [253]6a and Supplementary
Fig. [254]11a). As shown in Fig. [255]3, the S3 programme is associated
with increased drug tolerance, both in vitro and in vivo. We noticed
that most regions of Module 20, as well as several genes of the S3
signature, were located on a 5.5 Mbp-long region of chromosome 11
(Supplementary Fig. [256]10i). The odds are low that an epigenetic
regulatory mechanism involves a cluster of highly localised genomic
elements and thus we reasoned that a genetic alteration might better
explain the transcriptional profile of S3. When interrogating the
whole-exome sequencing profile of paclitaxel-treated samples and
comparing it to that of untreated cells (n = 3; see Fig. [257]3a), we
detected 18 recurrent copy-number variants (CNVs; Fig. [258]6b and
Supplementary Fig. [259]11b). Notably, the top amplified region
(average log[2](FC) = 0.53 and p value = 5.65e − 185) lied on
chromosome 11, specifically across bands 11q23–11q24 (Fig. [260]6c).
These results suggest that the amplification was already present in S3
cells before treatment and that Module 20 captures a specific genetic
background of S3, rather than a localised increase in chromatin
accessibility. This implies that the drug tolerance phenotype is, at
least in part, genetically determined, and, in turn, suggests that a
subset of DTCs could maintain a stable memory of the treatment.
Therefore, we investigated the susceptibility of cells to paclitaxel
upon a recovery period of 24 days after a first round of treatment (see
section “Methods”, Fig. [261]6d, top, and Supplementary Fig. [262]11c).
Upon a second round of treatment, clonality was reduced by 63% (average
recovery at day ≥ 17 compared to day ≤ 3; Fig. [263]6d, bottom),
suggesting that chemotherapy was still effective; however, drug
sensitivity was low compared to a single round of treatment, where only
19% of clones survived. Finally, to verify the specificity of the
association observed between the chr11 region amplification and
resistance to paclitaxel, we examined a drug resistance model, where
SUM159PT cells were repeatedly treated with increasing doses of
paclitaxel (see section “Methods”, Fig. [264]6e, top, and Supplementary
Fig. [265]11c) up to the onset of a drug resistance phenotype. The WES
profile confirmed an amplification on chromosome 11 whose locus
overlaps 75% of the above-detected one (Fig. [266]6e, bottom, and
Supplementary Fig. [267]11d and Supplementary Data [268]14). We
conclude that, in SUM159PT, paclitaxel-based chemotherapy causes a
clonal expansion of a subpopulation harbouring the amplification of
11q23–11q24.
Fig. 6. Drug-tolerant clones display chr11 amplification.
[269]Fig. 6
[270]Open in a new tab
a Module 20 AUC as a predictor of S3 (black) or DTCs in vitro (grey) on
Multiome replicate 1. Bottom: UMAP representation of T0 cells on gene
expression space coloured as in Fig. [271]3d. b Copy-number variants
(CNVs) pre- and post-treatment (drug tolerance assay, replicate 1).
ATAC-seq coverage (Multiome) on CNV loci^[272]52. Rows are consensus
CNV from paclitaxel-treated samples (day 15; n = 3; experiment as in
Fig. [273]3a); columns are nuclei in Multiome replicate 1; entries are
cumulative ATAC counts per CNV locus per nucleus. The coverage
log[2](FC) between each treated sample and the parental (left) and
chromosome location (right) are indicated. Columns are clustered with a
complete method on Euclidean distances using hclust. c Circos plot
showing the association between Module 20 regions at baseline (see
Fig. [274]4c) and CNVs in treated condition. In the dot plot, the y
axis is the IDR for the regions in Module 20; CNVs for each replicate
are coloured by log[2](FC); consensus CNVs are shown in black. d Drug
tolerance assay, second round. Top: SUM159PT cells were treated with
paclitaxel and clone selection was stabilised until T0′; cells were
subsequently infected with the GBC library sorted by BFP expression,
subjected to a second round of treatment, and harvested at T1′. The
three populations (T0, T0′, and T1′) were sequenced. Bottom left: total
clone count at T0′ and T1′ (bounds of box: upper (q75) and lower (q25)
quartiles; centre: median; upper whisker: min{max(x), q75 + 1.5·IQR};
lower whisker: max{min(x), q25 − 1.5·IQR}). Bottom right: cumulative
clone distribution after either one round (see also Fig. [275]3b, top)
or two rounds of treatment. e Top: long-term drug resistance assay in
vitro. SUM159PT cells were repeatedly treated with increasing doses of
paclitaxel until resistant clones were obtained, which were then
processed by WES. Bottom: ATAC-seq coverage (Multiome) on CNV
loci^[276]52, as in (b) [d, e Created with Biorender.com released under
a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International
license].
Two distinct clone lineages can enter a drug tolerance state
We note that the 11q23–11q24 amplification is only found in S3.
However, in vitro, many DTCs fall outside S3 (see Fig. [277]6a and
Supplementary Fig. [278]8a), highlighting their molecular heterogeneity
at baseline. To further characterise the pathways leading to drug
tolerance in vitro, we designed a single-cell time-course experiment on
barcoded SUM159PT cells (Fig. [279]7a, top). To minimise technical
variability, we treated cells for 3 days with paclitaxel, collected
samples every 2 days after drug removal (days 5–15) and sequenced them
simultaneously using a reverse time course as in ref. ^[280]64.
Consistent with untreated samples, we assigned a clonal identity to
most of the high-quality cells (80–84%, Supplementary Fig. [281]12a,
b). Treatment elicited a progressive clone selection, from 1126
distinct clones at day 5 to 433 at day 15 (Fig. [282]7a and
Supplementary Data [283]15). An independent experiment confirmed the
clone selection dynamics (Supplementary Fig. [284]12b, c). Chemotherapy
induced a substantial transcriptional change (Fig. [285]7b, c, and
Supplementary Fig. [286]12c, d and Supplementary Data [287]16). At
initial time points (d5–d9), most cells were drug-sensitive and
developed two distinct responses to treatment: the one characterised by
the induction of stress response pathways, including amino-acid
deprivation, unfolded protein response, and inflammation (cluster 1);
the other mostly characterised by autophagy (cluster 2; Fig. [288]7d,
and Supplementary Fig. [289]12e and Supplementary Data [290]17). At
late time points, surviving cells showed enhanced translation activity
(cluster 3). At initial time points (d5 and d7), cells belonging to the
surviving pool (i.e., DTCs) accounted for only 9% of the whole sample
and their transcriptional profile was scattered (Fig. [291]7b and
Supplementary Fig. [292]12d). DTCs survived the treatment by remaining
in a state of suspended proliferation for several days and, starting
from days 9 to 11, entered an intense proliferation phase; from day 13
on, DTCs constituted 73%–85% of surviving cells (Fig. [293]7a and
Supplementary Fig. [294]12c) and acquired a distinguishing
transcriptional profile (Fig. [295]7e). At late time points (days
13–15), cells stemming from the same clone at baseline overall
displayed a divergent transcriptional programme (solid line),
comparable to that of cells belonging to different clones (dashed line,
Fig. [296]7f). Then, we asked whether any distinguishing
transcriptional footprint exists in highly expanded clones. To do this,
we devise an unsupervised approach to find sets of mutually similar
clones (by defining a pair propensity score across gene expression
neighbourhoods; see section “Methods” and Fig. [297]7g). Two groups of
clones, or lineages, showed mutually high transcriptional similarity
(Fig. [298]7g, h and Supplementary Fig. [299]13a), suggesting that
multiple pathways to drug tolerance may exist in SUM159PT cells. Both
lineages contained highly abundant clones, indicating that the
transcriptional readout is not associated with proliferation potential.
The two lineages were reproducible both across time points and across
independent experiments (Supplementary Fig. [300]13b, c). On average,
they accounted for 50% (lineage 1) and 35% (lineage 2) of the cells at
late time points (the remainder fraction belongs to unclassified
clones). Notably, clones in lineage 1 stemmed from S3 and were
characterised by a pre-existing genomic amplification on band
11q23–11q24. Among the upregulated genes in lineage 1, we detected
MT1E, whose relevance in breast and other cancer types has been
extensively proven^[301]65,[302]66, as well as FEZ1 and RPS25, top
significant genes in the S3 signature (Fig. [303]7i, j and
Supplementary Data [304]16). Consistently, genes located within the
11q23–11q24 amplification, which is specific to S3, were upregulated in
clones belonging to lineage 1 (Fig. [305]7k and Supplementary
Fig. [306]13d, e). This showed that the transcriptional differentiation
breadth of the two lineages is determined by the genetic background of
the ancestor clone, specifically, depending on whether it carries the
11q23–11q24 amplification or not. In contrast, we did not detect any
lineage 2-specific copy-number aberration (Supplementary
Fig. [307]13f). The top upregulated genes in the lineage 2 signature,
namely, S100A2, IGFBP2, IFI27, and PVT1, were most highly expressed
immediately following treatment in both DTCs and non-DTCs
(Fig. [308]7j), suggesting that the transcriptional programme of
lineage 2 is not DTC-specific. The early response to paclitaxel
includes upregulation of PVT1, a long non-coding gene acting as a
negative regulator of the transcription factor MYC, a key regulator of
growth and cellular metabolism, frequently associated to breast
cancer^[309]67,[310]68. Consistently, we observed that MYC activity
decreased immediately after treatment and increased during adaptation
in response to it (Supplementary Fig. [311]13g), with only slight
differences between the two lineages (Supplementary Fig. [312]13h).
Consistent with our findings, recent evidence showed that reduced MYC
activity promotes a chemotherapy survival phenotype in breast cancer
via the adoption of an embryonic-like diapause state^[313]69. In
conclusion, we mapped the transcriptional response upon drug treatment
at a clonal level and isolated different pathways of cancer
transcriptional evolution leading to resistance, one of them being
invariably linked to a pre-existing genetic rearrangement.
Fig. 7. Drug-tolerant clones stem from two distinct lineages.
[314]Fig. 7
[315]Open in a new tab
a Time-course drug tolerance assay (exp1). Top: SUM159PT cells were
treated with paclitaxel for 3 days, harvested every 2 days
post-paclitaxel removal (n = 1/time point), and processed with
scRNA-seq using reverse time-course multiplexing. Bottom: fraction of
DTC in vitro (see Fig. [316]3c, top) across time points. b
Drug-tolerant clone selection. In total, 7884 treated cells are shown
and coloured by clone class. c UMAP representation of cells at days 5,
7, 9, 11, 13, and 15 and coloured by cluster. d Cluster signatures:
rows are genes, columns are clusters, and entries are log[2](FC) values
between a cluster and its complement. Significantly upregulated genes
in any cluster are shown and order by average log2(FC) and adjusted p
value ranking (lower to higher; MAST method, Bonferroni correction).
The top 10 genes for each cluster are labelled. e Gaussian kernel
density of Euclidean distances in gene expression between DTCs,
non-DTC, and between DTCs and non-DTC, before (T1, left) or after
treatment (exp1, right). f Gaussian kernel density of Euclidean
distances between sister and non-sister cells at day 15. g Gene
expression similarity bias by pair propensity. Left: calculation of
pair propensity between clones (see section “Methods”). Right: pair
propensity for top expanded clone pairs at day 15 (clones i with
p[ii] < 1 not shown); clones are annotated with clone abundance (count
per thousand cells, rows) and lineage (columns). h UMAP representation
of cells at days 13 and 15 coloured by lineage assignment (unassigned
clones in grey). i Rows are lineage gene signatures, columns are
independent experiments, and entries are average log[2](FC) values
between lineages 1 and 2 after treatment. Genes are ordered by
non-increasing log[2](FC) in exp1. The 10 DEGs with higher and lower
log[2](FC) are labelled. j UMAP plot of cells at day ≥ 5 coloured by
log-normalised gene expression of the four top log[2](FC) genes of
lineages 1 and 2. k Rows are genes whose locus lies in
chr11:118307287–123754518 (see Fig. [317]6b), columns are cells at day
15 and entries are scaled log-normalised UMI counts. Rows are sorted by
non-increasing average expression; columns are clustered with complete
method on Euclidean distances.
Discussion
One of the main challenges in cancer biology is predicting how tumours
evolve in response to changes in the tumour environment. The capacity
of one or more clones to sustain tumour growth at distal sites or to
trigger disease relapse upon cytotoxic treatment may depend on a
specific set of molecular characteristics. Their identification has
been the focus of intense research efforts, both for the clinical
applications and for the understanding of the mechanisms underlying
tumour plasticity.
Recently, it has been suggested that the cancer stem-like pool might be
heterogeneous, with distinct subpopulations primed to different
fates^[318]4–[319]10. The SUM159PT model is one representative example,
in which distinct states coexist in equilibrium^[320]33. Here, we
provided the distinctive transcriptional and epigenetic traits of each
sub-pool. We analysed tumour initiation and drug tolerance at
single-cell level on lineage-barcoded cells and on the same system,
providing a high-resolution representation of tumour complexity.
We initially asked which clonal subpopulations lie in a defined
transcriptional or epigenetic state. In the case of SUM159PT, only a
fraction of clones displayed a stable transcriptional profile at the
subpopulation level, hinting that these may encode for specific
functions. Indeed, the TICs were almost exclusively associated with
either S1 or S3 signatures, with S1 showing the strongest enrichment
and S3 also encoding for DTCs. The third signature (S2) was not
attributable to any cancer property, although it contains basal markers
(e.g., miR-205HG, HMGA1).
An important and direct conclusion of our study is that TICs and DTCs
do not coincide but coexist in the same cancer population, sharing a
minor subset of clones (belonging to S3). TICs and DTCs can
significantly change their transcriptional profile during cancer
evolution, adapting to the environment and conditions. In transplanted
tumours, TICs lose their baseline signature upon expansion and maintain
upregulation of only a handful of markers (e.g., S100A4, TM4SF1), whose
phenotypic relevance is confirmed in the
literature^[321]20,[322]37,[323]39,[324]52. Similarly, DTCs undergo a
massive transcriptional reprogramming after treatment and, thus, are
strikingly distinct from their non-DTC counterpart, a behaviour
observed in other cancer cell models^[325]64,[326]69. Using lineage
tracing information alongside a time-course single-cell profiling, we
could describe two distinct and co-occurring transcriptional
trajectories in drug adaptation. Differently from TICs, the DTC
subpopulation did not show a strong transcriptional or epigenetic
determinant at the baseline. However, a subset of DTCs, lying in S3,
shows amplification of a 5.5 Mbp-long region of chromosome 11 (bands
11q23–11q24). This genetic background also reproducibly segregated with
chronic treatment resistance in the SUM159PT model. To our knowledge,
this amplification is not reported as a recurrent alteration in cancer
([327]https://cancer.sanger.ac.uk/cosmic). We confirmed the existence
of a large amplicon spanning the chr11:118M–126M region in a small
fraction of primary breast tumour samples from the TCGA dataset (73 out
of 1084 cases; cBioPortal), but we could not assess its association
with resistance to chemotherapy treatments, nor the role in the
phenotype of the individual genes lying in the locus. However, among
the genes lying in the chr11:118M–126M region is MIR100HG, a long
non-coding RNA that encodes for miR-100 and miR-125b, the latter being
a miRNA known to confer resistance to taxol treatment in TNBC cell
lines^[328]70,[329]71.
A key innovative element of our study is the use of a cutting-edge
approach combining single-cell multi-omic profiling (transcriptome and
DNA accessibility) with lineage tracing (clone information at an
arbitrary time t, which we call the baseline—P0). Specifically, we
found a putative regulatory programme (Module M1) common to both S1 and
S3, which elegantly links the two tumour-initiating states. Moreover,
the inferred transcription factors and the corresponding regulons
comprise both genes and regulatory regions, including non-coding
components (lncRNAs and enhancers). Note that these elements might be
detectable in bulk experiments, but only the single-cell resolution can
explain their relationship, which may be subpopulation-dependent. The
TF hubs of the predicted regulons are fully supported by the
literature; for instance, PRRX1, TWIST2 and RUNX2 have been linked to
breast cancer and to the EMT, a founding element shared by both tumour
aggressiveness and stem cell identity
programmes^[330]52,[331]72,[332]73.
An important question is whether the molecular traits distinctive of
TIC are specific to SUM159PT or are generalisable. In large breast
cancer datasets, the upregulation of S1 and S3 signature genes predicts
basal features, which are typically associated with cancer
stemness^[333]74. S1 signature genes have also been found as
upregulated within cell subpopulations in other breast cancer models
(MDA-MB-231 TGL), as well as associated with general cancer
meta-programmes across over 1000 primary tumour samples^[334]51. Of
note, the strongest association was found with the hybrid EMT
meta-programme, which shares several markers with S1, including TM4SF1,
used to enrich for functional TICs in SUM159PT (in this work) and in
other experimental models, such as MDA-MB-231, murine 4T1 breast cancer
cells, and MMTV-Neu tumours^[335]53,[336]54. Furthermore, by employing
either lineage tracing or single-cell omics, recent literature
highlighted programmes with remarkable similarities to the ones we
reported. In a mouse model of metastatic pancreatic cancer, Simeonov et
al. highlighted a hybrid EMT state in metastatic dissemination which
shares S100 family gene expression (see Supplementary Fig. [337]4g) and
is predictive of reduced survival in both pancreatic and lung cancer
patients^[338]20. In a genetically engineered mouse lung cancer model,
a co-accessibility module characterised by RUNX2 activity was
identified and linked with the acquisition of a pre-metastatic state
and the outcome of human lung cancer^[339]52. Interestingly, the S1
signature genes that are also significantly associated with the hybrid
EMT meta-programme are putative RUNX2 targets (see Supplementary
Fig. [340]4h). In line with this observation, the KP-tracer approach
for in vivo lung cancer lineage tracing allowed to show that tumours
evolve through stereotypical trajectories, with the transient
activation of cellular plasticity programmes and a subsequent clonal
sweep of highly fit subpopulations marked by an early or late
mesenchymal transition^[341]27. Finally, single-cell lineage tracing
revealed the underlying programme of a pool of metastatic initiating
cells in melanoma, characterised by high PRRX1 expression and promoting
the establishment of a mesenchymal-like cell state^[342]26.
Our and published evidence suggest a model where the mechanisms
influencing clonal fate and tumour evolution tend to converge towards a
common epigenetic state, often established before a challenge and,
therefore, predictable. The main programme we identified was a hybrid
EMT and the associated transcription factors (e.g., RUNX2 and TWIST).
It is worth noting that one of the results obtained with GALILEO
approach is the precise reconstruction of the network of genes and
specific regulatory elements related to the above programme in TNBC
cells. Indeed, these elements can potentially highlight cancer
dependencies with clinical impact. We foresee that combining
cutting-edge molecular tools at the genome scale, like the ones
presented here, as well as genetic ones, with suitable computational
frameworks, could critically contribute to further dissect the role
played by different transcriptional, epigenetic and genetic layers in
cancer evolution. Our study has made it clear that a multi-layered
framework is feasible and an invaluable resource to this end. Finally,
our work directly shows that both genetic and epigenetic mechanisms can
promote cancer evolution towards specific fates, and that these
mechanisms may coexist in the same tumour within specific cell
subpopulations.
Methods
Mice
All animal studies were conducted with the approval of the Italian
Ministry of Health and in compliance with the Italian law (D.lgs.
26/2014), which enforces Dir. 2010/63/EU (Directive 2010/63/EU of the
European Parliament and of the Council of 22 September 2010 on the
protection of animals used for scientific purposes) and EU 86/609
directive. Proper permit and consent were granted (Protocol No.
779/2020) by the institutional organism for ethics and animal welfare
on experimental procedures (OPBA, Cogentech). Animals were bred and
maintained under pathogen-free conditions in a controlled environment
(18–23 °C, 40–60% humidity and with 12-h dark/12-h light cycles) at the
certified Cogentech mouse facility located at IEO/IFOM campus (The FIRC
Institute of Molecular Oncology, Milan, Italy).
In vivo xenograft
Immunodeficient NOD.Cg-PrkdcscidIL2rgtm1Wjl/SzJ (also known as
NOD/SCID/IL2Rγ[c]^−/−) mice were anesthetised by intraperitoneal
injections of 1.25% solution of tribromoethanol (0.02 ml/g of body
weight). Barcode-bearing SUM159PT cells were resuspended in a mix of
14 μl PBS and 6 μl Matrigel and implanted in the fourth inguinal
mammary gland of 10-week-old animals. Mice were monitored twice a week
by an investigator. For the chemotherapy treatment studies, tumours
were allowed to grow to palpable lesions (~20–30 mm^3), then mice were
randomised into groups and each group was treated intraperitoneally
with either paclitaxel (PTX) (10 mg/kg in PBS) or vehicle (PBS) every 5
days for a total of three injections. Mice were euthanised according to
our experimental protocol and institutional guidelines when tumours
euqalled 1.2 cm in their largest diameter. Maximal tumour burden was
not exceeded. Tumour growth dynamics were monitored every 3 days by
calipers measurements. For in vivo limiting dilution assay
transplantation experiments, decreasing dilutions (1:10,000; 1:1000;
1:100) of SUM159PT were resuspended in a mix of 14 μl PBS and 6 μl
Matrigel and transplanted in the fourth inguinal mammary gland of
10-week-old animals. Animals were monitored as before and euthanized
after 1.5–3 months (depending on tumour latency).
Tissue harvest and processing
The primary tumours were removed when the tumour reached an approximate
diameter of 1.2 cm. The animals were anesthetised with tribromoethanol,
and the tumours were resected. The solid tissue was rinsed with PBS,
minced with scalpels, followed by mechanical dissociation using
gentleMACS (Miltenyi) in a digestion mix (collagenase, hyaluronidase,
5 μg ml^−1 insulin, Hepes, hydrocortisone). The cell suspension was
incubated for 25′ at 37 °C followed by an additional step of gentleMACS
dissociation. Following a wash with base medium the cells were
consecutively passed through 100, 70 and 40 μm filters. Primary tumour
cells were treated with ACK lysis buffer (Lonza) followed by
resuspension in 1% BSA/PBS and processed using the Mouse and Dead Cell
depletion kits according to the manufacturer’s directions (Miltenyi).
Cell cultures
We maintained HEK293T and MDA-MB-231 TGL cells in Dulbecco’s Modified
Eagle’s Medium (DMEM) with 10% of TET-FREE foetal bovine serum (FBS)
and 1% penicillin–streptomycin. Cells were grown in a humidified
atmosphere at 5% CO[2] at 37 °C. SUM159PT cell line and derivatives
(Asterand) were cultured in Ham’s F12 medium with 5% TET-FREE FBS,
human insulin (5 μg/ml), hydrocortisone (1 μg/ml), and Hepes (10 mM).
Barcoded SUM159PT and CRISPRi cell line medium was supplemented with
2 μg/ml puromycin for selection. Cells were grown in a humidified
atmosphere at 10% CO[2] at 37 °C.
Perturb_SUM159PT cell line generation
Perturb-seq GBC library^[343]18,[344]75 was a gift from Jonathan
Weissman (Addgene ID #85968). The library contains a random 18-nt guide
barcode (GBC) close to the polyadenylation signal of the blue
fluorescent protein (BFP). The estimated complexity of the library is
>5 million unique GBCs. We amplified the Perturb-Seq GBC library in
Escherichia coli (E. coli) ElectroMAXTM DH5α-ETM electro-competent
cells (Thermo Fisher Scientific), as indicated by the authors^[345]75.
DNA extraction was performed with NucleoBond Xtra Maxi (Macherey-Nagel)
kit according to the manufacturer’s instructions. Viruses were produced
in HEK293T at 80% of confluency with the following transfection mix:
20 μg Transfer Vector, 13 μg of psPAX2 (gag&pol) (Addgene #12260), 7 μg
of pMD2G (envelope) (Addgene #12259), 94 μL of RT CaCl[2] and water up
to 750 μL. Then, 750 μL of 2xHBS were added dropwise and 500 μL/10 cm
plate of transfection mix was added to the cells. The medium was
changed after 6 h and viruses harvested after 48 h, filtered (0.22 μm)
and ultracentrifuged for 2 h at 50,000 × g (rotor SW32 Ti, Beckman
Coulter), at 4 °C. The viral pellet was resuspended in 300 μL of PBS 1X
and stored at −80 °C. To produce the Perturb SUM159PT cell line for
lineage tracing experiments, 75,000 cells were seeded and infected with
an estimated MOI of 0.1 in the presence of 8 μg/ml of polybrene
(Sigma-Aldrich). After selection, transduction efficiency was measured
by FACS analysis, which revealed that 8.6% of cells were successfully
infected.
SUM159PT_KRAB cell line generation and CRISPRi experiments
For the generation of the PB-TRE-dCas9-KRAB plasmid, the DNA sequence
of KRAB repressor domain was amplified by PCR from the pHAGE TRE
dCas9-KRAB (Addgene plasmid #50917) and cloned in frame into the
PB-TRE-dCas9-VPR backbone (Addgene plasmid #63800) within the AscI/AgeI
sites. The cloning was sequence-verified by Sanger sequencing. SUM159PT
cells were transfected in MW6 plates, following Lipo3000 transfection
protocol (ThermoFisher Scientific) with 500 ng of transposon DNA
(PB-TRE-dCas9-KRAB) and 200 ng of SuperPiggyBac transposase helper
plasmid (SystemsBioscience). After at least 72 h from transfection,
cells were selected with 200 μg/ml Hygromycin B. The PB-TRE-dCas9-KRAB
SUM159PT cell line is referred in the text as SUM159PT_KRAB. We
expressed sgRNAs upon cloning into lentiGuide-Puro sgRNA backbone
(Addgene #52963) within BsmBI (Esp3I) sites. Lentiviruses were
generated in six-well plates, following the Lipofectamine 3000 (Thermo
Fisher) protocol. The transfection was performed by mixing the
construct of interest, psPAX2 (gag&pol)(Addgene #12260) and pCMV-VSV-G
(envelope) (Addgene #8454) plasmids at a ratio of 4:3:1. Viruses were
collected after 24 h, filtered and frozen. We generated stable cell
lines expressing single sgRNAs by lentiviral infection of 150,000/well
SUM159PT_KRAB cells. Lentiviral supernatants (1:3 dilution) were added
to cells, supplemented with 1 μg/ml of polybrene. After 24 h, cells
were selected with 2 μg/ml puromycin (Gibco). For the CRISPRi
experiment, we plated sgRNA-expressing stable cell lines with 100 ng/ml
of doxycycline and harvested cells after 3 days.
Paclitaxel treatment
A stock aliquot of paclitaxel (obtained from the IEO hospital) was
prepared (70 μM) in PBS and used to treat cells. The dose of paclitaxel
used for in vitro experiments was established by a dose–response curve
of SUM159PT treated for 72 h. IC50 concentrations were estimated by
parallel fit estimation (JMP software, n = 4). For short-term,
single-treatment experiments, paclitaxel was used at the final
concentration of 50 nM (~IC95). After 3 days of treatment, the medium
was changed every 3 days without adding the drug. For the analysis of
the susceptibility of pre-treated cells (shown in Fig. [346]6d),
SUM159PT cells were treated with paclitaxel (50 nM) and surviving
clones were allowed to recover for 21 days. Subsequently, cells were
infected with the GBC library, sorted by BFP expression and subjected
to a second round of treatment (50 nM). For the generation of the drug
resistant model (shown in Fig. [347]6e), SUM159PT cells were treated
multiple times with increasing doses of paclitaxel (10, 20, 50, and
100 nM). Drug-adapted cells were able to survive even when treated with
100 nM paclitaxel and were collected 90 days after treatment for WES
analysis.
TM4SF1^high flow cytometry
SUM159PT_KRAB cells were stained with anti-TM4SF1-APC (clone: REA851
Miltenyi Biotec) antibody for 10′ at 4°C, in the dark and sorted using
Fusion Aria Sorter. The bulk population was FAC-sorted using FSC/SSC
gate, while for the TM4SF1^high subpopulation different gating
strategies were tested (top 5% or top 10% APC fluorescence intensity as
shown in Supplementary Fig. [348]5d); top 5% showed the best enrichment
for S1 markers (shown in Supplementary Fig. [349]6b) and was used in
subsequent experiments. After passive propagation in vitro, 9 days
after sorting, both populations were transplanted into
NOD/SCID/IL2Rγ[c], mice and tumours were collected as described above.
RNA sequencing
RNA was extracted from mouse-depleted and dead-cell-depleted samples.
The bulk and TM4SF1^high sorted cells were either immediately processed
by RNA-seq or passively propagated in vitro for 43 days (corresponding
to passage 19). Cells after 9 and 43 days from sorting were also
collected and finally processed by RNA-seq. RNA was harvested using
Maxwell RSC miRNA Tissue Kit according to the manufacturer’s protocol.
Libraries for RNA-seq were prepared from 1 μg of total RNA using Truseq
Total RNA Library Prep Kit (Illumina) following the manufacturer’s
protocols. Samples were sequenced paired-end 50 bp on an Illumina
Novaseq6000 instrument.
Whole-exome sequencing
Sequencing was performed with the Agilent SureSelect All Exon v5
(experiments shown in Fig. [350]6c, d) or v7 (experiments shown in
Fig. [351]6a, b), as per manufacturers’ instructions. Libraries were
sequenced with coverage >200X on a Novaseq 6000, with a PE 100 reads
mode.
Multiseq—single-cell time-course for assaying drug tolerance
The single-cell time-course experiment for drug tolerance (shown in
Fig. [352]7) was designed in order to process all time points at the
same time. To do so, we performed an en-reverse experiment (i.e.,
starting from the last time point, d15). The same batch of cells was
used for the entire experiment. Cells were passaged every 2 days and
seeded either for passive culture or paclitaxel treatment (5 × 10^6
cells in 150 mm dishes). After 3 days of treatment (50 nM as described
above), the medium was changed every 3 days without adding the drug. At
day 15, all paclitaxel-treated time points were collected together,
counted and processed with the Multiseq protocol^[353]76. Briefly,
500,000 cells for each sample (d5, d7, d9, d11, d13, d15) were
resuspended in 180 μL of PBS and then labelled with 20 μL of a reaction
mix composed of 2 μM of a sample-specific Barcode, 2 μM of LMO-Anchor
(a kind gift of the Gartner Lab) and PBS. After 5′ of incubation in
ice, we added 20 μl of reaction mix composed of 2 μM of co-anchor in
PBS. After 5′ of incubation we stopped the reaction adding PBS/BSA 1%.
We centrifuged and washed twice the cells before resuspending each
sample in 125 μl PBS/BSA and pooling. After filtering, 25,000 cells
were loaded on the Chromium Controller. Multi-seq library was isolated
from the amplified cDNA and sequenced at 5000 barcode reads/cell depth.
RT-qPCR
Total RNA was extracted using Maxwell RSC miRNA Tissue Kit according to
the manufacturer’s protocol. One microgram of total RNAs was
reverse-transcribed using ImProm-IITM Reverse Transcription System
(Promega) and genes were analysed with Quantifast SYBR green master mix
(Qiagen). RPLP0 was used as a housekeeping gene. The complete list of
primers used in this study is reported in Supplementary Data [354]18.
GBC library preparation from gDNA
The genomic DNA was extracted from 1 to 3 million cells (typically 3
million, coverage 300×) using the NucleoSpin Tissue kit
(Macherey-Nagel). To enrich for GBCs, six parallel PCR reactions were
performed in a final volume of 50 μl using 200 ng of genomic DNA,
0.2 μM of dNTPs mix, 0.5 μM of the following primers: F1:
GGGTTTAAACGGGCCCTCTA and R4: GCCTGGAAGGCAGAAACGAC and amplified using
Phusion^TM High-Fidelity DNA Polymerase at the final concentration of
0.02 U/µl (Thermo-Scientific) (coverage 50×–500×), in its 5X Phusion HF
buffer according to the following PCR protocol: (1) 98° for 30 s, (2)
98° for 7 s, then 60° for 25 s and 72° for 15 s (for 30 cycles), (3)
72° for 5 min.
At the end of the PCR, reactions were pooled and purified using
QIaquick PCR purification kit (QIAGEN). Then, the eluted DNA samples
were run on a 2% agarose gel and the 280 bp band was purified using the
QIAquick Gel extraction kit (QIAGEN). Illumina libraries were generated
from 10 ng of DNA, which was blunt-ended, phosphorylated, and tailed
with a single 3′ A. An adapter with a single-base “T” overhang was
added and the ligation products were purified and amplified to enrich
for fragments that have adapters on both ends.
Libraries with distinct adapter indexes were multiplexed and sequenced
(50 bp paired-end mode) on a Novaseq 6000 sequencer.
Single-cell library and GBC sublibrary preparation (scRNA or snRNA assays)
Single-cell suspensions (500–1000 cells/μl) were mixed with reverse
transcription mix using the 10x Genomics Chromium Single-cell 3′
reagent kit protocol V2 (T0 MDA-MB-231 TGL and paclitaxel time-course
exp2) or V3.1 (T1, paclitaxel time-course exp1) and loaded onto 10x
Genomics Single-Cell 3′ Chips ([355]www.10xgenomics.com). Libraries
were generated as per manufacturers’ instructions and sequenced on
Illumina Novaseq 6000 Sequencing System (with a single- or
dual-indexing format according to the manufacturer’s protocol V2 or
V3.1). We aimed at a coverage of 50 K reads/cell in each sequencing
run. Multiome experiments were performed with the Chromium Single-Cell
Multiome ATAC + Gene Expression Reagent Kits (V1). Nuclei suspensions
(2000 nuclei/μL) were transposed and loaded onto Chromium Next GEM Chip
J Single-Cell. Libraries were generated as per manufacturers’
instructions and sequenced on Illumina NOVAseq 6000, aiming at 50 K RNA
and 50 K ATAC reads/cell. To enrich for GBC reads, in a final volume of
50 μL, we amplified by PCR the Perturb-library cassette from 5 ng of
the amplified cDNA (as in ref. ^[356]18) using 0.3 μM of dual-indices
primers (forward: 5′-AATGATACGGCGACCACCGAGATCTACACCTCCAAGTTCACACTC
TTTCCCTACACGACGCTCTTCCGATCT-3′; reverse: 5′-CAAGCAGAAGACGGCATACGAG
ATCGAAGTATACGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAGCAAACTGGGGCACAAGC-3′)
and amplified using Q5 2X master mix (M0541S, NEB) according to the
following protocol: (1) 98° for 10 s, (2) 98° for 2 s, then 65° for 5 s
and 72° for 10 s (25 cycles), (3) 72° for 1 min. The fragment band of
the expected length (350–425 bp) was purified using EGEL 2% Power Snap
Electrophoresis System (Thermo-Scientific) and checked at bioanalyzer
before sequencing.
sgRNAs list
CRISPRi sgRNAs sequences targeting the two putative enhancers of COL6A1
and COL6A2 were designed using the web tool CRISPick (Broad Institute).
The design region was defined by merging the ATAC module regions with
the overlapping H3K27Ac signal from Encode (as shown in Fig. [357]5b)
and exploiting the CRISPRi Range format for unstructured targeting
provided by CRISPick. Inputs were: Enh_1
[358]NC_000021.9:+:46048857-46050195 and Enh2
[359]NC_000021.9:+:46052113-46053945. Selected sgRNAs were named
according to the position relative to the start of the design window.
We selected sgRNAs to cover the entire design window, choosing the
sequences with higher on-target activity and filtering out those with
potential off-target complementarity. As a negative control, we
included three scrambled-sequence-sgRNAs for the CRISPRi experiments
selected from the previous genome-wide CRISPRi screening library
designed by the Weissman lab^[360]77. The sgRNAs chosen to target the
promoters of COL6A1 and COL6A2 were also chosen from the same study.
The list of sgRNAs sequence is reported in Supplementary Data [361]18.
Genetic barcode analysis
Genetic barcode calling
A GBC library is built for each sample separately, starting from the
FASTQ files, in two steps. First, the 18nt-long sequences located in
the GBC locus are extracted using seqkit amplicon command from seqkit
v2.1.0^[362]78, with either 23–40nt- (DNA reads) or 12–29nt-long (cDNA
reads) flanking regions and allowing one mismatch. A set S of sequences
of length 18 is obtained and each s
[MATH: ∈ :MATH]
S is assigned a weight w(s), corresponding to its frequency in the
FASTQ file. Note that, since the relative GBC abundance in a sample is
unknown, not accounting for sequencing errors can result in an inflated
estimate of sample clonality. Only sequencing errors in the form of
mismatches are considered. The underlying assumption is that, if s,s′ ∈
S are sequenced from the same GBC species and carry d and d′
mismatches, respectively, and that d < d′, then w(s) ≥ w(s′). Thus,
each GBC species i is associated with a subset S[i] ⊆ S, where the true
GBC sequence is the s ∈ S[i] with maximal weight. The frequency of i is
the cumulative weight of the sequences in S[i]. We infer the set of
“true” GBC sequences and simultaneously correct for sequencing errors
with the following procedure. Let G = (V, E, w) be an undirected graph,
where V = S, E is the set of edges connecting sequences at Hamming
distance ≤D, and w(s) is the frequency of s. We iteratively detect a
collection of disjoint subgraphs G[i] = (S[i], E[i]) induced by S[i],
called stars, where a node s ∈ S[i] is the hub and all the other nodes
are neighbours of s. Stars are defined according to the following
conditions: (i) the hub s has maximum weight w(s) in S[i], (ii) the
cumulative weight across stars is maximal, and (iii) the fraction of
neighbours of s in G that do not belong to S[i] is < f, where 0 ≤ f < 1
is a fixed parameter. We compute a heuristic solution with a greedy
approach. First, nodes are ordered by non-increasing weight. At each
iteration, a new star is created, whose hub is the first node in the
list and the other nodes are its neighbours, and the included nodes are
removed from the list. The procedure ends when the first star that
violates (iii) is found. We set D = 1 and f = 0.2. The final set of
true GBC sequences is defined as the collection of detected hubs and
their frequency is the cumulative weight across stars. We approximate
the clone content of bulk DNA-Seq sequencing samples as the set of GBCs
and their associated frequencies.
Definition of tumour-initiating clones and drug-tolerant clones
In bulk DNA-seq samples, we approximate clone content with GBC content
and clone abundance with count per million reads (CPM). Clones whose
frequency significantly differs between conditions are determined as
follows. For each clone, we test if the CPM in the condition sample
(treated or untreated) is significantly greater compared to the average
across control samples (parental), using a Fisher’s exact test. The
universe is defined as the union of all clones across control and
condition samples. P values are adjusted using the Bonferroni
correction method. Clones with adjusted p value < 0.05 in at least
[MATH:
⌈12n⌉+1
:MATH]
condition samples are labelled TICs (if condition is “untreated
tumour”) or DTCs (if condition is “treated sample” or “treated
tumour”), where n is the number of condition samples.
Single-cell RNA-seq data analysis (SUM159PT)
Cell barcode calling
A GBC reference is defined as the union of the GBC species obtained
from cDNA reads, as described in the section “Genetic barcode calling”,
across all samples in the same experiment. Read alignment, UMI
counting, cell-containing barcode (CB) calling, and GBC counting are
performed using cellranger count from CellRanger v6.0 on the human
reference genome GRCh38 v2020-A and the GBC reference. Parameter
--expect-cells is set to 7500 (T0), 20,000 (T1 and exp1), and 3000
(exp2). Feature-CB count matrices are obtained, where features denote
either genes or GBC species, and entries are UMI counts. Cellular
barcodes with <5000 (T1 and exp1) or 10,000 (T0, exp2) gene UMIs are
filtered out. Data post-processing is done with R^[363]79.
MULTI-seq sample demultiplexing
Samples belonging to T1 and exp1 are demultiplexed using MULTI-seq
barcodes (MBCs), as follows. MBC-containing reads are aligned to the
reference barcode sequences using the R package deMULTIplex
v1.0.2^[364]76. MBC-CB count matrices are obtained. Each MBC univocally
identifies a sample, and a sample can be labelled with multiple MBCs.
The automatic quantile-based thresholding procedure implemented in
deMULTIplex fails to assign a unique MBC label to most of cellular
barcodes in exp1, thus a custom demultiplexing procedure is used for
both experiments. First, all MBCs with UMI count <2 are removed from
every CB. For every cellular barcode i, an MBC j is marked as detected
if c[j] ≥ 15, c[j]/c[max] ≥ 0.5, and p < 1e − 5, where c[j] is the UMI
count of j in i, c[max] is the UMI count of the top abundant MBC in i,
and p is the probability of observing a value equal or greater than
c[j]given a Poisson distribution with λ = average UMI count of j across
cellular barcodes in the sample. Only cellular barcodes with exactly
one detected MBC are retained and assigned to the corresponding sample.
Clone detection
Expressed GBCs in CBs are identified using the same procedure applied
to MBCs and described in the section “MULTI-seq sample demultiplexing”,
using c[j] ≥ 10 and c[j]/c[max] ≥ 0.3. A p value threshold is set only
for samples d11–13–15 for exp1 (p < 1e − 5) and sample d17 for exp2
(p < 1e − 10). Note that GBC frequency at earlier time points is very
low, hence the p value criterion has no effect. CBs can be assigned 0,
1, or >1 GBCs, the latter being an effect of multiple cell
encapsulation (doublets) or multiple GBCs infecting the same cell
(co-infection). We extract clone information from multi-GBC CBs by
distinguishing co-infection and doublet events. High UMI doublets are
removed using a sample-specific cutoff. Assuming that GBC expression is
approximately constant across GBC species and across cells, differences
in GBC UMI count are essentially due to droplet-specific mRNA capture.
Moreover, the transcriptome size of a cell line model should be
constant as well. We deduce that a single cell infected with k GBCs
should display a higher GBC UMI fraction compared to that of multiple
cells encapsulated in the same droplet that jointly account for k
infection events. For simplicity, GBC species for multiple infection
events within the same droplet are assumed to be pairwise distinct. The
procedure works as follows. First, each CB set with k expressed GBCs is
sorted by non-increasing values of c[i]/u[i], where c[i] and u[i] are
the GBC UMI count and the gene UMI count for CB i, respectively. We
obtain a list of sets S[2], …, S[m], where m is the maximum number of
distinct GBC species expressed in a CB. Then, we iteratively classify
the CBs of each set S[k] separately, starting from the smallest k. A CB
i is labelled as co-infection in two cases: (a) all GBCs expressed by i
are also expressed in at least five CBs, including i, or (b) the
“doublet probability” of i (i.e., the cumulative sample frequency of
each single GBCs expressed in i) is below a sample-defined threshold.
If neither (a) nor (b) hold, i is labelled as doublet. The procedure
continues until the fraction of co-infection events in multi-GBC CBs is
≥1-D, where D is the expected doublet fraction in multi-GBC CBs. D is a
sample-specific doublet rate estimate based on 10× guidelines and the
number of called CBs. The clone pool of a single-cell sample is defined
as the collection of single GBCs that occur in single infection and
doublet events with k = 2 GBCs, plus all multi-GBC sets from
co-infection events.
Clone comparison between single-cell and bulk samples
To assess the accuracy of clonality estimates from bulk samples, we
compare the GBC species content between bulk DNA and single-cell RNA
GBC sequencing samples from the same condition. We define the frequency
of a GBC species in a bulk sample by CPM. Then, for a given pair (X[b],
X[s]) of bulk and single-cell samples, we define the value y = f(x) as
the fraction of GBC species with frequency ≥x in X[b] found as
expressed in at least one cell of X[s]. In practice, we compute f(x) in
steps of 20 CPM units. Consistent clone estimates in bulk and
single-cell samples in a condition result in a monotonically
non-decreasing trend of f(·). A cell c is classified as
tumour-initiating or drug-tolerant (see section “Definition of
tumour-initiating clones and drug-tolerant clones” above) if and only
if all GBCs expressed in c are tumour-initiating or drug-tolerant,
respectively. Conversely, the single-cell cluster labelling is
transferred to a bulk sample as follows: the GBC abundance (CPM) in the
bulk sample is first normalised by its average abundance at baseline;
then, the abundance of cluster C in the bulk sample is calculated as
the contribution of all GBCs expressed by any cell in C.
Quality filtering and normalisation
All CBs with no expressed GBCs or classified as doublets are removed
from subsequent analysis. Samples from technical replicates (same vial)
are concatenated and processed using Seurat v4.0.5^[365]80. UMI counts
are added a pseudocount = 1, divided by library size, multiplied by
10,000, and log-transformed (natural logarithm), to obtain
log-normalised counts.
Dimensional reduction and clustering
Highly variable genes detection is performed on log-normalised counts
using two different approaches: min.var.plot and vst. We set xmin = 0.1
and xmax = 10 for min.var.plot. The input parameters for each algorithm
are let vary in a pre-defined set: dispersion.cutoff in {0.5, 1, 1, 5}
for min.var.plot, and nfeatures in {1000, 2000, 5000} for vst. A
cell-cycle score is computed using the Seurat function CellCycleScoring
with default parameters. We observe a high cell-cycle effect in
parental samples (T0 and T1); hence, the cell-cycle score computed as
above is regressed out before scaling and centreing. PCA is performed
on z-scores of the log-normalised counts on the reduced space of highly
variable genes. A total of 50 PCs is computed. The optimal number n of
PCs to retain for clustering is defined as the minimum n such that the
standard deviation explained by the nth PC exceeds 50% of the average
across the 40–50th PCs. Multiple Louvain clustering runs are performed
on the selected PCs, for each highly variable gene set, by varying the
number of neighbours k in the knn graph and the resolution parameter r
in a pre-defined set: k ∈ {30, 40, 50} and r from 0.1 to 0.8 in steps
of 0.1. A second and third round of highly variable gene selection,
PCA, and clustering are possibly performed after the removal of small
clusters with very low UMI count, found in specific solutions, until
the average UMI count is homogeneous across clusters. We redo the whole
clustering procedure instead of just removing the small, low-quality
clusters, because they are usually outliers in the expression profile
of the sample and can affect the definition of highly variable genes.
We obtain 9395 and 13,562 cells for T0 and T1, and 7884 and 10,698
cells for exp1 and exp2, respectively. For T0 and T1, (i) a silhouette
score is computed for each clustering solution on the Euclidean
distances calculated on the same PCs used for clustering and (ii) a
ROGUE score^[366]81 is computed for each clustering solution on the UMI
counts, using default parameters, except for span =0.6. The clustering
solution with the highest silhouette score is defined as optimal. To
increase the number of detected clusters, we set a higher value for the
resolution parameter, while maintaining the other parameters of the
optimal solution unchanged (the highly variable gene set and the number
of neighbours in the knn graph). The final clustering solutions for T0
and T1 are obtained with vst method by setting nfeatures = 1000 and
clustering parameters k = 40 and r = 0.5, resulting in seven clusters
for both experiments. Resolution 0.5 turned out to be a good compromise
between silhouette width and ROGUE score on both T0 and T1. For exp1,
we obtain three clusters using the following settings: vst method,
nfeatures = 5000, k = 30, and r = 0.1. UMAP reduction^[367]82 is run
with RunUMAP using default parameters on the same input used for
clustering.
Differential expression analysis
A differential expression analysis is run between conditions via
FindMarkers function from Seurat with MAST method^[368]83 accounting
for sample ID as a covariate. We only test for genes detected (UMI
count > 0) in at least 10% cells in either condition such that
|log[2](FC)| ≥ 0.25. A gene is defined as differentially expressed if
its adjusted p value is < 0.05 (Bonferroni correction) between the two
conditions.
Computation of cell–cell distance distributions
To evaluate the relationship between clonality and gene expression at
basal state, we compare sister cells (same clone) and non-sister cells
(different clones) at T0 and T1. We note that a sister cell similarity
measure within the same experiment would be biassed towards clones with
vial-specific frequency >1, thus we opt for a comparison between
experiments (1886 common clones). Cells of T0 and T1 are projected on a
common latent space using canonical correlation analysis (CCA)^[369]84,
using the intersection of highly variable gene sets (defined as in the
section “Dimensional reduction and clustering”) between experiments. By
definition, this integration approach removes dataset-specific
complexity by keeping only shared correlations, hence the contribution
of the experiment covariate to cell–cell similarity should be minimised
compared to more conservative approaches. Inter-sample cell–cell
Euclidean distances are computed on the CCA latent space on sister cell
pairs and a random subset of non-sister pairs of the same size. A
Gaussian kernel density estimation is computed for sister and
non-sister distances separately to obtain a global distance
distribution. Then, to evaluate the relationship between and within the
DTC and the non-DTC pool before and after treatment, we computed the
Euclidean distances between cells at T1 (baseline) and in exp1 (after
treatment); finally, we evaluated the relationship between clonality
and gene expression on the treated condition by computing intra-sample
Euclidean distances at day ≥13 on the PC space of exp1 and exp2, as
defined in the section “Dimensional reduction and clustering”. We
obtained Gaussian kernel density estimates for both cell–cell distance
matrices as above.
Computation of clone sharedness and detection of cell subpopulations
To assess whether sister cell similarity is uniform across clones or is
rather clone-specific, we introduce a clone sharedness score across
clusters (see section “Dimensional reduction and clustering”). It is
defined, for each pair of clusters i and j in T0 and T1, as cs(i,j) =
(n[ij]·n′[ji])/(N[i]·N′[j]), where n[ij] (n′[ji]) is the number of
cells in i (in j) with at least one sister in j (in i) and N[i] (N′[j])
is the total number of cells in i (in j). Each cluster i in T0 is
matched with the cluster j^max in T1 with maximum clone sharedness with
i, namely, j^max = arg max[j] cs(i,j). The three clusters belonging to
maximal clone sharedness pairs (i,j^max) are defined as subpopulations
and labelled by non-increasing value of cs(i,j^max) as S1, S2, and S3.
Transcriptionally stable clones are defined as the ones only found in
one of the three subpopulations in both T0 and T1, they are 437 in
total. This approach is independent of single-cell data integration,
whose accuracy is difficult to assess when reliable markers are
unknown^[370]85,[371]86. We define the gene signature of S ∈ {S1, S2,
S3} as the set of genes that are differentially upregulated between S
and its complement in both T0 and T1. We obtain 109, 29, and 27 genes
for signatures S1, S2, and S3.
Definition of clone–clone pair propensity
We define a pair propensity score to measure the pairwise gene
expression similarity bias between clones. For each cell, we consider
its directed neighbourhood of size k, i.e., each cell has exactly k
nearest neighbours and the neighbour relationship is not symmetric.
Given two clones labels i and j, possibly identical, the observed (i,j)
frequency f[ij]^obs is the number of cell pairs (c[i],c[j]) such that
c[i] and c[j] belong to clones i and j, respectively, and c[j] is a
neighbour of c[i]. The expected (i,j) frequency f[ij]^exp is calculated
based on the frequency of clones i and j and the neighbourhood size k,
as follows. Given the neighbourhood of c[i], the probability that a
given cell c ≠ c[i] in the neighbourhood is labelled with j is given by
(n[j] − I(i,j))/(N − 1), where n[j] is the number of cells labelled
with j, N is the total number of cells in the sample, and I(i,j) is the
indicator function (I(i,j) = 1 if i = j and I(i,j) = 0 if i ≠ j). We
obtain f[ij]^exp = n[i]·k·(n[j] − I(i,j))/(N − 1), where n[i] is the
number of cells labelled with clone i. The pair propensity of i and j
is defined as p[ij] = f[ij]^obs/f[ij]^exp. By iterating across all
clone pairs, we obtain a non-symmetric n × n matrix P, where n is the
number of distinct clone labels. The value of p[ij] indicates the
propensity of clones i and j to be closer (p[ij] > 1) or farther
(p[ij] < 1) from each other in the sample than expected by chance,
where p[ij] = 1 denotes random association. Finally, we obtain a
symmetric matrix
[MATH:
P′=(P
TP)12
:MATH]
.
Detection of clone lineages
We use the above pair propensity definition to find groups of clones
with mutually similar gene expression profiles on the treated condition
at day ≥13. We first define the neighbour relationship, as follows. The
top 1000 highly variable genes are computed with vst method on the
union of the time points in exp1 and exp2, respectively. For each
sample, a knn graph with k = 40 is built on the top n PCs (see section
“Dimensional reduction and clustering”). Matrix P′ (see section
“Definition of clone–clone pair propensity”) is computed on the top 20
highest frequency clones, clones i such that p[ii] < 1 are discarded,
and the matrix obtained is clustered using the hclust function with
ward.D2 method on Euclidean distances. We obtain two clusters A and B
that serve as “anchor” to add the remaining clones. Clones not yet
included in A nor in B are sorted by non-increasing frequency and we
iteratively add one clone at a time, as follows. The first clone i in
the ordering is considered and a pair propensity p[iA] (p[iB]) is
computed between i and the union of clones in A (in B). If p[iB] < 1
and p[iA] > p[iB] + 1, then i is added to A (likewise for B), otherwise
it is discarded. The procedure continues until the last clone in the
ordering is considered. Starting from the sets A and B computed on each
of the four samples, we define lineage L1 (lineage L2) as the set of
clones always assigned to A (to B) and detected at least once in both
exp1 and exp2. We detect 11 clones for L1 and 17 clones for L2. We
define the gene signature of L1 (of L2) as the set of genes that are
differentially upregulated (downregulated) between cells in L1 and L2
in both exp1 and exp2. We obtain 42 and 46 genes for L1 and L2.
Functional annotation
We perform pathway enrichment analysis (REACTOME^[372]87) as follows.
Genes with official gene symbols are first converted to ENTREZ
identifiers, using limma v3.49.5^[373]88 and clusterProfiler
v4.1.4^[374]89 packages, respectively. Enrichment significance is
calculated with a Fisher’s exact test using the enrichPathway function
from clusterProfiler. Benjamini–Hochberg method is used to correct for
multiple testing. We perform Gene Set Enrichment Analysis
(GSEA^[375]90) as follows. Given a query case-control, the list of
genes (universe) is ordered by non-increasing log[2](FC) values; GSEA
is run with GSEA function from clusterProfiler, with nPerm = 1000
sample permutations. MYC activity was computed with ModuleScore
function from Seurat on T0 and exp1 cells using HALLMARK_MYC_TARGETS_V1
and HALLMARK_MYC_TARGETS_V2 ([376]www.gsea-msigdb.org) as MYC target
gene lists.
Single-cell RNA-seq data analysis (MDA-MB-231 TGL)
Read alignment, UMI counting, and CB calling are performed as for
SUM159PT cells, with parameter --expect-cells set to 3000. CBs with
< 200 detected genes (UMI > 0) and genes detected in < 3 cells are
filtered out. Normalisation, scaling, cell-cycle regression,
dimensional reduction, clustering and differential expression analysis
are performed as described for SUM159PT cells. The final clustering
solution is obtained with vst method by setting nfeatures = 1000 and
clustering parameters k = 40 and r = 0.5, resulting in seven clusters.
Single-cell Multiome data analysis
Cell barcode calling
A GBC reference is defined as the union of the GBC sets computed from
the scRNA-seq reads, as described in the section “Genetic barcode
calling”. Read alignment, UMI count, peak calling, and CB calling are
performed using cellranger-arc count from CellRanger v6.0 on genome
reference GRCh38 v2020-A-2.0.0. GBC counts are extracted using
cellranger count, setting --expect-cells to 6200 (M0_1) and 8800 (M0_2)
and --include-introns. CBs with < 10,000 UMIs are filtered out.
Clone detection and quality filtering
Expressed GBCs in CBs are identified using the same procedure applied
to scRNA-seq samples and described in the section “Clone detection”,
using c[j] ≥ 10 and c[j]/c[max] ≥ 0.3, and no p value threshold. All
CBs with no expressed GBCs or classified as doublets are removed from
subsequent analysis. We obtain 2446 and 2377 nuclei for M0_1 and M0_2,
respectively.
Gene expression analysis
Cell clusters are defined using gene expression information only, on
M0_1 and M0_2 separately. UMI count normalisation is calculated as for
scRNA-seq datasets with Seurat v4.0.6. Highly variable genes are
selected using FindVariableFeatures function from with method = vst and
nfeatures = 5000. PCA is performed on the set of highly variable genes
using RunPCA function with default parameters. The first 30 PCs are
used to compute a Shared Nearest Neighbours graph via the FindNeighbors
function, then FindClusters function is used to cluster the cells,
using the SLM algorithm with resolution ranging between 0.2 and 1.2 in
steps of 0.1. A silhouette width is calculated for each clustering
solution on the Euclidean distance matrix computed on the same input
used for clustering. Subpopulations S1, S2, and S3 in the two
replicates are detected by matching the clusters to T0 and T1 using the
sharedness score (see section “Computation of clone sharedness and
detection of cell subpopulations”). The selected resolution value is
the one that maximises the average silhouette width while keeping the
three subpopulations distinct. We select resolution 0.4 and 0.6 for
M0_1 and M0_2, respectively, and obtain 7 and 6 clusters. UMAP
projection is calculated using RunUMAP with default parameters on the
same input used for clustering. Cluster markers are extracted with the
Wilcoxon rank-sum test implemented in FindAllMarkers function using
default parameters. Differentially expresseSingle-cell RNA-seq data
analysis (SUM159PT)”. Subpopulation gene signatures are defined as for
the scRNA-seq samples.
Chromatin accessibility analysis—data preprocessing
The region-cell count matrices, where regions are ATAC peaks and
entries are fragment counts, are processed as follows. Raw fragment
counts are normalised using term frequency-inverse document frequency
(TF-IDF), which assigns higher importance to highly cell-specific
regions. The active regions (fragment count ≥ 1) in at least ten cells
are selected using the FindTopFeatures function from Signac
v1.7.0^[377]91. Latent semantic indexing (LSI) is applied to reduce the
dimensionality of the dataset. We compute 50 LSI dimensions. We keep
dimensions from 2 to 50, as dimension 1 shows a positive correlation
with sequencing depth. UMAP projection is calculated on the same LSI
space, using default parameters.
We define pairs of regions associated with the same transposition event
between replicates as those lying at a distance ≤ d on the genome.
findOverlap function from GenomicRanges v1.46.1^[378]92 is used to
determine such pairs, by varying the gap size (i.e., the value of d)
from −1000 (overlap) to 1000 (padding) in steps of 10. Depending on d,
each region in M0_1 can have 0, 1 or multiple matching regions with
M0_2, and vice versa. The idea is that a low d would fail to recognise
regions stemming from the same transposition event, whereas a high d
would result in many spurious matches. Low (high) values for d result
in few (many) multiple matches. The value of d is chosen such that both
u[d] and u[d]/(N[d] −u [d]) are maximised, where u[d] is the number of
unique matches between M0_1 and M0_2 and N[d] the total number of
matches. We select d = 0 and obtain 120,414 and 120,377 matched regions
in M0_1 and M0_2, respectively.
Chromatin accessibility analysis—topic modelling
We use a topic modelling approach to group regions that are
consistently open in the same sets of cells while reducing data
sparsity. Given a region-cell matrix, topic modelling outputs a set of
topics, where each region has a probability of being assigned to a
topic and each topic has a probability of being assigned to a cell.
Modelling chromatin accessibility in this way has three advantages:
first, the number of topics is typically orders of magnitude smaller
compared to the number of regions; second, a cell’s epigenome is
expressed as the contribution of multiple topics; third, the importance
of a region in a cell is interpretable, namely, as the combination of
the weight of the region in a topic and the contribution of that topic
in the cell’s epigenome.
Specifically, we use the Latent Dirichlet allocation (LDA) model
implemented in cisTopic v0.3.0^[379]56 on each replicate separately, on
the set of matching regions between the two replicates (both unique and
multiple matches, see section “Chromatin accessibility analysis—data
preprocessing”). The input to cisTopic is a raw region-cell count
matrix, binarised by setting a cutoff at one fragment per region per
cell. cisTopic is run using a total number of 1000 iterations and a
burn-in period of 500 iterations to the Collapsed Gibbs Sampler. The
procedure is repeated by varying the number of output topics n between
10 and 50 in steps of 10. We select the maximum log-likelihood solution
and obtain n = 40 for both M0_1 and M0_2. Topics with Pearson
correlation coefficient >0.5 between topic-cell probability and
fragment count are discarded. We select 31 topics for M0_1 and 34
topics for M0_2.
Chromatin accessibility analysis—detection of ATAC modules
To select the topics that represent robust biological signals, we
compute a topic reproducibility score between replicates, as follows.
To this aim, an IDR^[380]93 is calculated between every pair of topics
in M0_1 and M0_2, respectively, on region-topic probabilities, using
idr v2.0.3 tool with parameters --peak-merge-method max --idr-threshold
0.05 --max-iterations 100. Given topics t[1] and t[2] in M0_1 and M0_2
and a region r, the IDR statistic expresses the probability that the
region-topic probabilities of r in t[1] and t[2] are different. We say
that r is reproducible between t[1] and t[2] if IDR < 0.05. For each
topic pair (t[1],t[2]) in M0_1 and M0_2, we define the reproducibility
score as the weighted mean between the number of reproducible regions
and the 75th percentile of min {−125 log[2](IDR), 1000} across those
regions. We define the sets of regions with IDR < 0.05 in topic pairs
with reproducibility score > 0 as ATAC modules. For each cell and for
each module, we compute a module AUC (Area Under the Receiver Operating
Characteristic curve) score, where the TF-IDF value is the predictor
variable and the membership to the module is the response variable,
using the auc function from pROC v1.18.0^[381]94 and direction “<”. The
set of regions resulting from the union of all ATAC modules were
annotated using the ENCODE registry of Regulatory Elements v2
(cCRE)^[382]95 with a minimum overlap of 1 bp.
Relationship between epigenetic modules and subpopulations
We detect the ATAC modules explaining specific cell subpopulations as
follows: for each ATAC module, we compute an AUC score where the
predictor variable is the module AUC defined above and the response
variable is either the subpopulation membership (S1, S2, S3) or the
phenotype (TIC,DTC). Then, we compute Spearman’s ρ between the module
AUC for each of the top 40 reproducible ATAC modules and each gene
among those detected in at least 20 cells in the highly variable gene
pool (vst method, nfeatures = 5000) across all nuclei. We annotate each
gene according to whether they belong to S1, S2, and S3 signatures
(obtained from both scRNA-seq and Multiome gene expression data) or are
annotated as human TF by Uniprot^[383]96 (GO:0003700, taxon = human,
gene product type = protein; 1435 TF in total). To identify putative
cis-regulatory regions, we extract the genes in the transcriptional
signatures of S1, S2, and S3 that are proximal to reproducible regions
in epigenetic modules, with a neighbourhood size of ±50 kbp around the
region, using findOverlaps function from GenomicRanges package.
Identification of enriched TFs
For a given module, we extract two sets of genes according to the
following criteria: (i) Spearman’s ρ ≥ 0.2 between the module AUC and
gene expression across nuclei (see above), or (ii) gene locus lying
≤100 kb away from any region in the module, as by GREAT 4.0.4^[384]97
analysis, using the basal plus extension gene regulatory domain
definition. These two sets are separately used as input to ChEA
v.3^[385]98 and the output ranked list of enriched TFs is obtained.
Whole-exome sequencing data analysis
Read alignment, variant calling, and extraction of reproducible CNVs
Reads were aligned with BWA (-t 16) v0.7.17^[386]99 and CNVs were
called with CNVkit v0.9.8^[387]100 with default parameters, using the
parental SUM159PT cells as a normal reference sample and providing the
appropriate Agilent bed file (v5 or v7) as target. The chromosomal CNVs
detected with CNVkit were retrieved. CNVs were intersected across
replicates using bedtools intersect from bedtools v2.30.0^[388]101,
requiring ≥1 bp overlap. Only regions covered by all replicates are
retained. A CNV consensus is then defined as the set of regions that
span >80% of a CNV in each replicate.
Comparison with single-cell sequencing data
Regions from scATAC-seq assays of M0_1 and M0_2 are intersected with
CNV consensus regions using subsetByOverlap function from GenomicRanges
v1.45.0 with default parameters. Genes from scRNA-seq assays are
intersected with CNV consensus regions using the same approach but
allowing ±100 kbp around the CNV consensus regions. Gene loci
coordinates are extracted from the annotation used in the single-cell
Multiome analysis.
Bulk RNA-seq analysis of SUM159PT and tumours
Reads were trimmed, filtered and aligned using STAR v2.7.3^[389]102.
Read count extraction and TPM normalisation were performed using
FeatureCounts. The TM4SF1^high signature, consisting of 433 DEGs, was
defined using edgeR (within Galaxy v3.36), considering the sorting
batch as a factor, and selecting genes with log[2](FC) > 1 and p
value < 0.05 (N = 3 independent sorting batches). The functional
analyses of TM4SF1^high DEGs were generated through the use of IPA
(QIAGEN Inc.). Differential gene expression analysis in tumours vs 2D
samples was performed with the Bioconductor Deseq2 package
v1.34^[390]103 using default parameters.
Analysis of gene signatures (S1, S2, S3) in bulk TCGA and METABRIC datasets
RNA expression data from primary bulk breast cancer patients were
retrieved from Cbio Cancer Genomics Portal^[391]104, selecting as
studies METABRIC and TCGA and as Genomic Profiles “mRNA expression”
(METABRIC: “mRNA expression z-scores relative to all samples (log
microarray)”; TCGA: “mRNA expression z-scores relative to all samples
(log RNA Seq V2 RSEM)”). Only complete samples were considered for this
analysis. Molecular subtyping information was retrieved from the Cbio
Portal. For each signature, the sum of the z-scores of all the
expressed genes (genes expressed in <50% of the samples were excluded
from the analysis) was calculated. Then, patients were stratified
according to score quartiles (1st quartile, S1-high; 2nd and 3rd
quartile, S1-mid; 4th quartile, S1-low; likewise for S2 and S3). The
significance of the association with SUM159PT-derived signatures was
calculated with a Chi-square test. Individual genes of the signatures
were also analysed through multivariate linear correlation and
visualised by a colour map of clustered correlations (K-means) using
JMP17.2 (SAS).
Comparison with single-cell sequencing datasets from the literature
Triple-negative breast cancer (TNBC) scRNA-seq
Concatenated scRNA-seq data from primary TNBC samples were retrieved
from [392]GSE161529^[393]50. Filtered cell-count matrices were
processed as in the section “Quality filtering and normalisation”.
Epithelial cells were defined by normalised EPCAM expression greater
than the local minimum between the top two local maxima in the
distribution. Cells with UMI count >20,000 were removed, resulting in
9063 epithelial cells for downstream analysis. Highly variable gene
detection, PCA, and clustering are performed with Seurat_preprocessing
function from Scissor 2.0.0^[394]105. We obtained nine clusters.
Pseudo-bulk positive and negative phenotypes are generated for each
subpopulation (S1, S2, S3) and its complement, respectively, on each
scRNA-seq sample at baseline separately (T0_1, T0_2, T1_1, T1_2),
resulting in n = 8 samples/phenotype pair (4 replicates/condition).
Scissor is run on the single-cell dataset for each of the three
phenotype pairs. Parameter alpha for the logit regression is let vary
in {0.7, 0.8, 0.9}, until ≤15% phenotype-associated cells (both
positive and negative) are detected (cutoff = 0.15). To quantify the
enrichment of S1, S2, and S3 signatures across clusters, we computed
the odds ratio for each signature–cluster pair. UMAP is run with
default parameters on the top 10 PCs.
Curated Cancer Cell Atlas (3CA)
We obtained the top 50 significant genes for each of the
meta-programmes (MP) associated to malignant cells^[395]51. Gene
symbols were first converted to synonyms to match the gene symbols in
the gene-cell expression table, using UpdateSymbolList function from
Seurat. We computed a joint gene expression for each gene list, using
the ModuleScore function. Finally, we computed the area under the curve
AUC(C,M), where the predictor variable is the ModuleScore associated to
the meta-programme M and the response variable is true when a cell
belongs to C, false otherwise.
Pancreatic ductal adenocarcinoma (PDAC) scRNA-seq
Association with subclonal dissemination for 2010 genes was retrieved
from Simeonov et al.^[396]20. To map S1 and S3 signature genes, murine
gene symbols were converted to human gene symbols with limma v3.49.5.
Lung adenocarcinoma scATAC-seq
Pre-metastatic gene scores for 20564 genes for Module 9 (RUNX2) were
retrieved from LaFave et al.^[397]52.
Statistics & reproducibility
No statistical method was used to predetermine the sample size, which
was comparable to that reported in previous studies. Regarding
single-cell experiments, the required number of individual single-cell
profiles was determined to capture a sizable portion of the total
number of GBCs, which was first assessed by bulk DNA sequencing.
Statistical significance was measured as indicated in the figure
legends. No data were excluded from the analysis performed in vivo. One
single-cell time-course batch (MULTI-seq) was removed due to poor
library quality. All experiments were replicated at least twice. The
experiments were not randomised, as the study was performed on uniform
biological material (i.e., a commercial cell line). For comparative in
vivo experiments (e.g., TM4SF1 high vs bulk), animals were allocated
randomly into the experimental groups. The investigators were not
blinded to allocation during experiments and outcome assessment for the
in vivo experiments (e.g., TM4SF1 high vs bulk).
Reporting summary
Further information on research design is available in the [398]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[399]Supplementary Information^ (4.8MB, pdf)
[400]Peer Review File^ (2.4MB, pdf)
[401]41467_2024_51424_MOESM3_ESM.pdf^ (39.8KB, pdf)
Description of Additional Supplementary Files
[402]Supplementary Data 1^ (4.6MB, xlsx)
[403]Supplementary Data 2^ (156.9KB, xlsx)
[404]Supplementary Data 3^ (124.6KB, xlsx)
[405]Supplementary Data 4^ (496.3KB, xlsx)
[406]Supplementary Data 5^ (350.6KB, xlsx)
[407]Supplementary Data 6^ (179.3KB, xlsx)
[408]Supplementary Data 7^ (5.4MB, xlsx)
[409]Supplementary Data 8^ (9.4MB, xlsx)
[410]Supplementary Data 9^ (115.2KB, xlsx)
[411]Supplementary Data 10^ (222KB, xlsx)
[412]Supplementary Data 11^ (1.1MB, xlsx)
[413]Supplementary Data 12^ (219KB, xlsx)
[414]Supplementary Data 13^ (547.1KB, xlsx)
[415]Supplementary Data 15^ (257.6KB, xlsx)
[416]Supplementary Data 14^ (454.1KB, xlsx)
[417]Supplementary Data 16^ (531.2KB, xlsx)
[418]Supplementary Data 17^ (129.4KB, xlsx)
[419]Supplementary Data 18^ (15.5KB, xlsx)
[420]Reporting Summary^ (3MB, pdf)
Source data
[421]Source Data^ (4.3MB, zip)
Acknowledgements