Abstract

   Claudin-low breast cancers are aggressive tumors defined by the low
   expression of key components of cellular junctions, associated with
   mesenchymal and stemness features. Although they are generally
   considered as the most primitive breast malignancies, their
   histogenesis remains elusive. Here we show that this molecular subtype
   of breast cancers exhibits a significant diversity, comprising three
   main subgroups that emerge from unique evolutionary processes. Genetic,
   gene methylation and gene expression analyses reveal that two of the
   subgroups relate, respectively, to luminal breast cancers and
   basal-like breast cancers through the activation of an EMT process over
   the course of tumor progression. The third subgroup is closely related
   to normal human mammary stem cells. This unique subgroup of breast
   cancers shows a paucity of genomic aberrations and a low frequency of
   TP53 mutations, supporting the emerging notion that the intrinsic
   properties of the cell-of-origin constitute a major determinant of the
   genetic history of tumorigenesis.

   Subject terms: Bioinformatics, Epigenetics analysis, Methylation
   analysis, Gene expression analysis, Breast cancer
     __________________________________________________________________

   Claudin-low tumors are a rare aggressive subtype of breast cancers. In
   this study, the authors use a multiomics approach to demonstrate that
   these tumors are heterogeneous and comprise three main subgroups that
   emerge from different evolutionary processes.

Introduction

   Breast cancer is a highly heterogeneous group of diseases with variable
   biological and clinical behaviors. Providing early insights into this
   diversity, gene-expression profiling analyses initially resulted in the
   identification of four clinically relevant molecular subtypes, known as
   intrinsic subtypes (luminal A, luminal B, HER2-enriched and basal-like,
   according to the PAM50 classification) mostly corresponding to hormone
   receptor and HER2 status, and a normal breast-like group^[48]1–[49]4.
   Among these intrinsic subtypes, the basal-like subtype appears to be
   the most distinct, as it is characterized by the unique expression of
   cytokeratins typically expressed by the basal layer of the skin and a
   very low level of expression of luminal-related genes^[50]2,[51]5. This
   observation led to the hypothesis that breast cancers may arise from
   the transformation of two distinct cell types of origin or
   developmental stages of mammary epithelial cell development, one
   generating basal-like tumors and the other non-basal-like
   malignancies^[52]5. Although appealing in its simplicity, this model
   overlooks the potential reprogramming of lineage-restricted populations
   initiated by an oncogenic event or a microenvironmental signal over the
   course of tumor development^[53]6–[54]8. Moreover, it does not account
   for the intrinsic diversity of each breast cancer subtype, notably
   basal-like tumors known for their great heterogeneity^[55]9–[56]11. An
   additional intrinsic subtype of breast cancers, known as claudin-low,
   has recently been identified in human and mouse tumors and in breast
   cancer cell lines, showing several common features with basal-like
   tumors and reflecting the diversity of tumors with a low luminal
   differentiation status^[57]4,[58]12. Basal-like and claudin-low tumors
   form the majority of triple-negative breast cancers (TNBCs), an
   aggressive subgroup of breast malignancies defined as tumors lacking
   expression of the estrogen receptor (ER), progesterone receptor (PR),
   and HER2. A hallmark of the claudin-low subtype is the low expression
   level of critical cell–cell adhesion molecules, including claudins 3,
   4, and 7, occludin, and E-cadherin. Tumors of this subtype are highly
   enriched in mesenchymal traits and stem cell features and are therefore
   considered as the most primitive breast cancers^[59]13. Although in
   vivo preclinical data suggested that basal-like tumors arise from the
   transformation of a luminal progenitor (LP)^[60]14, the putative
   cell-of-origin of claudin-low tumors remains unknown. The prevailing
   hypothesis is that, over the course of tumor progression, basal cancer
   cells undergo an epithelial-mesenchymal transition (EMT) in response to
   acquired oncogenic events and/or microenvironmental signals, thereby
   gaining mesenchymal and stemness features^[61]15–[62]17. An alternative
   hypothesis is that claudin-low tumors originate from an early
   epithelial precursor with inherent stemness features^[63]4,[64]13. To
   gain further insight into the developmental origin of claudin-low
   tumors, we first sought to analyze their genomic architecture. Indeed,
   we have recently demonstrated that, whereas the oncogene-driven
   transformation of mature luminal (mL) cells and progenitor luminal
   cells triggers massive oncogene-induced DNA damage and an early onset
   of chromosomal instability (CIN), normal human mammary stem cells
   (MaSCs) can withstand an aberrant mitogenic activity^[65]18. The
   endogenous expression of the ZEB1 EMT-inducing transcription factor
   prevents replication and oxidative stress, leading to a process of
   malignant transformation in the absence of exacerbated genomic
   instability^[66]18. As basal-like breast cancers generally exhibit
   numerous genomic aberrations, we thus speculated that the extent of
   genomic aberrations might be used as a molecular indicator of the
   developmental origin of claudin-low breast cancers.

   To comprehensively characterize the claudin-low breast tumor subtype,
   we used a multi-omics approach that reveal the existence of three
   distinct subgroups with specific transcriptomic, epigenetic, and
   genetic characteristics, strongly supporting the hypothesis of
   different cells-of-origin.

Results

Selection of claudin-low tumors

   Claudin-low tumors exhibit marked immune and stromal cell infiltration,
   when compared with all other breast cancer subtypes^[67]4,[68]19. As
   gene expression analyses do not allow to accurately discriminate
   mesenchymal cancer cells from normal stromal cells, we used the
   allele-specific copy number analysis of tumors (ASCAT) copy
   number-based tumor purity estimation method to assess tumor cell
   fraction^[69]20 (Supplementary Fig. [70]1a). To avoid any substantial
   bias due to non-tumor cell contamination, a stringent purity threshold
   was determined by applying Wilcoxon test, for which the degree of
   contamination of claudin-low tumors was not statistically different
   from that of non-claudin-low tumors, thus enabling a comparable
   distribution of tumors (Supplementary Fig. [71]1b). Using this
   stringent threshold, 45 out of 152 tumors classified as claudin-low
   from Molecular Taxonomy of Breast Cancer International Consortium
   (METABRIC) were selected for further studies. Demonstrating an
   unexpected degree of diversity, only 35.6% of these tumors were
   classified as TNBCs (Supplementary Fig. [72]1d). This result was not
   due to the purity selection, as the percentage of TNBCs within all
   claudin-low tumors was similar before applying the purity threshold
   (Supplementary Fig. [73]1c). Of note, only four normal-like tumors were
   estimated as pure. Due to the lack of statistical power, they were
   excluded from subsequent analyses.

Identification of CNA-devoid claudin-low tumors

   We next analyzed claudin-low tumors using the stratification based on
   gene expression and copy number alterations (CNAs), previously defined
   by Curtis et al.^[74]21,[75]22 (Fig. [76]1a). As expected for
   aggressive neoplasms, basal-like tumors mostly belonged to the
   integrative cluster 10, characterized by multiple CNAs affecting most
   chromosomes. Luminal A and B tumors were distributed across clusters 1,
   2, 3, 6, 7, 8, and 9, whereas HER2-positive tumors were mainly found in
   cluster 5. Interestingly, claudin-low tumors were stratified into three
   main clusters, again demonstrating a significant heterogeneity. Only
   17.8% were found in cluster 10, whereas 48.8% and 17.8% were found in
   integrative clusters 4 and 3, respectively, regardless of tumor purity
   when compared with non-claudin-low tumors (Supplementary Fig. [77]2).
   Both integrative clusters 3 and 4 are marked by a paucity of copy
   number and cis-acting alterations. Integrative cluster 3 displays a
   simplex pattern of rearrangements, dominated by whole arm gain of
   chromosome 1q and 16p and loss of 16q^[78]11,[79]21,[80]22. It was
   initially described as essentially composed of luminal breast tumors.
   Integrative cluster 4 includes both ER-positive and ER-negative cases.
   It was defined as a CNA-devoid subgroup, with an expression profile
   dominated by immune-related genes, leading to the hypothesis that it
   may be essentially composed of samples with a high infiltration of
   normal cells^[81]11. As the method used in the present study to select
   tumors with a high index of purity does not rely on gene expression
   patterns, it argues against this hypothesis. Nevertheless, to formally
   demonstrate the existence of claudin-low tumors with a flat genomic
   landscape, we analyzed the genomic architecture of tumor samples
   exhibiting a very low level of aberrations (fraction of genome altered,
   FGA < 1%). After having previously removed germline copy number
   variations (CNVs), we analyzed B allele frequency (BAF) and log R ratio
   (LRR) profiles of single-nucleotide polymorphisms (SNPs) localized in
   genomic regions of deletion (LRR segment mean cutoff value < −0.4)
   (Fig. [82]1b and Supplementary Fig. [83]3a, c). This analysis led to
   the identification of regions of single-copy allelic loss displaying a
   clear separation between BAF values (<0.8 and >0.2). As previously
   demonstrated, this observation is incompatible with a massive
   contamination of the tumor tissue by normal cells^[84]23. Consistent
   with these data, the analysis of somatic mutations highlighted the
   presence of several point mutations with a variant allele frequency
   higher than 0.4 (Fig. [85]1c and Supplementary Fig. [86]3b). These two
   results unequivocally demonstrate the existence of claudin-low tumors
   that develop in the absence of gross CIN.

Fig. 1. Claudin-low tumors are distributed across luminal-like IntClust3,
basal-like IntClust10, and CNA-devoid IntClust4, which harbors rare focal
genomic alterations.

   [87]Fig. 1
   [88]Open in a new tab

   a Repartition of breast tumors from the METABRIC cohort for each
   molecular subtype among integrative clusters. Number of tumors are
   represented on the y axis and the corresponding percentages are
   indicated above each bar. All tumors (n = 1266); claudin-low tumors
   (n = 45); basal tumors (n = 129); HER2 tumors (n = 102); luminal A
   tumors (n = 461); luminal B tumors (n = 429). b, c Identification of
   focal genomic alterations in a CNA-devoid IntClust4 claudin-low sample
   (MB-6052/FGA < 0.3%). b BAF and LRR plots of SNPs localized in genomic
   regions of copy number deletion (LRR < 0.4). c Mutation analysis from
   targeted sequencing data. BAF: B allele frequency; CNA: copy number
   alteration; FGA: fraction of genome altered; IntClust: integrative
   cluster; LRR: log R ratio; SNP: single-nucleotide polymorphism.

Genomic diversity of claudin-low tumors

   Although most claudin-low tumors are diploid (Fig. [89]2a), the extent
   of FGA was highly variable across tumors (Fig. [90]2b), confirming
   their intrinsic diversity^[91]24,[92]25. The use of Gaussian finite
   mixture models revealed a trimodal distribution, reflecting the
   existence of three subgroups of claudin-low tumors relative to their
   level of FGA (Fig. [93]2c, d). The subgroup 1 of claudin-low tumors
   (CL1), defined by a low FGA (<10%), was mostly composed of ER-negative
   tumors that were all stratified into integrative cluster 4
   (Fig. [94]2d–f). The subgroup 2 (CL2) showed an intermediate level of
   FGA (>10% and <30%), similar to that of luminal A breast tumors.
   Consistent with this finding, CL2 was mostly composed of ER-positive
   tumors mainly stratified in two luminal-related clusters, whereas 35%
   of CL2 tumors belonged to cluster 4. The subgroup 3 (CL3) displayed a
   high FGA (>30%), similar to the one found in basal-like breast cancers.
   Consistent with this notion, 33% of CL3 tumors were classified into the
   genomically unstable cluster 10. However, this subgroup was
   heterogeneous, with both ER-negative or ER-positive tumors, and a
   significant fraction of CL3 tumors were found in luminal-related
   clusters and in cluster 4. Of note, although these analyses were
   performed on tumors selected for their high tumor purity, similar data
   were obtained when studying the whole cohort of claudin-low tumors
   (Supplementary Fig. [95]4), further strengthening the characterization
   of the three claudin-low subgroups.

Fig. 2. Claudin-low tumors form a genomically and molecularly heterogeneous
subgroup.

   [96]Fig. 2
   [97]Open in a new tab

   a Ploidy for each molecular subtype of the METABRIC cohort. Claudin-low
   tumors are mainly diploid. b FGA% for each molecular subtype.
   Claudin-low tumors display overall low levels of genomic alterations. c
   Gaussian mixture model application on FGA distribution across
   claudin-low tumors reveals three CNA-related subgroups of tumors. Each
   bar on the x axis corresponds to one claudin-low tumor. d FGA% in
   claudin-low subgroups compared with other molecular subtypes. e
   Integrative clusters and f breast cancer receptor status distribution
   in each claudin-low subgroup. The CNA-devoid CL1 subgroup is associated
   with IntClust4 and TN status, whereas CL2 and CL3 are respectively
   related to luminal (luminal clusters and hormone receptor expression)
   and basal-like (IntClust10 and TN tumor enrichment) subtypes.
   Respective colors of luminal-related clusters are shown in Fig. [98]1.
   Boxplot: center line, median; box limits, upper and lower quartiles;
   whiskers, minimum to maximum; all data points are shown. Claudin-low
   tumors (n = 45); basal tumors (n = 129); HER2 tumors (n = 102); luminal
   A tumors (n = 461); luminal B tumors (n = 429); CL1 tumors (n = 10);
   CL2 tumors (n = 17); CL3 tumors (n = 18).CL: claudin-low; TN: triple
   negative.

Distinct gene expression signatures of claudin-low subgroups

   Gene-set enrichment analysis (GSEA) was next used to identify pathways
   differentially regulated among the three claudin-low subgroups, by
   comparing one with the other two (CL1 vs. CL2/3, CL2 vs. CL1/3, and CL3
   vs. CL1/2). More than 15,000 pathways, available in H (Hallmarks), C2
   (encompassing curated gene sets from literature and canonical pathways
   such as KEGG, REACTOME, or BIOCARTA), and C5 (comprising GO pathways)
   gene sets from Molecular Signature Database (MSigDB), were tested.
   These analyses highlighted as follows: (i) an enrichment in stem
   cell-related signatures in CL1, (ii) an enrichment in luminal-related
   signatures in CL2, and (iii) an enrichment in cell cycle-related
   pathways in CL3 tumors (Fig. [99]3a). Of note, global gene expression
   signatures demonstrated that the CL2 and CL3 subgroups showed lower
   luminal- and basal-related signature scores, respectively, compared
   with luminal- and basal-like breast tumors, while exhibiting a
   significant enrichment in invasiveness-related signatures
   (Supplementary Fig. [100]5a, b). We then performed single-sample GSEA
   (ssGSEA) analyses to determine the differentiation profile of the three
   claudin-low subgroups. Gene expression signatures for MaSCs, LPs, and
   mL cells were generated based on the transcription profiles of
   subpopulations of normal human mammary epithelial cells isolated from
   reduction mammoplasty tissues, as shown in Morel et al.^[101]18
   (Supplementary Data [102]1). Consistent with previous GSEA analysis,
   CL1 tumors showed a strong enrichment in the MaSC signature, whereas
   CL2 tumors exhibited a mL signature (Fig. [103]3b). Again, CL3 tumors
   showed a significant heterogeneity with a trend toward the LP
   signature. Notably, very similar data were obtained when using an
   alternative set of MaSC, LP, and mL signatures^[104]14 (Fig. [105]3c).

Fig. 3. Claudin-low subgroups show distinct gene expression signatures
relative to their differentiation status.

   [106]Fig. 3
   [107]Open in a new tab

   a GSEA comparing global gene expression of each claudin-low subgroup
   from the METABRIC cohort to the two others. The three pathways with the
   highest and lowest enrichment scores are represented (>15,000 tested
   pathways—NES ranking). b, c Gene expression analysis for each molecular
   subtype of (b) MaSC1/2/3, LP, and mL1/2 transcriptomic signatures
   (generated from Morel et al.^[108]18 data) and (c) previously published
   MaSC, LP, and mL transcriptomic signatures (from Lim et al.^[109]14
   publication). The CL1 subgroup displays strong stemness features,
   whereas CL2 and CL3 present luminal- and basal-like transcriptomic
   characteristics, respectively. Wilcoxon tests. Boxplot: center line,
   median; box limits, upper and lower quartiles; whiskers, minimum to
   maximum; all data points are shown. Basal tumors (n = 129); HER2 tumors
   (n = 102); luminal A tumors (n = 461); luminal B tumors (n = 429); CL1
   tumors (n = 10); CL2 tumors (n = 17); CL3 tumors (n = 18). GSEA:
   gene-set enrichment analysis; LP: luminal progenitor; MaSC: mammary
   stem cell; mL: mature luminal; NES: normalized enrichment score.

Identification of a gene expression-based classifier

   We next generated a gene expression-based classifier by using the
   nearest shrunken centroid method^[110]26, with the objective of
   discriminating claudin-low subgroups. A claudin-low gene list of 137
   genes discriminating the 3 claudin-low subgroups was thus created
   (Supplementary Fig. [111]6a, b). This expression-based classifier
   showed an accuracy of 91% when compared with the FGA-based
   classification using Gaussian finite mixture models (Supplementary
   Fig. [112]5c, d). The classifier was then applied to claudin-low
   samples from The Cancer Genome Atlas Network (TCGA;
   [113]https://www.cancer.gov/about-nci/organization/ccg/research/structu
   ral-genomics/tcga) and the Cancer Cell Line Encyclopedia
   (CCLE^[114]27), and the level of FGA was used as a readout to verify
   the characteristics of each claudin-low subgroup. Validating the
   accuracy of our classifier, CL1, CL2, and CL3 tumors and cell lines
   showed low, moderate, and elevated FGA levels, respectively
   (Supplementary Fig. [115]6e, f). Of note, we performed a pathway
   analysis of CL1 discriminant genes that revealed an enrichment in stem
   cell and pediatric cancer markers (Supplementary Fig. [116]6g).

Distinct gene methylation profiles of claudin-low subgroups

   Differentially methylated gene analysis was next performed in
   claudin-low subgroups followed by pathway-enrichment analysis, using
   over 15,000 pathways available in MSigDB (HALLMARKS, C2, and C5 gene
   sets) (Supplementary Fig. [117]7). When compared with CL2 and CL3, the
   CL1 subgroup presented 40 differentially methylated genes, and among
   them 75% were hypomethylated and enriched in stemness and EMT markers
   (Fig. [118]4a). The CL2 subgroup encompassed 68 differentially
   methylated genes compared with CL1 and CL3, and among them 88% were
   hypermethylated and enriched in basal-like and EMT pathways
   (Fig. [119]4b). Finally, the CL3 subgroup was characterized by 219
   differentially methylated genes compared with CL1 and CL2; among them,
   58% were hypermethylated and related to luminal pathways and 42%
   presented a hypomethylation profile and were associated with basal-like
   pathways (Fig. [120]4c). To correlate these distinct gene methylation
   profiles with gene expression, we performed a gene expression analysis
   of stemness and EMT gene markers in all three claudin-low subgroups and
   in the intrinsic subtypes of breast cancers from METABRIC, TCGA, and
   CCLE cohorts. The CL1 subgroup had the highest score for stemness-
   (ALDH1A1, PROCR) and EMT- (mesenchymal markers: ZEB1, VIM, CDH2;
   epithelial markers: EPCAM, CDH1) related genes, whereas CL2 and CL3
   displayed an intermediate score compared with CL1 tumors and with their
   relative luminal/basal counterparts (Supplementary Fig. [121]8).

Fig. 4. Claudin-low subgroups show distinct gene methylation profiles
relative to their differentiation status.

   [122]Fig. 4
   [123]Open in a new tab

   Heatmaps and pathway-enrichment analysis of differentially methylated
   genes between a CL1 vs. CL2 and CL3 tumors from TCGA, b CL2 vs. CL1 and
   CL3 tumors from TCGA, and c CL3 vs. CL1 and CL2 tumors from TCGA. CL1
   tumors mainly display hypomethylated genes that are related to a
   stemness phenotype, CL2 tumors are overall hypermethylated for genes
   associated with basal-like features, and CL3 tumors show more
   heterogeneous methylation profiles with both hypermethylated genes
   (linked to luminal phenotype) and hypomethylated genes (relative to
   basal-like characteristics) (see Supplementary Fig. [124]5 for
   methodology). Clustering method: Ward’s; distance: Spearman. FDR: false
   discovery rate.

Distinct oncogenic pathways in claudin-low subgroups

   To characterize the biological pathways driving the development of the
   three subgroups of claudin-low tumors, we determined, for each breast
   tumor sample from METABRIC and TCGA cohorts, ssGSEA scores for all
   MSigDB hallmark gene-set signatures. We then conducted an unsupervised
   clustering based on the median score of each of the 48 biological
   processes computed per breast molecular subtype (Fig. [125]5).
   Interestingly, in both datasets, the three claudin-low subgroups
   clustered together in a specific branch of the dendogram, which further
   discriminated CL1 from the CL2/CL3 tumors, highlighting the specific
   biological processes distinguishing claudin-low from non-claudin-low
   and CL1 from the CL2/CL3 tumors. Furthermore, unsupervised pathway
   clustering identified specific biological functions differentially
   enriched in claudin-low, basal, and luminal breast tumors. Among the
   pathways specifically enriched in all claudin-low subgroups, some of
   them were linked to immunity function, supporting the hypothesis of
   privileged infiltration by immune cells within the microenvironment of
   claudin-low tumors^[126]4,[127]19. Additional gene expression analyses
   performed on METABRIC and TCGA cohorts using two different cellular
   signature deconvolution algorithms confirmed that these tumors were
   highly infiltrated by immune cells (Supplementary Fig. [128]9a–d).
   Nevertheless, unsupervised clustering of microenvironment signatures
   (immune and non-immune cell subsets) did not allow to discriminate the
   different claudin-low subgroups (Supplementary Fig. [129]9e–h). The
   additional pathways specifically enriched in all claudin-low subgroups
   included the TP53-dependent pathway, the mitogen-activated protein
   kinase (RASMAPK) signaling pathway and a differentiation/EMT-related
   pathway (Fig. [130]5). CL1 presented the highest enrichment in
   claudin-low-related pathways, when compared with all other subtypes. Of
   note, the CL2 subgroup displayed a concomitant enrichment in
   claudin-low and luminal pathways, whereas the CL3 subgroup was enriched
   in claudin-low and basal pathways (Fig. [131]5).

Fig. 5. Claudin-low subgroups show distinct activated biological pathways.

   [132]Fig. 5
   [133]Open in a new tab

   Heatmaps of ssGSEA scores for each MSigDB hallmark gene-set signature
   (median by molecular subtype) for a METABRIC and b TCGA datasets. ASCAT
   median purity score of each subtype is indicated on the top heat map
   annotation. Claudin-low tumors cluster together in a specific branch of
   the dendrogram downstream distinguishing CL1 from CL2/CL3 tumors.
   Hallmark gene-set cluster in three main groups of pathways related to
   luminal, basal, and claudin-low tumors. ASCAT: allele-specific copy
   number analysis of tumors; MSigDB: molecular signature database;
   ssGSEA: single-sample GSEA. Clustering method: Complete; distance:
   Euclidean.

   To validate these observations, we generated ssGSEA scores for the
   pathways previously identified in Fig. [134]5 (EMT, RAS/MAPK,
   proliferation, estrogen, and DNA damage response (DDR)), using MSigDB
   C2 curated gene sets (KEGG, GO, REACTOME…). The same pathways were also
   analyzed at the protein level using reverse-phase protein array (RPPA)
   data from TCGA^[135]28. Supporting our previous results, CL2 and CL3
   showed, among the three claudin-low subgroups, the highest score for
   luminal- (estrogen-related response) and basal-related (proliferation
   and DDR) signatures, respectively (Supplementary Fig. [136]10).
   Nevertheless, statistical analyses revealed a significantly attenuated
   phenotype compared with their related molecular subtype. Furthermore,
   both CL2 and CL3 showed an activation of the EMT process (Fig. [137]6a,
   b, g), coherent with the downregulation of hormone (in CL2) and
   proliferation/DDR (in CL3) signatures, compared with luminal and basal
   tumors, respectively (Supplementary Fig. [138]10 and Fig. [139]6g).
   Moreover, CL1 displayed a low proliferation activity and a low DDR
   (Supplementary Fig. [140]10a, b, e, f), as well as a strong TP53
   signaling pathway signature associated with a low frequency of TP53
   mutations (Fig. [141]6c, d, g). Finally, CL1 exhibited a high
   activation of the RAS-MAPK signaling pathway, a feature also found in
   CL2 and CL3, although at a lower level (Fig. [142]6e–g). Consistent
   with these findings, claudin-low cell lines were significantly more
   sensitive to MEK inhibitors (MEKi) than non-claudin-low cell lines
   (Fig. [143]6h).

Fig. 6. Claudin-low subgroups jointly display high EMT features and MAPK
pathway activation.

   [144]Fig. 6
   [145]Open in a new tab

   a, c, e ssGSEA score distribution for each molecular subtype, from
   METABRIC and TCGA cohorts, related to a EMT, c TP53, and e RAS/MAPK
   pathways. d TP53 somatic mutation distribution across breast molecular
   subtypes of METABRIC and TCGA cohorts. b, f Protein-based pathway
   scores (RPPA data) for each molecular subtype from TCGA breast tumors
   related to b EMT and f RAS/MAPK. g Principal component analysis (PCA)
   of TCGA breast tumors according to EMT, RAS/MAPK, cell
   cycle/proliferation, hormone-related and DDR pathways at RNA (ssGSEA
   scores), and protein (RPPA scores) levels. For each sample, molecular
   subtype is indicated in color and variables (pathways) are highlighted
   in black squares. h IC50 distribution according to molecular subtype
   for three different MEK inhibitors available in the GDSC database.
   Claudin-low cell lines commonly exhibit higher sensitivity to MAPK
   inhibitors than other breast molecular subgroups. Wilcoxon tests.
   Boxplot: center line, median; box limits, upper and lower quartiles;
   whiskers, minimum to maximum; all data points are shown. TP53 signaling
   pathway signature: GO positive regulation of signal transduction by p53
   class mediator; MAPK signaling signature: KEGG MAPK signaling pathway.
   DDR: DNA damage response; EMT: epithelial-mesenchymal transition; RPPA:
   reverse-phase protein array.

   Altogether, our data strongly support the notion that the CL1 subgroup
   of claudin-low tumors was related to normal MaSCs, CL2 to normal mL
   cells and to luminal breast cancers, and CL3 to basal-like breast
   cancers. Unstratified claudin-low tumors have been previously
   associated with poor prognosis^[146]4,[147]29. When stratified by CL
   subgroups, variations in survival outcome were found, consistent with
   their presumed origin (Supplementary Fig. [148]11). Indeed, the
   basal-like-related CL3 subgroup of claudin-low tumors was associated
   with low overall and disease-free survivals, as compared with the
   luminal-related CL2 subgroup. In line with their low proliferation
   index and their low frequency of TP53 mutations, tumors of the CL1
   subgroup were associated with a favorable prognosis.

Discussion

   Here we have characterized the intrinsic diversity within the
   claudin-low breast cancers by demonstrating the existence of three main
   molecular subgroups. Consistent with the data from a recent
   study^[149]30, published during the revision process of our manuscript,
   we have shown that these subgroups are associated with distinct
   survival outcomes. Moreover, our study has delved into the underlying
   reasons of the observed heterogeneity, revealing that claudin-low
   tumors exhibit different developmental origins.

   Supporting the prevailing hypothesis that some of these tumors are
   generated from basal-like breast cancers, the comprehensive
   characterization of METABRIC and TCGA databases have revealed the
   existence of claudin-low breast cancers (CL3 subgroup) with
   characteristics of basal mammary epithelial cells and of basal-like
   breast cancers. These traits include the hypomethylation of genes
   related to basal-related pathways, the hypermethylation of genes
   related to luminal-related pathways, a high level of expression of
   proliferation-related genes and a strongly disturbed genomic landscape.
   Several experimental data substantiate the hypothesis of the basal
   origin of claudin-low breast cancers, including the demonstration that
   the oncogenic transformation of basal mammary epithelial cells and of
   LPs can generate malignant cells with mesenchymal characteristics
   through the activation of an EMT process^[150]17,[151]31. However,
   highlighting a first degree of heterogeneity, a significant fraction of
   claudin-low tumors (CL2 subgroup) displays a gene expression signature
   of mL cells and shares common genetic, epigenetic and transcriptomic
   features with the luminal A intrinsic subtype. Although unexpected,
   this finding is reminiscent of the experimental observation that EMT
   induction in luminal breast cancer cell lines confers them with a
   claudin-low expression pattern^[152]32. A second degree of
   heterogeneity is shown by the CL1 subgroup of claudin-low breast
   cancers. Among the three subgroups of claudin-low tumors, CL1 appears
   as the most distinct, with differential transcriptomic, epigenetic, and
   genetic traits that comprise a prominent stemness signature and a
   paucity of genomic aberrations. Although TNBCs with negligible CNAs
   were previously reported^[153]21,[154]22, their existence was
   questioned due to the marked immune and stromal cell infiltration
   attributed to claudin-low tumors^[155]11. Here, the identification of
   focal genetic alterations in tumors selected for their purity
   unequivocally demonstrates the existence of CNA-devoid claudin-low
   tumors. Beyond their stemness characteristics and their low CIN, CL1
   tumors have additional characteristic features when compared to all
   other breast cancers. These traits include (i) a high expression of the
   ZEB1 EMT-inducing transcription factor and of its target, the
   methionine sulfoxide reductase MSRB3; (ii) a frequent activation of the
   RAS-MAPK signaling pathway; (iii) a low activation of the DDR; and (iv)
   a low frequency of TP53 mutations (Supplementary Figs. [156]8 and
   [157]10, and Fig. [158]6). These findings are highly consistent with
   the stemness features of CL1 tumors. Indeed, we demonstrated recently
   that normal human MaSCs exhibit a pre-emptive antioxidant program
   driven by ZEB1 and MSRB3, providing them with the unique capacity to
   readily adapt to an aberrant activation of the RAS-MAPK
   pathway^[159]18. As a consequence, RAS-driven malignant transformation
   of MaSCs occurs in the absence of DNA damage, thereby alleviating the
   selective pressure on the inactivation of the TP53-dependent failsafe
   program^[160]8,[161]18. Altogether, these findings led us to propose a
   model with two alternative paths leading to the development of
   claudin-low tumors (Fig. [162]7). A direct path relies upon the
   malignant transformation of a normal MaSC, leading to undifferentiated
   tumors with a low genomic instability and a low frequency of TP53
   mutations. An indirect path relies upon the activation of an EMT
   process, over the course of tumor progression. Beyond a certain extent
   of transdifferentiation, the gain of mesenchymal features triggers the
   conversion into a gene expression signature of claudin-low tumor
   subtype. When it occurs in a basal-like tumor, EMT promotes the
   evolution toward a claudin-low tumor with extensive genomic aberrations
   as the reflect of previous periods of exacerbated CIN. When this
   process occurs in a luminal breast cancer, it leads gradually to a
   claudin-low tumor with a moderate level of genomic aberrations.

Fig. 7. The diversity of claudin-low tumors reflects the impact of the
cell-of-origin on the evolutionary process of breast tumorigenesis.

   [163]Fig. 7
   [164]Open in a new tab

   The malignant transformation of a normal MaSC leads to undifferentiated
   tumors with a low genomic instability and a low frequency of TP53
   mutations. These features are characteristic of the CL1 subgroup. The
   activation of an EMT transdifferentiation process and the gain of
   mesenchymal features in a basal-like tumor promotes the development of
   a CL3 tumor with extensive genomic aberrations. EMT commitment in a
   luminal-like breast cancer leads gradually to a CL2 tumor with a
   moderate level of genomic aberrations.

   Overall, this model illustrates the emerging concept of cellular
   pliancy^[165]8. The level of pliancy is defined by the intrinsic
   susceptibility of a discrete cell state to undergo malignant
   transformation or degeneration after sustaining a particular oncogenic
   insult^[166]8,[167]33. In this regard, the low genomic instability of
   CL1 tumors might thus reflect the high pliancy of human MaSCs for RAS
   activation. Interestingly, although CL1 tumors show the most frequent
   activation of the RAS-MAPK pathway, it is noteworthy that the
   activation of this mitogenic pathway is also a common event in CL2 and
   CL3 tumors, compared with other breast cancer subtypes. This latter
   finding may reflect the capacity of the RAS signaling pathway to induce
   EMT in permissive conditions^[168]34–[169]36. Implicit in this notion
   is that the activation of the RAS-MAPK pathway may be a primary
   oncogenic event in the tumorigenic process leading to CL1 tumors and a
   secondary event in the course of tumor progression in a fraction of
   basal-like and luminal breast cancers, eventually leading to CL2 and
   CL3 tumors.

   Although this hypothesis remains to be tested, our data may have
   significant clinical implications in light of the previously described
   resistance of EMT-related tumors to a variety of anti-cancer
   therapies^[170]37–[171]39. Indeed, the identification of the recurrent
   activation of the RAS-MAPK pathway in claudin-low breast cancers may
   lead to new therapeutic opportunities that need to be tested in
   preclinical models.

Methods

Samples

   Breast tumor samples used in this study were from METABRIC^[172]21 and
   TCGA Research Network ([173]https://www.cancer.gov/tcga) cohorts, and
   breast cancer cell lines were from CCLE database^[174]27.

Statistics

   All analyses and statistical tests were carried out with the R software
   (version 3.6.1)^[175]40. Heatmaps were generated with ComplexeHeatmap,
   principal component analysis was completed with ade4, Gaussian finite
   models were performed with mclust and Rmixmod, and survival analyses
   were conducted with survival R packages^[176]25,[177]41–[178]43. All
   statistical tests were two-tailed. Figures were created using either
   the R software or GraphPad Prism 8.0 (GraphPad Software, Inc., San
   Diego, USA).

Expression data processing

   METABRIC microarray expression data from discovery and validation sets
   were extracted from the EMBL–EBI archive (EGA,
   [179]http://www.ebi.ac.uk/ega/; accession number: EGAS00000000083)
   (Normalized expression data files)^[180]21. Normalized expression data
   per probe of the discovery set and the validation set were combined
   after normalization of each set independently with a median Z-score
   calculation for each probe. The expression levels of different probes
   associated with the same Entrez Gene ID were averaged for each sample
   in order to obtain a single expression value by gene.

   The Cancer Genome Atlas Breast Invasive Carcinoma (TCGA BRCA) RNA
   sequencing (RNASeq) expression data were extracted as FPKM (fragments
   per kilobase of transcript per million mapped reads) values from the
   Genomic Data Commons (GDC) data portal
   ([181]https://portal.gdc.cancer.gov/). FPKM data by gene were converted
   to transcripts per kilobase million (TPM) as follows: for each gene
   g ∈ G and each sample s ∈ S,
   [MATH: <mi mathvariant="normal">TPM</mi><mfenced close=")" open="("
   separators=""><mrow><mi>g</mi><mo>,</mo><mi>s</mi></mrow></mfenced><mo>
   =</mo><mfenced close=")" open="(" separators=""><mrow><mfrac><mrow><mi
   mathvariant="normal">FPKM</mi><mrow><mo>(</mo><mrow><mi>g</mi><mo>,</mo
   ><mi>s</mi></mrow><mo>)</mo></mrow></mrow><mrow><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>G</mi></mrow></munderover><mi
   mathvariant="normal">FPKM</mi><mrow><mo>(</mo><mrow><mi>i</mi><mo>,</mo
   ><mi>s</mi></mrow><mo>)</mo></mrow></mrow></mfrac></mrow></mfenced><mo>
   ×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mn>6</mn></mrow></m
   sup> :MATH]

   RNASeq expression data from the CCLE breast cell lines were extracted
   as RPKM values from the CCLE data portal
   ([182]https://portals.broadinstitute.org/ccle). RPKM data by gene were
   converted to TPM as follows: for each gene g ∈ G and each sample s ∈ S,
   [MATH: <mi mathvariant="normal">TPM</mi><mfenced close=")" open="("
   separators=""><mrow><mi>g</mi><mo>,</mo><mi>s</mi></mrow></mfenced><mo>
   =</mo><mfenced close=")" open="(" separators=""><mrow><mfrac><mrow><mi
   mathvariant="normal">RPKM</mi><mrow><mo>(</mo><mrow><mi>g</mi><mo>,</mo
   ><mi>s</mi></mrow><mo>)</mo></mrow></mrow><mrow><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>G</mi></mrow></munderover><mi
   mathvariant="normal">RPKM</mi><mrow><mo>(</mo><mrow><mi>i</mi><mo>,</mo
   ><mi>s</mi></mrow><mo>)</mo></mrow></mrow></mfrac></mrow></mfenced><mo>
   ×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mn>6</mn></mrow></m
   sup> :MATH]

Molecular breast cancer subtype assignment

   Triple-negative (TN) status was determined from clinical data obtained
   through the Synapse platform ([183]https://www.synapse.org/)
   (syn1757053) for METABRIC dataset and the GDC data portal
   ([184]https://portal.gdc.cancer.gov/) for TCGA dataset. Expression
   value for estrogen and progesterone receptors were combined with the
   SNP6 state for ERBB2 gene, to define the TN status of each available
   tumor.

   Integrative cluster and breast cancer molecular subtype attribution
   were performed using the R package genefu^[185]44. Basal-like, luminal
   A, luminal B, Her2, and normal-like subtype assignments were computed
   from five different algorithms (PAM50, AIMS, SCMGENE, SSP2006, and
   SCMOD2)^[186]45–[187]49. An assignment was considered final if defined
   by at least three different algorithms. In case of divergence between
   classifiers, PAM50 subtype attribution was used.

   The claudin-low subtype classification was defined by the nearest
   centroid method. For that, the Euclidean distance between each tumor
   sample (from METABRIC and TCGA cohorts) and the previously described
   claudin-low and non-claudin-low centroids for tumor samples were
   determined, using the 1667 genes defined by Prat et al.^[188]4 as
   significantly differentially expressed between claudin-low tumors and
   all other molecular subtypes. The tumor-related centroids
   (differentially expressed genes between claudin-low and non-claudin-low
   tumors using significance analysis of microarrays) were used rather
   than the cell-line-related centroids (nine cell lines), as cell culture
   dramatically changes tumor phenotype and thus transcriptomic signature
   of primary samples. The use of a high-purity threshold for tumor
   selection allowed to circumvent the contamination bias related to the
   tumor-related centroid (Supplementary Fig. [189]1). For CCLE breast
   cancer cell lines, claudin-low status assignment was performed using
   the nine-cell-line predictor via the R package genefu^[190]44.

Pathway enrichment analysis

   All pathway-enrichment analyses were conducted using MSigDB gene sets
   H, C2, and C5 from msigdbr R package^[191]50. GSEAs were carried out
   using fgsea R package^[192]51. Gene lists were pre-ranked using
   Signal2Noise metric. ssGSEA scores were computed through gsva R
   package^[193]52. Pan-cancer transcriptomic EMT signature, defined by
   Tan et al.^[194]53, was used to compute EMT score for each sample.

Normal mammary cell transcriptomic signatures

   Previously published microarray expression data were used to generate
   lists of differentially expressed genes along the mammary epithelial
   cell hierarchy (MaSC1/2/3, LP, and mL1/2)^[195]18. Microarray data were
   robust multiarray average-normalized through oligo R package and
   differential expression analysis was performed using limma R
   package^[196]54,[197]55. The top 500 false discovery rate (FDR) p-value
   ranked genes differentially expressed between one subpopulation (MaSC
   or LP, or mL cells) vs. the two others (FDR p-value < 0.05) were used
   as transcriptomic signature for each of the three cell subpopulations
   (Supplementary Data [198]1).

Claudin-low subgroups classifier

   Expression-based classifier for the three claudin-low subgroups was
   identified using shrunken nearest centroid method through pamr R
   package^[199]26. A 1.87 threshold for centroid shrinkage was defined
   after examination of training errors and the cross-validation results.
   Finally, the nearest shrunken centroid classifier encompassed 137 genes
   whose expression discriminate the three FGA-related claudin-low
   subgroups in METABRIC cohort (Supplementary Fig. [200]6).

Microenvironment analysis

   Estimation of immune and non-immune cell fractions from tumor
   microenvironment were determined through gene expression analysis using
   Immunedeconv R package^[201]56 together with Xcell and MCPCounter
   deconvolution methods^[202]57,[203]58.

Copy number data processing

   METABRIC segmented copy number data from discovery and validation sets
   were extracted from the EGA ([204]http://www.ebi.ac.uk/ega/; accession
   number: EGAS00000000083) (Segmented (CBS) copy number aberrations (CNA)
   files)^[205]21. TCGA BRCA segmented copy number data were extracted
   from the GDC data portal repository (files corresponding to alignments
   on the hg19 version of the human genome without germline CNV were
   chosen). FGAs was evaluated from TCGA and METABRIC segmented copy
   number data (both generated from Affymetrix SNP6.0 arrays) as follows:
   [MATH: <mi
   mathvariant="normal">FGA</mi><mo>=</mo><mfrac><mrow><munder><mrow><mo>∑
   </mo></mrow><mrow><mi
   mathvariant="normal">CN</mi><mi>i</mi><mo>></mo><mi
   mathvariant="normal">WM</mi><mo>+</mo><mi>T</mi></mrow></munder><mi>L</
   mi><mfenced close=")" open="("
   separators=""><mrow><mi>i</mi></mrow></mfenced></mrow><mrow><mfenced
   close=")" open="(" separators=""><mrow><mo
   mathsize="big">∑</mo><mi>L</mi><mfenced close=")" open="("
   separators=""><mrow><mi>i</mi></mrow></mfenced></mrow></mfenced></mrow>
   </mfrac><mo>+</mo><mfrac><mrow><munder><mrow><mo>∑</mo></mrow><mrow><mi
   mathvariant="normal">CN</mi><mi>i</mi><mo><</mo><mi
   mathvariant="normal">WM</mi><mo>−</mo><mi>T</mi></mrow></munder><mi>L</
   mi><mfenced close=")" open="("
   separators=""><mrow><mi>i</mi></mrow></mfenced></mrow><mrow><mfenced
   close=")" open="(" separators=""><mrow><mo
   mathsize="big">∑</mo><mi>L</mi><mfenced close=")" open="("
   separators=""><mrow><mi>i</mi></mrow></mfenced></mrow></mfenced></mrow>
   </mfrac> :MATH]

   For each segment i, CN[i] is the mean LRR along segment i, L(i) is the
   length of segment i, WM is the weighted median of CN[i] by L(i) for
   each sample I, and T is the threshold value of the CN[i] above which
   the segments are considered to be altered. In other words, FGA is the
   ratio of the sum of the lengths of all segments with signal above the
   threshold to the sum of all segment lengths. For CCLE cell lines
   analysis, T was set as 0.2, whereas for METABRIC and TCGA tumors
   analysis, T was set as 0.1.

   ASCAT ploidy and purity estimates were extracted from COSMIC data
   repository ([206]https://cancer.sanger.ac.uk/cosmic/)^[207]20. To avoid
   any substantial bias due to non-tumor cell contamination, tumors
   without available estimation of ASCAT aberrant tumor cell fraction were
   removed from the whole cohort. Furthermore, a stringent purity
   threshold (ASCAT purity > 0.38) was determined, by applying Wilcoxon
   test, to select purest tumor samples from TCGA and METABRIC databases
   when applicable.

Methylation data processing

   Normalized methylation β-values per gene were extracted from cBioPortal
   ([208]http://www.cbioportal.org/)^[209]59,[210]60. As previously
   described for genome-wide expression–methylation quantitative trait
   loci analysis^[211]61, genes for which methylation β-values were
   negatively correlated with expression values (Spearman’s ρ < 0 and FDR
   p-value < 0.05) were selected for differential methylation analysis
   (Supplementary Fig. [212]7).

RPPA data processing

   RPPA level 4 data were extracted from the cancer proteome atlas portal
   ([213]https://tcpaportal.org/tcpa/download.html)^[214]62. RPPA pathway
   lists, defined by Akbani et al.^[215]28, were used to compute protein
   pathway scores for each sample.

Somatic mutation data

   METABRIC somatic mutation data from targeted sequencing were obtained
   from Pereira et al.^[216]63
   ([217]https://github.com/cclab-brca/mutationalProfiles/tree/master/Data
   ) and TCGA somatic mutation data from whole-exome sequencing were
   obtained from Ellrott et al.^[218]64. Only somatic mutations annotated
   as exonic, in-frame, and non-silent were used for mutation analysis.

Drug response data

   Predicted IC50 data for MEKi (Trametinib, Selumetinib, Refametinib)
   were download from Genomics of Drug Sensitivity in Cancer (GDSC)
   database ([219]https://www.cancerrxgene.org/)^[220]65.

Clinicopathological analysis

   Complete clinical data were obtained through the Synapse platform
   ([221]https://www.synapse.org/) (syn1757053) for METABRIC dataset.

   For TCGA dataset, survival data were extracted from cBioPortal (TCGA
   BRCA, PanCancer Atlas) ([222]http://www.cbioportal.org/) and other
   clinical data were obtained from the GDC data portal
   ([223]https://portal.gdc.cancer.gov/)^[224]59,[225]60.

Reporting summary

   Further information on research design is available in the [226]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [227]Supplementary Information^ (8MB, pdf)
   [228]Peer Review File^ (1.2MB, pdf)
   [229]41467_2020_17249_MOESM3_ESM.docx^ (12.8KB, docx)

   Description of Additional Supplementary Files
   [230]Supplementary Data 1^ (165.3KB, xlsx)
   [231]Reporting Summary^ (1.8MB, pdf)

Acknowledgements