Abstract
We studied the transcriptome landscape of skin cutaneous melanoma
(SKCM) using 103 primary tumor samples from TCGA, and measured the
expression levels of both protein coding genes and non-coding RNAs
(ncRNAs). In particular, we emphasized pseudogenes potentially relevant
to this cancer. While cataloguing the profiles based on the known
biotypes, all the employed RNA-Seq methods generated just a small
consensus of significant biotypes. We thus designed an approach to
reconcile the profiles from all methods following a simple strategy: we
selected genes that were confirmed as differentially expressed by the
ensemble predictions obtained in a regression model. The main
advantages of this approach are: 1) Selection of a high-confidence gene
set identifying relevant pathways; 2) Use of a regression model whose
covariates embed all method-driven outcomes to predict an averaged
profile; 3) Method-specific assessment of prediction power and
significance. Furthermore, the approach can be generalized to any
biological system for which noisy RNA-Seq profiles are computed. As our
analyses concerned bio-annotations of both high-quality protein coding
genes and ncRNAs, we considered the associations between pseudogenes
and parental genes (targets). Among the candidate targets that were
validated, we identified PINK1, which is studied in patients with
Parkinson and cancer (especially melanoma).
Introduction
Melanoma, the deadliest of skin cancers, is a malignant and highly
metastatic tumor with growing incidence of about 20 per 100,000
individuals in Western countries^[34]1. In 10% of melanoma cases, there
is a positive family history, i.e. a risk factor significantly
contributing to disease occurrence^[35]2. At a genetic level, multiple
genes are naturally expected to interactively exert direct and/or
indirect influences on many biological processes and pathways,
suggesting possible signatures of the disease. One of the first
discovery efforts in melanoma transcriptome studies revealed the
emergence of genomic rearrangements behind novel gene fusions and
identified somatic mutations responsible for disease progression^[36]3.
Further RNA-Seq analyses have provided in-depth screening for a variety
of transcript structures, by studying human samples^[37]4 or cell
lines^[38]5. RNA-Seq is the standard technique that is used to
characterize cancer transcriptomes while measuring known transcripts
and predicting novel transcripts^[39]6–[40]11. Additionally, RNA-Seq
expands the detection power of transcriptome profiles by identifying
and elucidating a variety of non-coding RNAs (ncRNAs)^[41]12,[42]13.
Following the seminal findings of two large-scale projects,
ENCODE^[43]14 ([44]https://www.encodeproject.org/) and GENCODE^[45]15
([46]http://www.gencodegenes.org/), the general attention has been
gradually diverted away from genes and concentrated over other
biotypes, such as long intergenic ncRNAs (lincRNAs), antisense RNAs,
pseudogenes etc. In most cases the identification of potentially
transcribed bio-entitities was inferred from locus-specific (EST) and
wide-spectrum experiments, then referring transcriptional activity to
tissue specificity and establishing association with active chromatin
state and promoter regions.
We performed a systematic exploration of the skin cutaneous melanoma
(SKCM) transcriptome landscape by accessing the TCGA repository
([47]http://cancergenome.nih.gov/) and its entire collection of 103
primary SKCM samples. As TCGA contained only one control sample for
SKCM, we retrieved other reference melanocyte samples in the GEO
repository ([48]http://www.ncbi.nlm.nih.gov/geo/). In order to identify
both gene and ncRNA signatures, we built an ad hoc methodological
pipeline to deal with the observed biotype variety and expression
heterogeneity. Pervasive transcription has been widely documented,
involving multi-exonic transcripts that map to intergenic regions with
specific interactions and functions^[49]16–[50]22. The ability to
detect ncRNAs suffers from low read density coverage, thereby resulting
in non-uniform coverage of transcript abundance. This latter is
generally >100-fold higher in exons than in introns or intergenic
regions. However, despite the limitation of more/less abundant
transcripts represented by many/few reads, respectively, lowly
expressed transcripts can be widely detected by RNA-Seq across many
cancers.
The pseudogene participation to the transcriptional activity deserves a
special attention. Thousands of pseudogenes have been identified in the
human genome, many with critical regulatory functions^[51]23. For
instance, an estimate indicates 12683 human pseudogenes, with 11216
surveyed as consensus pseudogenes and 9% actively transcribed. Since
pseudogenes lack protein coding potential, they can manifest functional
or non-functional RNA products, and appear specific with respect to
both tissues and cells. ENCODE has identified about 15000 pseudogenes,
a great majority of which are transcriptionally active. A recent
large-scale analysis of pseudogene transcription in 13 cancer types
(248 cancer samples, and 45 normal samples) showed the signature of
2082 distinct pseudogenes^[52]4. Several aspects of pseudogenes involve
uncertainty with regard to identification, function, annotation, role
in cancer^[53]24–[54]26. We believe that knowledge gains can be derived
from the use of computational approaches^[55]27–[56]29. For example,
the associations between pseudogenes and so-called parental genes, i.e.
the functional genes with the highest sequence similarity, can be
usefully investigated. One reason is that the complexity of such
relationships goes far beyond the possible presence of correlation
between expression levels. The latter is quite controversial because
inconsistently supported by evidence. Based on the analyzed samples, we
have provided an empirical assessment of such patterns both at systemic
scale and at local level of specific associations^[57]30,[58]31. Our
integrative computational approach involves three main steps, which we
applied sequentially to the SKCM transcriptome data.
* i)
RNA-Seq analysis, performed by multiple methods. The results were
then integrated to achieve a large consensus of differentially
expressed (DE) bioentities, either differentially expressed genes
(DEGs) or DE ncRNAs;
* ii)
Ensemble statistical modeling. This step aimed to recover salient
features that remain hidden in the consensus data. It reconcileed
the profiles obtained from the different methods into a
high-confidence inferable coreset of model-classified bioentities;
* iii)
Bio-annotations (pathways and gene families), organized into
curated lists of pathway and GO categories. These were supported by
other classifications addressing the potential of genes to play a
role in cancer (oncogenes, tumor suppressors, etc.).
We performed two complementary methodological steps: a) The
identification of pseudogenes and corresponding parental genes,
followed by their profiling and by an assessment of their
relationships; and b) The functional validation of selected DEG
candidates to verify both reliability and effectiveness of our ensemble
approach to generate testable hypotheses. The rationale for enabling an
ensemble inference approach is that multiple RNA-Seq methods can be
used to both quantify and profile the expression levels of
transcriptome biotypes, but limitations are present with each method.
Clearly enough, studies centered on specific methods are subject to
risk of bias. Among the alternative strategies that can be used to
overcome the limitations, some establish an efficient practice by
combining results from more than one method. It is also possible that
contradictory results may emerge from using methods operating different
quantifications and normalizations. Our evidences demonstrate exactly
this aspect. In particular, a method-driven selection of different and
separated DE-space regions is obtained, and the significance is
assigned to detections that are in large part not shared. To increase
the consistency of the results, evidences should therefore be
consolidated without aiming necessarily at a consensus. This goal can
be achieved without over-penalizing disparate evidences through the
ensemble modeling approach that we introduce here and demonstrate to be
reasonably efficient and accurate.
Results
Figure [59]1 concisely illustrates the overall findings of our work
through a Sankey flow diagram. The information flow shows the results
of applying our methodological pipeline to the TCGA SKCM data. The
included steps reflect the numeric evidences found by the application
of RNA-Seq methods and model reduction approaches, and the qualitative
refinements obtained from biotype catalogues and bioannotations. We can
state that modeling through an ensemble approach represents a viable
solution to the problem of reconcile multi-evidence RNA-Seq results.
Figure 1.
[60]Figure 1
[61]Open in a new tab
Sankey flow diagram. The outcomes generated by our approach are
visualized step-by-step. The first stage indicated at the left side
lists the various methods applied to the RNA-Seq data with the
corresponding profile, namely the number of DE biotypes (both genes and
ncRNAs) detected in each individual application. A total profile of
9,729 DE biotypes is obtained from the sum of all such detections, and
in particular Limma (5290) and DESeq (4326) which provided the largest
amounts, CuffDiff (3568) and GeneSpring (2758) with smaller amounts,
and finally NOISeq as the most conservative methods (511)
(Supplementary Tables [62]5.1.1-5). Two models are then reported, LRM
and PCA which deliver respectively 896 and 752 DE biotypes, thus
reducing dramatically (more than one order of magnitude) the overall
profile. The rest of the diagram accounts for specific biotypes,
relatively more represented by protein coding genes (906), antisense
(216), lincRNA (84), pseudogenes (136), and with the latter presenting
in 129 cases association with related parental genes (145)
(Supplementary Fig. [63]5.2.1). Bioannotations (pathways, gene
families) are indicated downstream as the endpoint of the diagram.
At a biological level our primary interest is to detect the DE profiles
of biotypes forming the SKCM transcriptome landscape (Fig. [64]1,
left). Such profiles may be provided by different methods, and we chose
five of them to build a consensus. To increase detection accuracy and
exclude redundancy, the ensemble modeling was designed to take place
after the consensus. The ensemble strategy produces better DE profiling
and better biotype decomposition (Fig. [65]1, right). We look
especially at the ncRNAs and focus on pseudogenes and parental genes
associations. Statistical models supporting the ensemble are Principal
Component Analysis (PCA) and Linear Regression Model (LRM). Their
specific structures explain the differential performance. The use of
RNA-Seq expression profiles as predictors cast within the LRM framework
is central to our inference approach. This way, the profiles generated
by the methods are used predictively and their performance assessed
synergistically. At a numerical level, an order of magnitude change
occurred by shifting from consensus (about 10000, Supplementary
Tables [66]5.1) to model-driven significant detections, including DEGs
and DE ncRNAs (about 1000, Supplementary Tables [67]8.1.1-5 and
8.2.1-8; Supplementary Fig. [68]8.1.1-2, 8.2.9-10 and 8.3.1). Data
reduction also induced qualitative advantages with better biotype
annotations (Supplementary Fig. [69]6.1.1.1-5, 8.4.1; Supplementary
Tables [70]8.4.2-4), including pseudogenes (129) and related parental
genes (145) (Supplementary Tables [71]5.3.1-2 and Supplementary
Fig. [72]5.3.3-4).
A summary of outcomes is presented in Fig. [73]2 with the flowchart of
our RNA-Seq analyses: data generation, assemblies, mutational grouping
and validations for subsets of outcomes from different data sources.
Two standard quantifications were considered: read counts and FPKM
(Fragments Per Kilobase Million) (Supplementary Tables [74]S2_1-2).
Other recent proposals^[75]32 aimed to yield transcript lists have not
been used in this work, which is focused on gene-centric analysis.
Thus, isoform detection and alternative splicing are not considered in
this work.
Figure 2.
[76]Figure 2
[77]Open in a new tab
RNA-Seq flowchart. The samples were processed according to the five
different methods that were selected, and whose quantifications
included both read counts (simply the number of reads overlapping a
given feature such as a gene) and fpkm (Fragments Per Kilobase of exon
per Million reads). With the latter it is possible to compare genes of
different lengths, and ‘per million reads’ means that a value
normalized against the library size is obtained. The consensus occurred
between significant detections in both read counts and fpkm scenarios,
before reaching a global result (overall consensus). The mutational
profile at the right side was implemented under simplified algorithmic
conditions (CuffDiff), and for two major mutations (BRAF, NRAS)
(Supplementary Tables [78]10.1-5). The validations refer to candidates
coming from all the considered scenarios.
Differential Expression Profiling
The first set of significant detections, concerning both DEGs and
DE-ncRNAs, were identified by the consensus between profiles using (see
Fig. [79]1 and Fig. [80]2) DESeq^[81]33., NOISeq^[82]34., LIMMA^[83]35,
CuffDiff^[84]36, and the licensed GeneSpring (V 13.0) (Supplementary
Tables [85]3.1-5). These methods were selected to provide different
types of quantifications, normalizations and statistical distributions.
Due to small overlapping regions between the five methods
(Fig. [86]3A), the inferable “DE-space” of significant DE biotype
values appears questionable. The overlap degree from the employed
methods might depend on the influences from non-controllable factors,
for example samples, experiments, algorithms. Ambiguities appear, for
instance with variable (high/low) log(FC) for identical calls obtained
by different methods, indicating variation of signal intensity. This
can be handled simply by averaging. When multiple methods detect with
sign discordancy (overexpression vs underexpression), the uncertainty
increases. Vice versa, detections identified by any particular method
would need confirmation by other methods to be called significant.
Figure 3.
[87]Figure 3
[88]Open in a new tab
Venn diagrams: multiple detection scenarios. Detections refer to all DE
biotypes. (A) Overall consensus. (B) Pseudogene-parental genes
detections. (C) LRM-driven detections. (D) PCA-driven detections. (E)
LRM vs PCA comparison. (F) Biotype classifications in each model.
Thresholded coefficients are obtained by boxplot-derived criteria: from
1.5 x IQR for LRM (Supplementary Tables [89]8.2.5-8), and from most
outlying values for PCA (Supplementary Fig. [90]8.1.1-2).
The presence of well-separated DE-space regions representing the
inferable DE-space, indicates sensitivity to the employed method. When
assembling these outcomes, there is uncertainty about the confidence
levels that may be assigned to such specific results. In principle,
each method has localization power enabled by significance criteria
(either conservative or liberal). These criteria depend on specific
factors, such as distributional hypotheses, built-in normalization
procedures, parametric settings, multiple corrections, etc., all
exerting influence over the final detections. Generally speaking, local
rather than global data features can bring good approximation power for
inherently heterogeneous data, but likely poor generalization power.
Note also that complex experimental data may lack robustness against
possible under-sampling effects, which means that the spectrum of data
features may be inadequately represented. Lastly, among the factors
that exert negative impacts on inference, there are noise, algorithmic
errors, and uncertain annotation of the bioentities (nomenclature from
ENSEMBL as in Supplementary Tables [91]6.2.1-4) in addition to
knowledge gaps and term redundancies typically present in GO and
pathway tools. In general, a lack of substantial harmonization
currently exists and leads to inconsistent results, even in presence of
solid evidences. Of relevance the fact that our data-driven inference
approach is both generalizable and scalable with dimensions and
complexities, and tuned to increased accuracy in annotation tasks.
Building the Ensemble from the Consensus
Figure [92]3 presents a sequence of Venn diagrams describing a map of
detections, which indirectly tells how the methods saturate the
DE-space with their derived profiles. First, a total of 9729 profiled
bioentities is found from the overall consensus (Fig. [93]3A), with the
innermost core sharing 115 bioentities. The core is surrounded by sets
partially contributed by the five methods. Note that the most
peripheral regions indicate the method-specific findings, which present
the largest numbers, i.e. more than 1000 unique detections in each case
(apart from NOISeq). Part of Fig. [94]3A detections, i.e. the ones
relevant for pseudogenes and the associated protein coding genes
classified as parental genes, are shown in Fig. [95]3B (see
Supplementary Tables [96]8.5.3.1-2 and Figs [97]5.3.3–4 for the
sequence analysis results testing the similarity between the two
biotypes). Figure [98]3C shows another Venn diagram, identifying
distinct and shared areas of detections (overall, 896 DE bioentities)
that are specifically or jointly contributed by the RNA-Seq profiling
methods, and cast within the LRM framework. The shared central core is
halved compared to consensus (68 vs 115), and the peripheral regions
with distinct detections are sparsified. Therefore, significantly
estimated LRM coefficients select subsets of the detections, and reduce
the impact of each method alone over the results. Figure [99]3D reports
PCA detections (overall 752), then Fig. [100]3E compares LRM with PCA
and identifies approximately one third of common detections.
Figure [101]3F is for biotype cross-classifications. The observed
limited overlap is relevant, in our view: substantial separation
between the final selections from the model scenarios indicates that
highly informative DE values are not necessarily brought by the
component explaining much of the data variability, but rather depend on
biotype associations that LRM establishes as significant. Figure [102]4
is centered on LRM, our chosen framework for ensemble modeling. The
rationale is that regression offers an efficient setting for
computation with large data sets, and an immediate interpretation of
both the role of the parameters and the significance of the results
(more in Discussion section).
Figure 4.
[103]Figure 4
[104]Open in a new tab
LRM output. Three panels are reported with reference to the performance
of LRM: (A) shows a tree-map view of LRM vs consensus DE-space. The
comparisons are localized in all regions in which both individual
methods and combinations of methods (from 2 to 5) have performed and
delivered their detections. LRM is visible always in just a small
fraction of space. (B) presents the results screenshot of the LRM
script in R. (C) shows comparisons between multiple LRM statistics used
to establish the predicted profile, and given the adopted IQR levels
(top right inset).
Figure [105]4A describes a tree map view of the relative occupancy of
the DE-space by LRM compared to consensus findings. By preserving
proportions between LRM-selected regions and consensus ones, it clearly
shows that LRM reduces enormously data redundancies, and likely also
data complexities related to the algorithmic errors, noise effects,
spurious detections etc. The ensemble strategy is clearly not panacea,
as some of the lost DE-space may still be informative. However, the
detected regions which appear shared between LRM and consensus are more
consistently and robustly identified due to the possibility of merging
synergistically method-specific profiles while quantitatively assessing
the significance of contributions to calls obtained in model-driven
way. Figure [106]4B presents for reproducibility purposes a screenshot
of how LRM runs within R (more details are provided in Supplementary
Tables [107]8.2.1-4). Here, a chosen profile is used as the response
variable (Cuffdiff, in this case) and by the residual method-specific
profiles (explanatory variables or regressors). All combinations are of
course possible, and the algorithm indeed ensures their inclusion.
Figure [108]4.C indicates the statistics used to evaluate the LRM
performance, including both means and medians. Note that Limma turns
out to generate the most outlying profile of the pool. This is
relevant, as the ensemble modeling strategy primarily aims at reducing
data redundancies and complexities, while improving the potential
inferability and generalizability of the approach (even beyond the
melanoma context).
Following the ensemble modeling strategy, it is important to understand
how the LRM regions affect the system’s predictive performance. In
particular, what method should be used for response and what for
predictive purposes in the regression setting? To answer this question,
we chose to run regression model multiple times according to the
rotation of variables serving each role, response vs regressors. This
scheme is known as Alternating Regression Model^[109]37 and measures
how each variable contributes to the predicted response values
according to a cross-validation scheme. Since each variable alternates
its role at both sides of the equation, the final step involves
computing statistics like mean and median values across both the
replicated model runs and the predicted profiles. Intuitively, the LRM
stability improves by rotating the variables, and we noted that an
outlying role is played by Limma. We averaged the final profiles with
and without it, but found the best results when Limma is excluded from
the final averaging step. Its outlying values are not matched by other
methods, meaning that no method contributes to its profile when Limma
is used as the response. Instead, its predictive effects are retained
(i.e. Limma as regressor) because the ensemble LRM models neutralizes
them by assigning non-significance to the corresponding regression
coefficients. Finally, gene and ncRNA signatures derived from each of
the DE profiles required some control of variability, and we applied
thresholding through the boxplots and the IQR measures applied to the
estimated LRM coefficients. The stringency of the assigned thresholds
affects the size of our final selections.
Pseudogene-Parental Genes: relationships
Figure [110]5 focuses on associations between DE pseudogenes and
parental genes, whose patterns are determined by our computational
pipeline. Such patterns have led to controversial conclusions in
literature, due to correlative behavior between the two profiles that
is inconsistently noticed. Figure [111]5A shows profiles measured
within confidence bands for their expression levels. The central insets
represent examples retrieved from a variety of target types.
Figure [112]5B is about patterns from scatter plots measured at various
log2(FC) expression levels. The slopes appear weak, but recurring
across such levels. Despite a relatively low correlation, we note
pattern persistency, which appears as a symptom of dynamics that are
robust to random and outlying erratic effects. Figure [113]5C shows
IQR-driven selections, depending on the statistics. We picked the
simple average at the leftmost boxplot; each threshold choice brings a
different number of significant detections (Supplementary
Tables [114]8.2.5-8).
Figure 5.
[115]Figure 5
[116]Open in a new tab
Pseudogene-parental gene analyses. (A) Joint profiles of pseudogenes
and parental genes from consensus (with significance bars). Insets are
biotype associations examples. (B) Scatters at various expression
levels, and empirical correlation patterns. (C) Protein network of DE
target parental genes with associated at least expressed pseudogenes
(Supplementary Fig. [117]8.6.3 and Supplementary Table [118]8.6.5).
Enlarged networks in Supplementary Fig. [119]8.6.4.
Interesting evidences appear from the network in Fig. [120]5C (similar
methodology was recently proposed in a different disease
context^[121]38). Formed from the DE parental genes associated with
expressed pseudogenes (including DE), the network consists of
interacting proteins retrieved from STRING V10.0^[122]39
([123]www.string-db.org), also annotated with GeneCards: the Human Gene
Database^[124]40 ([125]www.genecards.org). Given an intermediate level
of confidence (0.4), from both experimental and database knowledge,
these interactions occur between proteins relevant for cell
proliferation, adhesion, motility, morphogenesis, together with cell
cycle and oxidative stress.
The parental protein coding gene symbols determining the interactions
are here reported with the symbol–indicating interaction:
ATP1B1-ESAM-VWF-TF-PROS1-GJA1-TUBA4A-CDC20-PLK1-RRM2-NCAPG;
PINK1-PRDX3-PGAM1-ENO3; CTU1-PPA1, HLA-B-GBP1;
PLIN2-CDC42EP3-RHOQ-EXOC5-OPHN1-GNB4-HTR7. To some extent, these are
all processes relevant to cancer, implying potential regulation
underlying the target parental genes represented in the network and
exerted by the either expressed or associated DE pseudogenes. When only
DE pseudogenes are considered, we could observe a correlative scatter
(Fig. [126]5C, bottom right inset). The few associations we found
revealed concordance in both log(FC) size and sign (over/under
expression). Note that some parental genes present multiple associated
pseudogenes, but in general we found only one expressed example among
multiple candidates (more details are provided in Supplementary
Tables [127]8.6.1-5).
Interestingly, the PTEN- induced putative kinase 1 or PINK1,
interacting with PRDX3, appears annotated with oxidative
stress^[128]41, and is relevant because associated to both Parkinson
disease and cancers^[129]42,[130]43. In particular, melanoma is more
frequently associated to Parkinson patients than other cancers^[131]44.
PINK1 overexpression/underexpression respectively protects/sensitizes
cells against/to stress. Also, PINK1 has a crucial role in
mitochondrial homeostasis. Mitochondria are the major cellular sources
of reactive oxygen species (ROS), being critically involved in many
cellular processes, including energy production, metabolism, redox
control, and programmed cell death. Their dysfunction is identifiable
in numerous human diseases, including cancer^[132]45,[133]46. Thus,
PINK1 is essential for survival and a potential drug target due to its
mediating role towards PTEN^[134]47. Despite the expression of PINK1 is
up-regulated in melanoma with high metastatic potential, the molecular
mechanisms by which PINK1 protects cells against apoptosis are unclear
Notably, PINK1 is among the target parental genes that we have
validated (see below).
Bioannotations
Mutations
Among the mutations involved in melanoma pathogenesis, those repeatedly
identified from BRAF and NRAS oncogenes have been
investigated^[135]48,[136]49. Two groups were obtained from both
mutations, and are reported in Supplementary Tables [137]10. The
mutated BRAF is V600E in 70–80% of all BRAF mutations in all cancers.
Their relevance depends on the fact that targeted molecular therapies
have been centered on BRAF, through BRAF inhibitors, and have included
also combinations of oncogenic signaling cascade/pathway targets to
improve long-term patient response and overall survival^[138]50.
Mutations in NRAS, occurring at limited frequency compared to BRAF,
were considered too. These two mutation types are almost always
mutually exclusive. We accounted for the fact that approximately 50% of
melanomas of all clinical types present an activating mutation in the
BRAF oncogene, while about 20% of the cases show an activating mutation
in NRAS, leading to the activation of the MAPK signaling pathway
(cellular proliferation and survival depend on such pathway).
Previously, whole-genome sequencing of 25 metastatic melanomas and
matched germline DNA revealed 11 genes significantly mutated, for BRAF
in 16 samples and for NRAS in 9 samples^[139]3. The raw reads of SKCM
samples were downloaded from TCGA site and were mapped on human genome
using TopHat v2.0.11. SNPs were called by GATK 2.5-2-gf57256b.
Non-synonymous mutations on V600 of BRAF were detected on 57 cases, but
19 were then discarded due to the low sequencing depth (<5x) on the
mutation site, thus remaining with 38 cases. Non-synonymous mutations
on Q61 of NRAS were detected on 17 cases, but 2 were then discarded due
to the low sequencing depth (<5x) on the mutation site, thus remaining
with 15 cases. Also, 33 cases have revealed no mutation on either V600
of BRAF or Q61 of NRAS, thus making the mutation-free group. The DEGs
were detected by CuffDiff v2.2.1, with cutoff q-value < = 5% and
log(FC) >1.2 between the BRAF-mutation group and the mutation-free
group, and between the NRAS-mutation group and the mutation-free group.
Overall, 1254 DEGs were detected in the former comparison, while 716
DEGs were detected in the latter one.
Pathways
We considered annotations from GO (Supplementary Fig. [140]6.1.2.1-5)
and pathway terms (Table [141]1), and the lists generated by each
individual profiling method (Supplementary Fig. [142]6.1.1.1-5) that
formed the union of the consensus, together with the model-driven
lists. ClueGo offers a selection of enriched terms. Note that each
method-derived profile generated a list of terms enriched by gene sets
whose biological significance is naturally affected by the DE space
volume. Among the top-entries (sorted by enrichment) that were listed
for each individual method (Table [143]1 , left), we found some
interesting terms (note that colored entries identify cancer-relevant
terms). The ECM Receptor Interaction appears in four over six lists
(including PCA) followed by the CAM (cell adhesion) term in three over
six lists. PI3K-Akt, Rap1, PD-1 appear in three separate lists.
Table 1.
Pathway terms ranked by methods (left) and LRM. *is for cancer specific
terms.
DESeq LRM
*ECM-Receptor Interaction Collagen Formation, Assembly of Collagen
Fibrils, Collagen Biosynthesis Cytokine-Cytokine Receptor Interaction
*Cell Adhesion Molecules ( CAMs ) *PI3K-Akt Signaling Pathway Protein
Digestion and Absorption *Cancer Specific Terms (FDR-corrected P-value)
Cell Adhesion Molecules (6.53E-10)ECM-Receptor Interaction
(5.68E-06)Focal Adhesion (1.54E-05)PI3K-Akt Signaling Pathway
(2.52E-03)p53 Signaling Pathway (9.31E-03)ECM Organization (3.41E-02)
NOISeq Top Scored Terms (FDR-corrected P-value)
* ECM-Receptor Interaction Collagen formation, Assembly of Collagen
Fibrils, Collagen Biosynthesis, Protein Digestion and Absorption *Cell
adhesion molecules ( CAMs )Leishmaniasis, Interferon Gamma Signaling
Phagosome (7.42E-12)Complement System (8.09E-11)Protein Digestion and
Absorption (6.53E-10)Leishmaniasis (3.59-08)Collagen Formation
(1.26E-08)Allograft Rejection (1.24E-07)Viral myocarditis
(1.55E-07)Graft-versus-host disease (1.75E-07)Type I diabetes mellitus
(3.19-07)Asthma (1.45E-07)Rheumatoid Arthritis (6.46E-06)Cell Cycle
(5.68E-06)Autoimmune thyroid disease (4.02E-05)IgA Intestinal Immune
Network Production (1.31E-05)Tuberculosis (4.31E-05)Hematopoietic cell
lineage (2.02E-04)Pertussis (2.12E-04)Systemic Lupus Erythematosus
(2.60E-04)HTLV-I infection (2.88E-03)Spinal Cord Injury
(1.19E-03)Inflammatory Bowel Disease (IBD) (2.16E-03)Toxoplasmosis
(5.02E-03)Natural killer cell mediated cytotoxicity (6.95E-03)Insulin
Processing (2.43E-03)Leukocyte transendothelial migration (1.27E-02)
CuffDiff
*ECM-Receptor Interaction Immunoregulatory Interactions between a
Lymphoid and a non-Lymphoid cell *Cell adhesion molecules ( CAMs
)Collagen formation, Hemostasis Assembly of Collagen Fibrils Natural
killer Cell Mediated Cytotoxicity *Rap1 signaling pathway
Limma
*ECM-Receptor Interaction Hemostasis, Complement and Coagulation
Cascades, Collagen formation, Cyclin B2 Mediated Events,
Cytokine-Cytokine Receptor Interaction, Collagen Biosynthesis
GeneSpring
Cell Cycle, Mitotic Prometaphase, Mitotic Metaphase and Anaphase,
Separation of Sister Chromatids, Mitotic Anaphase, Resolution of Sister
Chromatid Cohesion
PCA
Rheumatoid Arthritis, Phosphorylation CD3 and TCR, Hematopoietic Cell
Lineage, TCR Signaling, ZAP-70 Translocation, Cytokine-cytokine
receptor interaction *PD-1 signaling
[144]Open in a new tab
By looking at the LRM terms (Table [145]1, right), note that a change
of significance can occur in both terms and gene sets enriching for
them, due to different criteria of selection. Some terms are known to
be strong players in melanoma, and we reported then at the top. CAM,
ECM and PI3K-Akt appear specifically as before, together with Focal
Adhesion and p53. The top-score list includes a variety of terms. More
in detail, PD-1 is an inhibitory receptor found in T-cells, B-cells and
myeloid cells, highly expressed under certain conditions in which
important immune system modulators (T/B cells etc.) are activated. Of
interest the involvement of kinases that are inhibited by PD-1 after
binding to ligands, and thus preventing overstimulation of the immune
response^[146]51–[147]53. Its ligands have different expression
patterns, and interactions with them attenuate the immune response and
protect tumor cells from cytotoxic T-cells. PD-1 is also expressed in
patients with metastatic melanoma particularly within the tumor
microenvironment, suggesting that the immune response to melanoma is
inhibited. PD-1 (likewise PD-L1) is therefore currently considered a
primary therapeutic target. The PI3K (phosphinositide 3-kinase)–Akt
pathway is studied to decipher molecular pathogenesis, heterogeneity
and resistance mechanisms, offering emerging leading targets for
compounds^[148]54,[149]55 [.] Melanomagenesis involves ECM signaling,
p53, RAS/RAF/MEK/ERK, among others. It is also interesting the presence
of the Complement system (LRM, but also Limma), an essential element of
the innate immune response that reacts as a first line of defense
against pathogens and in response to molecular patterns associated with
abnormalities in host cells and extracellular environment^[150]56.
Biotypes & Gene families
We expanded the annotations to include gene families and biotype
decompositions (and more details in Supplementary Tables [151]7.1 and
8.5.1). Here we include further characteristics. In biotypes
(Table [152]2) we list ENSEMBL classes, and we may note the prevalence
of antisense, lincRNA and pseudogenes over the other classes, with
exclusion of protein coding genes (parental genes are a subset of them,
reported apart to emphasize their target role). Together these three
classes of ncRNAs cover in some cases about one fourth (DESeq, LRM) or
even one third (Limma) of the total number of listed DE bioentities.
Gene families (Table [153]3) instead enriched qualitatively the
descriptions of many of the protein coding genes that were classified
in the previous lists. It is interesting to verify the sensitivity of
such classifications by the employed methods, observing that DESeq and
Limma rank first in four classes each, and again Limma and CuffDiff
rank second in four and three classes, respectively. The two models,
PCA and LRM, are of course subject to more restrictions compared to the
consensus, and delivered overall just a fraction of entries in each
class here considered (the only exception being NOISeq, which appears
very conservative).
Table 2.
Classification by Model Biotypes.
Biotypes DESeq NOISeq CuffDiff Limma GeneSpring LRM PCA
3′ Overlapping ncRNA 4 0 1 5 1 0 1
Antisense 412 14 197 646 150 65 157
lincRNA 349 17 243 574 96 50 41
miRNA 14 0 25 5 11 2 11
misc_RNA 6 0 51 12 16 1 1
Parental Genes 355 62 431 549 631 115 49
Processed Transcript 213 8 169 302 55 26 36
Protein Coding 2,811 438 2,390 2,962 2,248 648 418
Pseudogene 468 32 250 618 151 85 72
Sense Intronic 25 0 9 107 14 9 1
Sense Overlapping 16 0 5 26 2 0 0
snRNA 2 0 2 4 2 2 2
snoRNA 6 0 1 23 12 7 4
Total 4,681 571 3,774 5,833 3,389 896 752
[154]Open in a new tab
Table 3.
Gene Families Associated with Protein Coding Genes.
Gene Family DESeq NOISeq CuffDiff Limma GeneSpring PCA LRM
Tumor Suppressors 4 0 8 9 6 1 1
Oncogenes 52 5 57 65 56 12 14
Translocated Cancer Genes 48 5 54 59 51 10 14
Protein Kinases 70 7 79 101 82 15 18
Cell Differentiation Markers 112 27 94 105 45 23 22
Homeodomain Proteins 48 2 18 36 7 4 3
Transcription Factors 179 11 127 176 163 23 16
Cytokines and Growth Factors 127 25 83 83 26 23 13
Total 640 82 520 634 436 111 101
[155]Open in a new tab
Notes: number of detections in each cell refers to differentially
expressed values. ENSEMBL classification (Table [156]2); other
annotations (Table [157]3).
Validations
We first considered consensus data (Fig. [158]6A), and made a few
selections. Then, we considered LRM-derived selections (Fig. [159]6B).
Candidate selections were organized in a list including measurements
and bioannotations for protein coding genes (details in Supplementary
Fig. [160]9.1 and Supplementary Text [161]9.2). Focus went on the list
of DE pseudogenes and DE parental genes, together with other DEGs. As
consensus involved the application of five different RNA-Seq methods,
the selection considered up- or down-regulated DEGs and delivered three
pseudogenes and 111 DEGs, including eight parental ones. All genes were
ranked on the basis of the corrected p-value attributed by each
individual algorithm. As an example, we ranked 115 DEGs on the basis of
the corrected p-value attributed by Exonic DESeq, and then assigned
score 1 to the lowest one, 2 to the second lowest one, and so on. This
procedure was repeated using the corrected p-values attributed by the
other four algorithms. Finally, we summed up all scores together and
calculated a final score. The list was scanned top-down in search for
candidates for which the design of specific primers was possible. The
chosen candidates were at the end three DEGs: ARHGDIB, DCN, SSH1, and
two parental genes: NQO1 and NREP. These genes were DE in melanoma
samples compared to melanocytes. Therefore, in order to confirm these
results, we measured their expression levels in melanocytes and
melanoma cell lines using rtPCR. Both the down-regulation of ARHGDIB
and DCN, and the up-regulation of SSH1 were confirmed. However, we did
not confirm the up- and down-regulation of NQO1 and NREP, respectively.
We then considered the LRM outcomes. For the selection of the genes to
be validated by rtPCR, we prioritized parental genes as the reference
biotype, then we selected genes that encode for proteins belonging to
cancer-related pathways such as the PI3K-AKT pathway (SGK1 and PIK3CD)
or the cell adhesion pathway (MPZ), for kinases (e.g. PINK1, TGFBR1 and
TWF1) or for tumor suppressors (FH). The chosen candidates are nine
protein coding genes, six of which are parental genes. Using rtPCR we
confirmed down-regulation for SGK1, NQO1, PIK3CD, FH, TGFBR1, TWF1,
PINK1 and up-regulation for MGP and MPZ. Note in particular the case of
NQO1, initially not validated according to the consensus, but then
validated according to LRM, this of course depending on the different
selection operated in the two cases.
Figure 6.
[162]Figure 6
[163]Open in a new tab
Validated evidences: from consensus (A) and LRM (B) (details in
Supplementary Fig. [164]9.1 and Supplementary Text [165]9.2).
Discussion
Knowledge advances in cancer genome and transcriptome studies depend,
among other factors, on the integration between experimental and
computational resources and skills. Such type of integration operates
through data-customized bioinformatics pipelines. Despite a substantial
variety of proposed methods, careful literature review suggests that a
restricted number of them are routinely involved in major applications,
thus making them standard approaches. Such process generates a sort of
biased automatism through which the confidence assigned to results
impacts cancer studies, but is not supported by comparative evaluations
between methods or tests measuring sensitivity of results. Literature
has devoted attention to different normalization techniques, for
instance, without demonstrating solid criteria for preference between
methods. The field keeps evolving, naturally enough, but with
uncertainty about our detection accuracy. Additionally, and in cancer
studies especially, some of the existing tools may be too conservative,
and new supersets of data could be needed to cross-validate analyses
and findings. This consideration is applicable to other
studies^[166]57. Our contribution turns to the search for possible ways
of selecting evidence that may yield a solid knowledge base
specifically for melanoma, but not limited to this cancer.
Additionally, we focus on protein coding genes and ncRNAs, and in
particular pseudogene-parental gene associations that were retrieved
from the SKCM data. The relevance is clear: despite pseudogenes are
only dysfunctional copies of protein-coding genes, pseudogene
transcription has a pervasive presence in the human genome^[167]58. The
correlation patterns between intronic pseudogene expressions and
unrelated host gene expression values were not revealed, and no
correlative patterns with their parental genes appeared from the 10
melanoma samples that were included in the study^[168]58. Correlation
between pseudogenes and parental genes was instead found for about 65%
of pairs to be positive, depending also on the number of predicted
miRNA targeting sites^[169]23. Considering another scale, it has been
demonstrated a combined regulative action of pseudogenes over a miRNA
target, the tumor suppressor gene PTEN expression, through the
mechanism known as ceRNA (competing endogenous RNA)^[170]59,[171]60.
Overall, from both biological and computational viewpoints, the
sequence similarity between pseudogenes and parental genes contributes
to the complexity of both sequencing and gene expression profiling, and
even validation. This global complexity calls for additional
research^[172]61. We have indicated association patterns showing a
certain consistency. However, limitations are naturally present and
they mainly concern the fact that detecting pseudogene transcripts is a
complicated matter with a few critical factors to keep in mind:
* Technological limitations: they affect the real detection power and
may act as confounders for predictive algorithms, especially when
the similarity of pseudogenes with parental genes is high;
* Computational problems with regard to alignment, i.e. reads that
align to multiple genomic locations. RNA-Seq presents read mapping
uncertainty causing loss of both information and coverage;
* Assembly limitations: a limited resolution and the fact that
transcript variants from the same gene may have exons in common,
can cause ambiguity difficult to resolve;
* Sequencing depth: it is well-known that low or broadly expressed
genes are not strongly supported by RNA-Seq as only abundant
transcripts are fully assembled;
* Other problems come from sequencing non-uniformity, presence of
novel isoforms (alternatively spliced transcripts), quantification
of expression levels, and harmonization of data normalized
according to various criteria, for instance form experiments with
different sequencing depth.
Following the ideal path of transforming observations (data) into model
outcomes (predictions) to reconcile system’s dynamics, a limitation is
that moving from observations to inference, and then to validation, is
not usually a linear path. In our view, the strategy of moving from a
low- to a high-confidence set of DE bioentities develops along two
routes: a) Empirical approximation through PCA, i.e. projecting the
data to perform de-noising and extracting the prevailing signal in
terms of variability; b) the path of LRM, which exploits the prediction
power of each method according to its significant contribution to a
response profile. Therefore, we may aim at approximating with
acceptable accuracy the recoverable system’s dynamics^[173]62. Ensemble
modeling improves results by targeting a multitude of significant
evidences, which here are the detections obtained from RNA-Seq data,
and it achieves superior accuracy than any individual method. Benefits
have been shown here in terms of bioannotations (GO categories,
pathways, gene families), and validations.
Naturally enough, LRM encounters known limiting factors, such as the
possible violation of linearity, an excess of variability across the
covariates, and the risk of overfitting. Methodological solutions for
such problems that go beyond naïve linear model approaches are
well-known (ridge regression, regularized solutions, LASSO, LARS,
etc^[174]63.). However, we have proven that the stability of the system
can be promptly checked, and corrections can be made to improve the
overall performance. Our regression framework simply stacks together
all detections in a matrix with rows assigned to the distinct
measurements (outcomes) and columns hosting the ‘explanatory variables’
(predictors). The contribution of each method is therefore mediated by
the linear combination which determines the optimal prediction for each
DE entry listed as the ‘dependent variable”. This way, an ensemble
predictor built-in in a regression framework can be provided by
averaging effects across the rows such that all model variables are
re-modulated by significant coefficient estimates aimed at tuning the
prediction power.
Cancer systems present uncertainties in model structures and parameter
values. One would like to measure not only the goodness of fit, but
also the complexity of a model. Reducing the complexity would help
identifying core dynamics, thus increasing the chances of best fit.
The analysis of ensembles of models which vary in parameter values is a
powerful strategy for reducing ambiguity in systems biology and
medicine. The model ensemble may encompass different structures to
capture the systems dynamics. Whether these dynamics are stylized or
simplified is relevant, as cancers are very partially known complex
systems. Therefore, models may be sloppy^[175]64. Sloppiness is not
simply concerning a lack of data such that trivial parametrizations
lead to non-identifiability of the parameters (undetermined systems),
but might involve more subtle aspects, such as redundancy or
compensatory effects between sets of parameters^[176]65. It is also
likely that both important and not important parameters may be often
hard to separate.
To conclude, we recall a very useful methodological principle that we
have pursued through the suggested ensemble approach: along the path of
‘weakening inference algorithms’^[177]66, and in order to solve complex
problems such as dimensionality reduction and identification of
coresets, being these gene sets for example, an effective strategy
requires the concentration of the total information into salient data
features. Although sub-optimal (loss of DEGs and DE-ncRNAs, residual
multicollinearity, spurious correlations from confounding effects
etc.), this idea works under simple model assumptions, such as those
here enabled by regressions. The great value of the performed analyses
is to increase the ‘data strength’ and through this step obtain a
high-confidence compendium of selected bio-entities.
Methods
RNA-Seq Samples
103 primary tumor Cutaneous Melanoma (SKCM) samples were downloaded
from the Cancer Genome Atlas (TCGA) CGHub repository
([178]https://tcga-data.nci.nih.gov/tcga ). 2 control samples were also
downloaded from the Gene Expression Omnibus (GEO) repository^[179]67,
from a study of with data accessible at NCBI GEO database with
accession [180]GSE37169^[181]68. The total number of 105 samples were
downloaded in FASTQ format^[182]69.
Sample Processing
All sample reads were mapped to the Ensembl^[183]70 human genome
reference (Genome Reference Consortium build 37, Ensembl v.72) using
the TopHat 2 spliced aligner^[184]71. The human genome reference
sequence was prepared for mapping by collating and indexing the 25
standard chromosomes (22 autosomes, X and Y Chromosome, and the
Mitochondrial chromosome) from Ensembl. The ‘bowtie2-build’ indexer
program from Bowtie2^[185]72 was used as the indexing step and the
TopHat2 spliced mapper was used to independently map each of the sample
reads from TCGA and GEO. Tophat2 requested a GTF annotation file
([186]http://mblab.wustl.edu/GTF22.html) from Ensembl. After mapping,
Samtools^[187]73 was used to process the output files from the 105
samples, thus converted to a format (.BAM) suitable for downstream
quality assurance and analysis steps.
Quality Control
Each sample was inspected to detect possible anomalies. Issues emerged
with the percentage of mapped reads in two of the TCGA tumor samples,
therefore excluded. The residual 101 TCGA tumor samples, and the two
controls (from GEO), were thus analyzed.
Mapping Qualities
Mapped reads for each sample were sorted and binned based on their
mapping quality (MAPQ) values. SamTools was used to filter reads based
on MAPQ values. TopHat2 bins read alignments based on 5 MAPQ classes:
values of 0, 1, 2, and 3 are assigned for read alignments with multiple
hits of 10, 4-9, 3, and 2 locations, respectively; a value of 255 is
for unique alignments. Only unique read alignments for each sample were
extracted from the output files, and used for downstream analysis.
Expression Quantification
Unique read mappings from the 103 samples were used to create
expression units for each gene in the Ensembl v.72 database, this for
each sample. Each gene in Ensembl was quantified using FPKM and Read
Count units^[188]74. We applied two constraints: genes showing an FPKM
greater than 1 and showing >10 reads in at least 10% of samples were
included for downstream analysis, in both conditions. Different
combinations and permutations of FPKM and number of reads at variable
percentages were tested (details in Supplementary Table [189]2.1-2 and
Supplementary Fig. [190]4.1-2).
* FPKM. FPKM units were calculated using CuffDiff. The
CummeRbund^[191]75 R package to process and extract the FPKM data.
A table of FPKM values was created for downstream analysis,
containing genes as rows and samples as columns.
* Read Counts. Read count units were calculated using the HTSeq
software^[192]76 for both exonic reads (reads mapping to exons
only) and intronic reads (reads mapping to both exons and introns).
A table of read count values was created for downstream analysis,
containing genes as rows and samples as columns (Supplementary
Tables [193]5.4.1-2 report intron results too).
Pseudogene-Parental Genes
The criteria for assessing transcribed evidence are mainly two: 1)
Reads mapped to the pseudogene sequence and not to the parental gene;
2) Reads mapped to both the pseudogene and to the parental gene, but
with lower sequence similarity (<90%). Target parental genes were
identified by aligning the pseudogene sequences using BLAST against a
database of the protein coding CDNA sequences from Ensembl (v. 72). The
best hit matches for a pseudogene sequence were selected based on
e-value scores, and the best overall hit for a pseudogene was selected
as its parental gene. BLAST results with e-value scores can be found in
Supplementary Table [194]1.1. In total, 13,151 candidate associations
were found from the BLAST results. After removing duplicates, 4,035
unique protein coding parental genes were identified (Supplementary
Table [195]1.2). Parental gene and child pseudogene associations can be
found from consensus (Supplementary Tables [196]8.5.3.1-2 and
Supplementary Fig. [197]5.3.3-4).
Differential expression
FPKM and read count values were used for testing relative differential
expression between the TCGA primary tumor samples and the GEO controls.
Five (5) Methods were employed with different (FPKM or read counts)
units as input. FPKM-based chosen methods were: limma from
Bioconductor^[198]77, GeneSpring
([199]http://www.genomics.agilent.com/), and CuffDiff. The chosen read
count methods were DESeq and NOISeq (both in Bioconductor). Each method
was run using the standard protocol and parameters available in their
documentation, and output p-values were corrected for multiple
comparisons using false discovery rate (FDR).
Methods for Profiling
Each of the differential expression methods pooled the two conditions
tumor and control together (Supplementary Tables [200]5.1.1-5).
* DESeq. The R package (v. 1.18.0) performs analysis of read counts
from HTSeq by estimating variance-mean dependencies, normalizing by
an estimate of the effective library size, and testing for
differential expression using a negative binomial distribution.
* NOISeq. The NOISeq package (v. 2.8.0) is a non-parametric method
for analyzing read count data from HTSeq. There are no assumptions
about data distributions, and normalization of count data is by
using a trimmed mean of M-values (TMM)^[201]60.
* CuffDiff. The CuffDiff software (v. 2.2.1) models the variability
of a gene’s RNA-Seq measurements by considering RNA-Seq fragment
mappings and the gene’s splicing structure. CuffDiff normalizes the
library size by using a quartile normalization, and estimates the
replicate dispersion with a Poisson model.
* GeneSpring. The GeneSpring software (v. 13.0) (Agilent
Technologies) analyzed FPKM values obtained from CummeRbund. A
standard workflow for expression data was used, and no further
normalization was applied. Tests for differential expression were
calculated using a Moderated T-Test.
* limma. The limma package (v. 3.22.6) analyzed the FPKM data from
CummeRbund. The “classic” workflow was performed, i.e., no ‘voom’
transformation was applied. Tests for differential expression were
calculated using a Moderated T-Test.
Pathway Analysis
DEGs from each algorithm were analyzed for pathway enrichment using the
ClueGO^[202]78 plug-in for Cytoscape^[203]79
([204]http://apps.cytoscape.org/apps/cluego) The ClueGO db for Gene
Ontology (GO)^[205]80, Kyoto Encyclopedia of Genes and Genomes
(KEGG)^[206]81, WikiPathways^[207]82, and Reactome^[208]83 were used
for pathway enrichment analysis with default values, with the GO
analysis being run independently from the other databases. Several
molecule pathways are associated with melanoma, and some with clinical
relevance (see a comprehensive list in^[209]84. Statistical analysis
included enrichment performed with Benjamini-Hochberg p-value
correction. Details on various enrichments are reported in
Supplementary Tables [210]8.4.2-4.
Principal Component Analysis (PCA)
The function ‘prcomp()’ from the R “stats” package was used to perform
PCA on both the counts and the FPKMs, thus covering profiles from all
the methods. PCA coefficients from the first PC were used to perform
box-plot thresholding (data points outside the maximum and minimum
whisker boundaries were kept as outliers).
Linear Regression Model (LRM)
A merged matrix of log2(FC) from the differential expression results of
all methods was created and used as input to LRM. The matrix consisted
of genes (rows) and algorithms (columns), with each cell containing
gene or ncRNA values. Values that were not identified by a method were
flagged as missing in the matrix. The standard form “y ~ model” is
applied, by the R function “lm()” designed to fit the LRM according to:
[MATH: CuffDiff~DESeq+NOISeq+Limma+GeneSpring :MATH]
The above simply denotes that the outcome is assigned by one method,
say CuffDiff, establishing the model response variable associated to
explanatory variables forming linear combination of values obtained
with DESeq, NOISeq, Limma, and GeneSpring (i.e. the other methods). The
placement of such methods as response vs explanatory variable is
subject to rotation, according to the idea of Alternating Regression
Modeling. Predictions by the model were obtained using the R function
“predict.lm()”, then filtering by a suitable multiplier of the
interquartile range (IQR). Preference went to predicted values outside
3*IQR range.
Further notes on models
PCA-driven selections
In principle, PCA primarily finds the data directions explaining most
of its variation. A dataset D, normalized into D’, has covariance
matrix computed as COV(D’), and its eigenvectors (sorted from the
largest to the smallest eigenvalues) represent the PCs. The principal
component coefficients, also known as loadings, are to be found. Given,
say, the n-by-p data matrix D, rows of D correspond to observations and
columns correspond to variables. The coefficient matrix is p-by-p. Each
column contains coefficients for one PC, columns in descending order of
component variance. Equivalently, the PCs return information contained
in the projected data through the correspondingly estimated
coefficients, with the highest relevance assigned to the direction of
maximal variance (or principal direction). In our setting, the tumor
baseline signal is contrasted with the control data, and the noise is
reduced by shrinkage operated from the boxplots. Therefore, each method
is subjected to extraction of the first component (PC1), and a set of
boxplot outliers is obtained in each case.
LRM-driven selections
With regression, thresholding is applied via the interquartile range or
IQR, a measure derived from boxplots. Optimality would imply some
assumptions, a main one being predictors’ independence. This assumption
does not usually hold, and sub-optimal conditions apply. As our setting
presents intersections among the identifications obtained from five
methods (e.g. overlapping sets), the emerging gene or ncRNA signatures
must be confronted with the inevitable collinearity between regressors,
i.e. the method-dependent profiles. Such redundancy may be reduced by
enforcing shrinkage to the predicted values, through a robust selection
rule, i.e. IQR × 1.5 (2 or 3). The selections represent linear
combinations of multiple profiles.
Model selection
There might be a subset of predictors relatively more important than
the rest. One strategy is to insert penalties in the model, useful to
control the number of predictors involved. As an alternative, we
proposed an ARM model type, one working by rotating the dependent and
independent variables. Once an initial response is identified with one
of the variables, the residual variables play the role of
regressors/predictors for the response. It is a sort of cross- or
hold-out validation principle applied to assess the goodness of fit of
each method. Repeating the regression generates vectors of fitted
values that need to be averaged at the end, thus yielding the final
consensus DE vector. Note two aspects: (a) The dimension of each
response varies depending on the profile to predict. The matrix of
regressors will adapt to the response’s dimension by only matching
selected response values; (b) Estimates of the regression coefficients
guide the selection of variables (i.e. if the coefficients are
significant, then the variables will be considered to contribute to the
fitted values, otherwise they will be excluded).
Note that LIMMA inclusion revealed not significant regressors. Thus,
LIMMA fitted values do not benefit from regression modeling, under the
given hypotheses. This means that by delivering a much wider but
different profile than other methods, LIMMA covers the gene space in
regions unexplored by the other methods. However, this subspace is
uncertain in terms of number of false positives. We cautiously keep
LIMMA out of the final average for the fitted values. The LIMMA anomaly
was also emphasized with the use of PCA, and in that case too the bias
introduced was potentially too high to keep LIMMA’s evidences stacked
with the other ones. In terms of final consensus DE profile, we
obtained comparative results with crude average of outcome DEGs, total
averages from regression, and fitted averages obtained from ARM with
and without LIMMA. The most balanced distribution is always obtained in
the last scenario.
Data Availability & Software
All software and datasets can be obtained at the Center for
Computational Science’s GitHub repository, available at the following
URL: [211]https://github.com/ccs-bio/melanoma-transcriptomics. At this
repository, expression tables (FPKM and read count values) and scripts
(Python, Bash, R, etc.) for both RNA-Seq sample alignment and filtering
to extract unique read mappings are reported. R scripts for
differential expression can also be found for DESeq, NOISeq, and limma.
A bash shell script is also included with the CuffDiff command call,
along with the R script for processing the data with CummeRbund. The
experimental grouping for GeneSpring is also available., as well as
scripts for visualizations. All source code will be released under the
GPL v.3 license.
Ensembl Annotations
The analysis was based on the Ensembl annotations version 72. Ensembl
releases versions each year (Suppl Fig. [212]S6_2_4), and the current
version is 86. Version 72 (Supplementary Table [213]6.2.1) of the
annotations contained a total of 62,893 genes, while version 86
(Supplementary Table [214]6.2.2) contains 58,051. Comparing both sets
of annotations we found that 51,858 (82.45%) genes are contained in
both, while 11,035 (17.55%) of the version 72 genes are not present in
the current version, v.86. A comparison between genes present in the
latest version of the annotations relative to version 72 is included in
Supplementary Table [215]6.2.3)
Validations
Cell lines
Cells were grown at 37 °C in a humidified atmosphere with 5% CO2. All
the media used for cell culturing are reported in^[216]85, and were
supplemented with 1% Penicillin/Streptomycin (Euroclone; Milano,
Italy). The mutational status of BRAF and NRAS is reported in^[217]85
too.
Sequence retrieval & alignment
The cDNA sequences of the genes of interest were obtained from
Ensemble, inserting the code available in the tables. In order to align
sequences and detect their homology CLC Sequence Viewer 7 was used.
Primers
Primer design was performed using Primer3 ([218]http://primer3.ut.ee/),
while UCSC in silico PCR
([219]https://genome.ucsc.edu/cgi-bin/hgPc)rwas used to confirm the
specificity of primers (absence of off-targets). All primers were
purchased from Primm srl (Milano, Italy). Sequences are reported in
Table [220]1-[221]4 and Supplementary Text [222]9.3.
RNA extraction and retrotranscription
RNA was extracted using QIAzol (Qiagen; Venlo, Netherlands), following
the manufacturer’s instructions. 1 μg of total RNA was treated with
DNAse I, amplification grade (Invitrogen; Carlsbad, California)
following the manufacturer’s protocol. 0.5 μg of RNA treated with DNAse
I was retrotranscribed on a S1000 Thermal Cycler (Bio-Rad; Hercules,
California) using iScript cDNA Synthesis Kit (Bio-Rad; Hercules,
California). The successful retrotranscription and the absence of
contaminating genomic DNA were routinely checked through a control PCR
in which exon-spanning primers for ATPA1 gene are used. cDNA was
diluted (1:4) and used for qRT-PCR.
qRT-PCR
1 μL of cDNA and appropriate primers (Table [223]1) were used for
qRT-PCR reactions, using SsoAdvanced Universal Supermix (Bio-Rad;
Hercules, California) on a CFX96 Real-Time System (Bio-Rad; Hercules,
California). The reaction conditions were the following: 98 °C 30 sec,
(98 °C 3 sec, 60,4 °C 20 sec, 72 °C 10 sec)x39cycles. In order to
confirm the specificity of the reaction, a melting curve was performed
after each PCR (from 65 °C to 95 °C with an increase of temperature of
0.5 °C/sec). The annealing temperature was optimized by performing a
gradient with each primer pair (from from 55 °C to 64 °C). All
reactions were performed in duplicate; the raw values were calculated
using three housekeeping genes as reference (GAPDH, SDHA, PBGD)
according to the ΔΔCq method used by the CFX Manager Software (BioRad).
Electronic supplementary material
[224]Supplementary Files^ (58.4MB, zip)
Author Contributions
E.C. wrote the paper, conceived the ensemble methodological approach
and supervised the analyses. C.V. implemented the overall RNA-Seq
pipeline (coding and analyses), and the model-driven analyses,
contributing to the preparation of the material and methods. Z.J.
performed the mutational analysis and the specific RNA-Seq analyses.
S.S. performed functional validations supervised by L.P., who provided
guidance on pseudogenes and mutational analysis. N.T. supervised the
global project, offered insights while assessing the work, and allowed
access to the TCGA data.
Competing Interests
The authors declare that they have no competing interests.
Footnotes
Electronic supplementary material
Supplementary information accompanies this paper at
10.1038/s41598-017-17337-7.
Publisher's note: Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
References