Highlights
* The reproducibility of protein abundance measurements varies across proteins
* Measurement reproducibility contributes to variation in mRNA-protein correlations
* Aggregated protein reproducibility score integrates multiple reproducibility estimates
* Pathways with high mRNA-protein correlations may just be more reproducibly measured
Motivation
Because they are easier to measure in high throughput, mRNA abundances
are often used as a proxy measurement for protein abundances. However,
there is only a moderate correlation between the two, and it is unclear
to what extent this moderate correlation reflects post-transcriptional
regulation and to what extent it can be attributed to measurement
error. Here, by analyzing samples with replicate proteomes, we quantify
the extent to which replicate measurements of the same proteins are
correlated. We rank proteins according to their reproducibility and
show that more reproducibly measured proteins have higher mRNA-protein
correlation, suggesting that measurement error limits mRNA-protein
correlation.
__________________________________________________________________
Upadhya and Ryan identify substantial variation in the
reproducibility of protein abundance measurements across proteins. They
develop an aggregate protein reproducibility score and show that this
score explains variation in mRNA-protein correlations across multiple
proteogenomic studies. They show that pathways with higher-than-average
mRNA-protein correlations may simply be more reproducibly measured.
Introduction
Proteins are the primary actors in our cells, responsible for almost
all biological activities. Therefore, understanding how protein
abundances vary between healthy and disease states can provide
insight into how biological activities are altered in disease
conditions. Among patients with the same disease, e.g., breast cancer,
variation in protein abundances may explain differences in survival
outcomes ([26]Ősz et al., 2021) and drug responses ([27]Shenoy et al.,
2020). Consequently, significant efforts have been made recently to
characterize proteomes across large patient cohorts ([28]Ellis et al.,
2013). However, our ability to quantify protein abundances at scale has
lagged behind our ability to sequence genomes and quantify mRNA
abundances. Large-scale efforts to molecularly characterize healthy and
disease samples from humans have therefore primarily focused on DNA
sequence variation and transcriptomic variation.
As transcriptomes are easier to quantify than proteomes, mRNA
abundances are often used as a proxy for protein abundances. However,
the relationship between mRNA abundances and protein abundances is
complex and non-linear and varies significantly from protein to
protein. Consistent with this, large-scale studies in humans and model
organisms have revealed that for most genes there is only a moderate
correlation between mRNA and protein abundances ([29]Buccitelli and
Selbach, 2020; [30]Vogel and Marcotte, 2012). We note that correlations
between mRNA and protein abundances can be calculated in two different
ways: across all proteins within a given sample (i.e., in a given cell
line, are the most abundant proteins also the most abundant
transcripts?) or for a single protein across multiple samples (i.e., do
the samples with the highest levels of a specific protein also have the
highest number of transcripts coding for that protein?) ([31]Franks
et al., 2017; [32]Liu et al., 2016; [33]Vogel and Marcotte, 2012).
Here, we are concerned with variation across individuals, and so
throughout when we discuss mRNA-protein correlations, we are
calculating the correlation between the protein and transcript
abundance for an individual protein across samples.
Tumor samples in particular have been subject to transcriptomic and
proteomic profiling efforts, and these have provided insight into how
variation in mRNA abundances across individuals is associated with
variation in protein abundances across the same individuals. These
studies have reported an average mRNA-protein correlation in the range
of ∼0.2–0.5 ([34]Mertins et al., 2016; [35]Zhang et al., 2014,
[36]2016). This moderate correlation between mRNA and protein
abundances can be attributed to both biological and technical factors.
Major biological factors that influence mRNA-protein correlation
include translation rates that vary across proteins and conditions,
highly variable half-lives for both proteins and mRNAs, and
post-translational modifications that can alter protein stability and
degradation ([37]Buccitelli and Selbach, 2020).
Different proteins have been observed to have very different
mRNA-protein correlations, and pathway enrichment analyses have
identified specific functional groups with lower- or
higher-than-average mRNA-protein correlations. For instance, a number
of metabolic pathways have been shown to have higher-than-average
mRNA-protein correlations ([38]Clark et al., 2019; [39]Huang et al.,
2021; [40]Jarnuczak et al., 2021; [41]Mertins et al., 2016; [42]Zhang
et al., 2014, [43]2016), suggesting limited post-transcriptional
regulation of these proteins. In contrast, subunits of large protein
complexes have been shown to have lower-than-average mRNA-protein
correlations, suggesting significant post-transcriptional regulation
([44]Gonçalves et al., 2017; [45]Ryan et al., 2017; [46]Taggart et al.,
2020; [47]Wang et al., 2017; [48]Wu et al., 2013). Another factor that
might influence mRNA-protein correlations across samples is the
intrinsic variability in mRNA expression. mRNAs that do not vary across
samples, such as those whose expression is usually tightly regulated,
will not correlate with their corresponding proteins because variation
is essential to observe correlation. As we focus our analysis on tumor
profiles, where extensive copy-number alterations result in significant
variation in mRNA abundances, this issue is a smaller concern.
Our technical ability to accurately and reproducibly quantify both
mRNAs and proteins is potentially a major factor that influences the
mRNA-protein correlation. If the error in our measurements is large, we
would expect this error to reduce the correlation between mRNA and
protein even in the absence of the biological factors outlined above. A
number of studies have separately assessed the reproducibility of
either mRNA ([49]’t Hoen et al., 2013; [50]Marioni et al., 2008;
[51]SEQC/MAQC-III Consortium, 2014) or proteomic ([52]Casey et al.,
2017; [53]Tabb et al., 2010) profiling approaches. Others have explored
how measurement errors in mRNA or proteomic profiling can influence the
reported correlation between mRNA and protein abundances, but they
considered within-sample correlations (across all proteins within a
single sample/cell line) rather than across-sample correlations (for
individual proteins across many samples) ([54]Csárdi et al., 2015;
[55]Li et al., 2014). Here, we
analyze studies of tumors and cancer cell lines with replicate
proteomic profiles in order to assess the impact of measurement
reproducibility on mRNA-protein correlation that can be observed for
individual proteins across samples.
Results
A standardized pipeline reveals differences in the mRNA-protein correlation
across studies
The average mRNA-protein correlation reported for different tumor
proteomic profiling efforts varies substantially across studies—ranging
from 0.23 in an early proteomic study of colorectal cancer ([56]Zhang
et al., 2014) to 0.53 in a recent study of lung adenocarcinoma
([57]Gillette et al., 2020) ([58]Table 1). However, it is not
meaningful to directly compare the reported correlations because the
methods used to quantify the mRNA-protein correlation have varied
across studies—different studies have used different summary statistics
(mean versus median), different correlation metrics (Pearson versus
Spearman), and different criteria for protein inclusion (e.g., no
missing values, at least 30% measured values, only the 10% most
variable proteins) ([59]Table 1). To enable a more direct comparison
across studies, we calculated the mRNA-protein correlation for thirteen
proteomic studies using a standardized pipeline. The datasets analyzed
comprise ten studies of tumor samples ([60]Clark et al., 2019; [61]Dou
et al., 2020; [62]Gillette et al., 2020; [63]Huang et al., 2021;
[64]Krug et al., 2020; [65]Mertins et al., 2016; [66]Vasaikar et al.,
2019; [67]Wang et al., 2021; [68]Zhang et al., 2014, [69]2016), two
studies of cancer cell lines ([70]Guo et al., 2019; [71]Nusinow et al.,
2020), and one study of healthy tissues ([72]Jiang et al., 2020).
Within each study, we calculated the median Spearman correlation
between mRNA and protein for all proteins that were measured in at
least 80% of samples ([73]STAR Methods; [74]Tables 1 and [75]S1).
Applying the same pipeline using Pearson correlation rather than
Spearman correlation revealed broadly similar results ([76]Table 1),
and so throughout the remainder of the paper, we focus our analysis on
correlation calculated using Spearman correlation as it is the metric
most commonly used in proteogenomic studies (9 of 13 studies).
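The standardized calculation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the dictionary-of-lists input layout, and the use of `None` for missing values are all assumptions made for the example, and the minimum of three paired samples is an arbitrary guard. Only the 80% coverage filter, the per-protein Spearman correlation across samples, and the median summary come from the text.

```python
from statistics import median


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def median_mrna_protein_correlation(mrna, protein, min_fraction=0.8):
    """mrna / protein: dict mapping gene -> list of per-sample values.

    Samples are assumed to be in the same order in both inputs, and
    missing measurements are assumed to be encoded as None.
    """
    correlations = {}
    for gene, prot_values in protein.items():
        if gene not in mrna:
            continue
        # Keep only proteins measured in at least min_fraction of samples.
        measured = sum(v is not None for v in prot_values) / len(prot_values)
        if measured < min_fraction:
            continue
        pairs = [(m, p) for m, p in zip(mrna[gene], prot_values)
                 if m is not None and p is not None]
        if len(pairs) < 3:  # arbitrary guard against tiny samples
            continue
        xs, ys = zip(*pairs)
        rho = spearman(xs, ys)
        if rho == rho:  # drop NaN (constant profiles)
            correlations[gene] = rho
    return median(correlations.values()), correlations
```

Calling `median_mrna_protein_correlation(mrna, protein)` returns the study-level median together with the per-protein correlations, so the same per-protein values can be reused in downstream analyses.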
Table 1. Analysis of mRNA-protein correlation using a standardized pipeline
Data | Published year | Reported correlation | Protein inclusion criterion in reported correlation | Computed median Spearman correlation | Computed median Pearson correlation
GTEx 32 healthy tissues (GTEx) | 2020 | 0.46 | <5 tissues with missing values for both protein and RNA measurements | 0.51 | 0.59
Cancer Cell Line Encyclopaedia (CCLE) | 2020 | 0.48 | quantified in at least one ten-plex (9 cell lines) | 0.46 | 0.48
NCI-60 cancer cell lines (NCI60) | 2019 | not reported | – | 0.36 | 0.40
Glioblastoma (GBM) | 2021 | not reported | – | 0.50 | 0.51
Head and neck squamous cell carcinoma (HNSCC) | 2021 | 0.52 | <50% missing values | 0.54 | 0.56
Lung adenocarcinoma (LUAD) | 2020 | 0.53 | <50% missing values | 0.55 | 0.56
Endometrial cancer (EC) | 2020 | 0.48 | contain mRNA and protein measurements across all patients | 0.48 | 0.51
Breast cancer (BrCa 2020) | 2020 | 0.41 | contain mRNA and protein measurements (proteins <70% missing values) | 0.44 | 0.43
Clear cell renal carcinoma (ccRCC) | 2019 | 0.43 | contain mRNA and protein measurements across all patients | 0.41 | 0.42
Colon cancer (colon) | 2019 | 0.48 | top 10% most variably expressed proteins quantified in both platforms | 0.27 | 0.28
Ovarian cancer (ovarian) | 2016 | 0.45 | contain mRNA and protein measurements across all patients | 0.41 | 0.41
Breast cancer (BrCa 2016) | 2016 | 0.39 | contain mRNA and protein measurements across all patients passing quality control checks | 0.42 | 0.42
Colon and rectal cancer (CRC 2014) | 2014 | 0.23 | protein measurement with average spectral count across all patients ≥1.4 | 0.21 | 0.22
Across all studies, the median recalculated correlation was 0.43 with a
maximum of 0.55 (lung adenocarcinoma [LUAD]; [78]Gillette et al., 2020)
and a minimum of 0.21 (colorectal cancer [CRC]; [79]Zhang et al.,
2014). In some instances, the recalculated correlation was similar to
that originally reported, but in others there was a substantial
difference. For example, the correlation recalculated for endometrial
cancer (0.48) was the same as originally reported ([80]Dou et al.,
2020), while the recalculated correlation for colon cancer was much
lower than that reported by the authors (0.27 versus 0.48)
([81]Vasaikar et al., 2019). This is because the colon cancer study
reported the mean mRNA-protein correlation for only the 10% most
variable proteins rather than the full set of proteins. These highly
variable proteins have higher than average mRNA-protein correlations.
More recent studies appear to have higher mRNA-protein correlations,
e.g., we observe a mean of 0.49 for studies published after 2019 versus
0.35 for studies published in 2016 or earlier ([82]Table 1). This
cannot simply be attributed to differences in the cancer types studied
in different years, as the two cancer types profiled twice (colon and
breast) see an improvement from the earlier studies ([83]Table 1). This
would suggest that technical and experimental factors may influence the
reported mRNA-protein correlations and that improvements in either
technology or experimental protocols have resulted in improved
mRNA-protein correlations over time.
The correlation across replicate proteomic profiles is only moderate
To assess the reproducibility of mass spectrometry-based proteomic
measurements, we analyzed three studies containing replicate proteomic
profiles: ovarian tumor samples ([84]Zhang et al., 2016), colon tumor
samples ([85]Vasaikar et al., 2019), and cancer cell lines of mixed
lineages from the Cancer Cell Line Encyclopedia (CCLE) ([86]Nusinow
et al., 2020) ([87]Figure 1A). The nature of the replicates varies
across the different studies: for ovarian cancer, the same tumor sample
was profiled in two different laboratories; for the cancer cell lines,
biological replicates were performed within the same lab 1 year apart;
and for colon cancer, the same tumor samples were profiled with two
different mass spectrometry (MS) techniques, i.e., isotope-based
protein quantification (TMT-10) and label-free spectral counting MS.
Thus, there is diversity in the replicate proteomic profiles in terms
of sample types (tumor samples and cancer cell lines), sites, and
techniques used to quantify the proteins.
Figure 1. Protein-protein reproducibility across replicates is moderate and
variable
(A) Overview of the replicates available for the three different
proteomic studies.
(B) For each study, we calculate the Spearman correlation for
individual proteins across the proteomic replicates. The distribution
of the protein-protein reproducibility is shown in the histogram for
all measured proteins. For each study, the black dashed line represents
the median.
Many biological factors that influence mRNA-protein correlation, such
as post-transcriptional regulation, are not relevant in the case of
replicate measurements of proteins, and so we expected the replicate
proteomic profiles to be more highly correlated than mRNA and protein
profiles. This was indeed the case for all studies. The median
protein-protein reproducibility for the replicate proteomic profiles
from the CCLE dataset was 0.72 ([90]Figure 1B; [91]Table S2), whereas
the median mRNA-protein correlation was only 0.48 ([92]Table 1). The
median protein-protein reproducibility for the replicate proteomic
profiles of ovarian tumors was 0.57 ([93]Figure 1B), which is higher
than the median mRNA-protein correlation of 0.41 ([94]Table 1). The
replicate protein-protein reproducibility for the colon study (median
0.28) was much lower than that observed for the other studies. However,
it was still higher than the calculated median mRNA-protein correlation
(0.21). One reason for the colon study to have a low median
protein-protein reproducibility is that one of the two replicate
proteomic profiles is quantified using label-free/spectral counting MS,
which is not as accurate as the stable isotope-based protein
quantification methods ([95]Liu et al., 2016). Overall, we can conclude
that although protein-protein reproducibility is consistently higher
than mRNA-protein correlations, the protein-protein reproducibility is
still only moderate.
Proteins with higher reproducibility have higher mRNA-protein correlation
The moderate correlations reported between mRNA and protein abundances
have been attributed to a variety of biological factors, including
post-transcriptional regulation, varying translation rates, and varying
degradation rates ([96]Buccitelli and Selbach, 2020; [97]Payne, 2015;
[98]Vogel and Marcotte, 2012). However, our observation that some
proteins can be quantified more reproducibly than others suggests that
noise in quantification may also be a major factor. If this is the
case, we would expect that proteins that can be more reproducibly
quantified will have a higher mRNA-protein correlation. To assess this,
for each study we used the replicate proteomic profiles to stratify the
proteins into deciles, ranging from the 10% of proteins with the lowest
protein-protein reproducibility to the 10% with the highest
protein-protein reproducibility ([99]STAR Methods). We then calculated
the mRNA-protein correlation for all of the proteins within each
decile. We found, for all three studies, that the median mRNA-protein
correlation increases with protein-protein reproducibility
([100]Figure 2). The colon cancer study shows a difference in the
median mRNA-protein correlation of 0.33 between the first and last
deciles of protein reproducibility. Similarly, ovarian cancer data show
a difference of 0.35, and the CCLE data show a difference of 0.37. This
indicates that the reproducibility of proteomic measurements has a
major impact on the calculated mRNA-protein correlation. We used a
linear regression model to understand how much of the variation in
mRNA-protein correlation can be explained by variation in
protein-protein reproducibility and found that it explains
approximately 14%, 17%, and 23% in the ovarian, CCLE, and colon
studies, respectively ([101]STAR Methods; [102]Figure 2 and [103]S1A).
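The decile stratification and the variance-explained calculation can be illustrated with a short sketch. It assumes that per-protein reproducibility and mRNA-protein correlation values are already available as two equal-length lists in the same protein order; the function names are hypothetical, and the simple one-predictor least-squares fit stands in for whatever regression implementation the authors used. It also assumes there are at least as many proteins as bins.

```python
from statistics import mean, median


def decile_medians(reproducibility, correlation, n_bins=10):
    """Median mRNA-protein correlation within each reproducibility bin.

    Proteins are sorted by reproducibility and split into n_bins groups
    of (nearly) equal size, from least to most reproducible.
    """
    pairs = sorted(zip(reproducibility, correlation))
    n = len(pairs)
    medians = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        medians.append(median(c for _, c in chunk))
    return medians


def r_squared(x, y):
    """R^2 of a least-squares regression of y on the single predictor x.

    For one predictor this equals the squared Pearson correlation.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)
```

The difference between the last and first entries of `decile_medians(...)` corresponds to the decile gap quoted in the text, and `r_squared(reproducibility, correlation)` corresponds to the fraction of variation explained.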
Figure 2. Proteins with higher reproducibility have higher mRNA-protein
correlation
(A–C) Boxplots showing the distribution of mRNA-protein correlation for
proteins binned according to their protein-protein reproducibility in
the colon (A), ovarian (B), and CCLE (C) studies. The total number of
proteins considered for each plot is indicated at the top right corner.
The bins are deciles—each containing ∼10% of the proteins. The decile
is indicated on the x axis along with the highest correlation between
experimental replicates present within that decile. For each box plot,
the black central line represents the median, the bottom and top lines
represent the 1st and 3rd quartiles, and the whiskers extend to 1.5
times the interquartile range past the box. Outliers are not shown. The
median of each decile is indicated above/below the black central line
for each box plot. The median mRNA-protein correlation across all
proteins for each study is indicated as a dotted gray line in each
plot. The R^2 obtained from regressing the mRNA-protein correlation on
protein-protein reproducibility is in the bottom right corner.
Previous work has identified protein complex membership as the factor
most predictive of variation in mRNA-protein correlation, with subunits
of protein complexes typically having lower-than-average mRNA-protein
correlation ([106]Gonçalves et al., 2017; [107]Ryan et al., 2017).
Using the same linear modeling approach as above, we found that protein
complex membership explains approximately 3%, 8%, and 6.7% of the
variation in the ovarian, CCLE, and colon studies, respectively
([108]Figure S1A). This suggests that noise in the quantification of
protein abundances explains much more (on average ∼3 times) of the
variance in mRNA-protein correlation than the most predictive
previously identified factor. Combined, the protein-protein
reproducibility and protein complex membership features explained
approximately 17%, 23%, and 26% of the variation in mRNA-protein
correlation in the ovarian, CCLE, and colon studies, respectively
([109]Figure S1A). This is significantly more than protein complex
membership or protein-protein reproducibility alone (p < 0.001,
likelihood ratio test), suggesting that protein complex membership and
protein reproducibility independently contribute to the variation in
mRNA-protein correlation. This is also evident when binning proteins
into reproducibility deciles—although proteins that are complex
subunits are present in every decile, they have consistently lower
mRNA-protein correlations ([110]Figures S1B–S1D).
Proteins with high reproducibility in one study are also highly reproducible
in other studies
In addition to providing a summary of how reproducible the protein
measurements from each study are on average, the replicate profiles
enable us to see which proteins are most reproducibly quantified
overall. In the CCLE study, the median correlation between replicate
measurements calculated across all proteins was 0.72, but this ranged
from −0.2 to 1.0 for individual proteins. Similarly, the median for all
proteins in the ovarian study was 0.57, but the individual correlations
ranged from −0.6 to 1.0, and the median for the colon tumor study was
0.28 with a range from −0.2 to 0.8. This suggests that, at least within
individual studies, some proteins may be more reproducibly quantified
than others.
To understand whether the same proteins were reproducibly quantified
across multiple studies, we analyzed pairs of studies together. We
found that there was a moderate correlation (0.38) between the protein
reproducibility calculated using the ovarian tumor replicates and the
colon cancer replicates ([111]Figure 3A). Combinations of other pairs
of studies revealed similar moderate correlations: colon and CCLE
(0.31) and ovarian and CCLE (0.24) ([112]Figures 3B and 3C). Although
the nature of the samples (tumor versus cell line) and the
quantification approaches (TMT/label-free quantification) varied across
studies, this suggests that there is some agreement in terms of which
proteins can be reproducibly quantified. In general, proteins that are
highly reproducible in one study tend to be highly reproducible in
others, while proteins that show poor reproducibility in one study tend
to show poor reproducibility in others ([113]Figure 3). For example,
GBP1 is one of the proteins with reproducibility that is consistently
high across all three studies ([114]Figure 3D), while RPS29 has
consistently low reproducibility ([115]Figure 3E).
Figure 3. Proteins with high reproducibility in one study are also highly
reproducible in other studies
(A–C) Binned heatmaps showing the relationship between the
protein-protein reproducibility calculated in different studies. Each
heatmap shows the relationship between two studies, indicated on the x
and y axes. The regions of the heatmaps are colored according to the
number of proteins present in the region as indicated in the color bar.
The number of proteins in common and Spearman correlation between the
two studies, with the associated p value, are specified in the box for
each of the plots.
(D and E) For each study with experimental protein replicates, scatter
plots illustrating the relationship between protein-protein
reproducibility are shown for a protein with high reproducibility, GBP1
(D), and a protein with low reproducibility, RPS29 (E). For each
scatter plot, the Spearman correlation coefficient of the
protein-protein reproducibility and the associated p value is indicated
at the bottom.
An integrated ranking of protein reproducibility partially explains the
variable mRNA-protein correlation in 10 additional studies
Proteogenomic studies with large numbers of replicates, such as the
three we analyzed above, are the exception rather than the rule.
Consequently, for most studies, we do not know how reproducible the
proteomic measurements are. However, as noted above, proteins that are
highly reproducibly quantified in one study are more likely to be
highly reproducible in others. We therefore sought to aggregate the
replicate protein correlations from all three studies (CCLE, ovarian,
colon) into a single list containing a ranking of protein
reproducibility ([118]STAR Methods; [119]Figure S2A; [120]Table S2). We
evaluated a number of different aggregation approaches and found that a
simple method using average normalized rank explained the most variance
in mRNA-protein correlations of the three studies containing proteomic
replicates ([121]STAR Methods; [122]Figure S2B). We used this approach
to create a ranked order of protein reproducibility for the 5,211
proteins that were quantified in at least two out of the three studies.
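One way to picture the average normalized rank aggregation is sketched below. The exact normalization is specified in the STAR Methods rather than in this section, so the details here are assumptions: within each study, proteins are ranked by their replicate correlation and the rank is divided by the number of proteins in that study, giving a value in (0, 1] where higher means more reproducible; the aggregate is the mean of these values over the studies in which the protein was quantified. The function names and the dictionary input layout are hypothetical; only the requirement of at least two studies comes from the text.

```python
def normalized_ranks(scores):
    """Map each protein's score to rank / n; tied scores share the mean rank.

    scores: dict mapping protein -> replicate correlation in one study.
    """
    proteins = sorted(scores, key=scores.get)
    n = len(proteins)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[proteins[j + 1]] == scores[proteins[i]]:
            j += 1
        shared = ((i + j) / 2 + 1) / n
        for k in range(i, j + 1):
            ranks[proteins[k]] = shared
        i = j + 1
    return ranks


def aggregate_reproducibility(studies, min_studies=2):
    """Average the normalized ranks across studies.

    studies: dict mapping study name -> {protein: replicate correlation}.
    Proteins quantified in fewer than min_studies studies are dropped.
    """
    per_study = [normalized_ranks(s) for s in studies.values()]
    aggregated = {}
    for protein in set().union(*per_study):
        values = [r[protein] for r in per_study if protein in r]
        if len(values) >= min_studies:
            aggregated[protein] = sum(values) / len(values)
    return aggregated
```

Normalizing by the size of each study keeps studies with different numbers of quantified proteins on a common scale before averaging, which is the motivation for using normalized rather than raw ranks in this sketch.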
We then used this aggregated list to assess the extent to which
“average” protein reproducibility explains the varying mRNA-protein
correlations observed in ten other studies ([123]Clark et al., 2019;
[124]Dou et al., 2020; [125]Gillette et al., 2020; [126]Guo et al.,
2019; [127]Huang et al., 2021; [128]Jiang et al., 2020; [129]Krug
et al., 2020; [130]Mertins et al., 2016; [131]Wang et al., 2021;
[132]Zhang et al., 2014) ([133]Figure 4). For all these studies, we
find that proteins with more reproducible measurements tend to have
higher mRNA-protein correlations. Although the aggregated ranks are
based on data from cancer studies, we observe the same trend in healthy
tissues obtained from the GTEx project ([134]Figure 4J). Similarly,
although the aggregated ranks are generated using studies that quantify
proteins through data-dependent acquisition (DDA) approaches, we
observed the same trend for a study that quantified proteins using
data-independent acquisition (DIA)-based proteomics (sequential window
acquisition of all theoretical mass spectra [SWATH-MS]) in the NCI-60
cancer cell lines ([135]Figure 4I). In general, the mRNA-protein
correlation increases with protein reproducibility for samples from
both healthy and diseased conditions and irrespective of the proteomic
quantification approach.
Figure 4. Aggregated protein reproducibility ranks partially explain the
variable mRNA-protein correlation in 10 additional studies
(A–J) For studies without experimental proteomic replicates, boxplots
showing the distributions of mRNA-protein correlation for proteins in
each decile of the aggregated protein reproducibility ranks. (A)–(H)
are the CPTAC tumor studies; (I) is the NCI-60 cancer cell lines study
wherein protein quantification, used for computing the mRNA-protein
correlation, is obtained from data-independent acquisition-based
untargeted proteomics (SWATH-MS); and (J) is the healthy tissues study
from the GTEx Consortium. Box plot details as in [138]Figure 2.
To quantify the amount of variation in mRNA-protein correlation that
could be explained by our aggregated protein reproducibility ranks, we
used a linear regression model for the ten different studies. We found
that the aggregated ranks explain ∼10%–20% (median 14%) of the
variation in these studies ([139]Figure 4).
To test if there was an advantage to using the aggregate protein
reproducibility over protein reproducibility measured in any of the
three individual studies (CCLE, ovarian, colon), we compared the
variance explained by the aggregate ranks with that explained by each
individual study. In all ten studies without proteomic replicates, the
aggregated ranks explained the variation in mRNA-protein correlation
better than the ranks from any individual dataset ([140]Figure S3).
A number of efforts have been made to use machine learning to predict
protein abundances from mRNA abundances ([141]Fortelny et al., 2017;
[142]Li et al., 2019; [143]Yang et al., 2020). Recently, the NCI-CPTAC
DREAM proteogenomics challenge engaged the community to predict protein
abundances of breast and ovarian tumor profiles using their
corresponding genomic and transcriptomic information ([144]Yang et al.,
2020). We hypothesized that proteins whose measurements are highly
reproducible could be predicted better using machine-learning
algorithms. Hence, we analyzed the prediction scores from the
best-performing model using the protein reproducibility data. We
observed a stark difference in the prediction scores of the lowest and
highest deciles of the protein reproducibility ([145]Figures S4A and
S4B). While the lowest decile has a correlation of ∼0.35 between the
measurements and predictions, the highest decile has a correlation of
∼0.7. The aggregated protein reproducibility ranks could explain ∼25%
and 26% of the variation in the prediction scores of breast and ovarian
cancer studies, respectively, again outperforming the reproducibility
measured in any individual study ([146]Figure S4C).
Protein measurement reproducibility is influenced by abundance, variance, and
unique peptides
To understand what causes some proteins to be more reproducibly
measured than others, we analyzed a number of factors that we
hypothesized might influence the reliability of their measurements.
All of the studies analyzed here make use of “bottom-up” quantification
approaches where proteins are first digested into peptides; these
peptides are then quantified using a mass spectrometer, and peptide
quantifications are converted into protein abundances computationally.
This quantification is a stochastic process, and there is no guarantee
that every peptide in a given sample will be detected by the mass
spectrometer. The quantification of proteins that have low abundance,
and hence fewer detectable peptides, is especially likely to be subject
to substantial stochastic variation. A small number of peptides missed
can make a big difference to the quantification of these low abundance
proteins, while for highly abundant proteins, a few extra or missing
peptides will make only a small difference. To assess the contribution
of protein abundance to protein measurement reproducibility, we
obtained the protein abundances measured in 201 tissue samples from 32
healthy human tissues collected by the GTEx project ([147]Jiang et al.,
2020). For each protein, we calculated the mean abundance across all
samples and tissues. We found a clear relationship between the mean
protein abundance and the aggregated protein reproducibility rank—more
abundant proteins are more reproducibly measured ([148]Figure 5A). We
performed a similar analysis for the three individual proteomic
replicate studies and found the result to be consistent
([149]Figures S5A–S5C).
Figure 5. Protein reproducibility is mainly influenced by abundance, variance,
and unique peptides and not protein half-lives
(A–C) Boxplots showing the distribution of aggregated protein
reproducibility ranks for proteins binned according to protein
abundance (A), variance (B), and number of unique peptides (C). Box
plot details as in [152]Figure 2.
(D) Boxplot showing the distribution of aggregated protein
reproducibility ranks for proteins with short and long protein
half-lives.
Proteins whose abundances do not vary significantly across individuals
are unlikely to have high mRNA-protein correlations, as correlation
measures are dependent on there being meaningful variation in the data.
Furthermore, as the variation observed experimentally is likely a
combination of both real biological variation and experimental noise,
proteins with lower biological variation in abundance will tend to be
more affected by measurement noise. For each protein, we computed the
variance in protein abundance across samples from the GTEx project
([153]Jiang et al., 2020). We then assessed the influence of this
variance on the reproducibility of measurements of individual proteins.
Similar to the mean protein abundance above, we found that proteins
with a higher variance of protein abundance are more reproducibly
measured ([154]Figure 5B). Furthermore, the variance of protein
abundance explains ∼20% of the variation in the aggregated protein
reproducibility ranks. Similar trends were observed for the three
individual proteomic replicate studies ([155]Figures S5A–S5C).
The number of unique peptides generated per protein is also crucial for
protein quantification by MS. To assess the impact of this, we
obtained the number of unique peptides identified per protein from
the GTEx study. We stratified all proteins into deciles based on the
number of unique peptides identified and found that the aggregated
protein reproducibility increased with every decile of unique peptides
identified ([156]Figure 5C). This pattern was also evident in the
protein reproducibility measured in each of the three individual
studies ([157]Figures S5A–S5C). Thus, the more unique peptides
identified per protein, the higher the confidence of the measured
protein levels.
One of the biological reasons proposed for the weak mRNA-protein
correlation is the difference in mRNA and protein half-lives
([158]Vogel and Marcotte, 2012). mRNAs typically have a half-life of
2.6–7 h, while proteins have half-lives ranging from a few seconds to a
few days ([159]Vogel and Marcotte, 2012). Recently, proteins with
longer half-lives were found to be more predictable using machine
learning, irrespective of the transcript half-lives ([160]Yang et al.,
2020). This led us to assess protein half-life as a potential factor
in the reproducibility of protein measurements. We obtained protein
half-life estimates from a previous publication ([161]Zecha et al.,
2018) and divided them into two categories—long and short half-lives
([162]STAR Methods)—as was done in [163]Yang et al. (2020). Although
both categories contain proteins with reproducibility scores ranging
from 0 to 1, proteins with a long half-life have a higher median
protein reproducibility score (p = 9.70e−25, Mann-Whitney U test,
two-sided; [164]Figures 5D and [165]S5A–S5C).
We note that there is some correlation between the attributes
considered; in particular, more abundant proteins tend to have more
unique peptides identified. To understand the relative contribution of
each factor, we performed rank regression by using the individual
factors as the explanatory variables and the ranks of the proteomic
reproducibility as the response variable ([166]STAR Methods). We found
in all cases that a model including all four factors performed better
than a model including only the best individual factor, suggesting that
variance in reproducibility can best be explained by a combination of
factors ([167]Figure S5D).
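A minimal sketch of such a comparison is given here for illustration.
It assumes that rank regression means rank-transforming the response
and the explanatory variables before ordinary least squares and that
half-life is encoded as 0/1; the function name, column names, and
synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def rank_regression_r2(df, response, factors):
    """R-squared of each single-factor model and of the combined model, fitted on ranks."""
    # Use the same proteins for every model, then rank-transform each column.
    ranked = df[[response] + list(factors)].dropna().rank()
    single = {
        f: smf.ols(f"{response} ~ {f}", data=ranked).fit().rsquared
        for f in factors
    }
    combined = smf.ols(
        f"{response} ~ {' + '.join(factors)}", data=ranked
    ).fit().rsquared
    return single, combined


# Synthetic stand-in data (hypothetical column names, not the study's values).
rng = np.random.default_rng(0)
n = 500
abundance = rng.normal(size=n)
toy = pd.DataFrame({
    "abundance": abundance,
    "variance": rng.normal(size=n),
    "peptides": abundance + rng.normal(size=n),  # correlated with abundance
    "long_half_life": rng.integers(0, 2, size=n),
})
toy["reproducibility"] = (
    toy["abundance"] + 0.5 * toy["variance"]
    + 0.5 * toy["peptides"] + rng.normal(size=n)
)

factor_names = ["abundance", "variance", "peptides", "long_half_life"]
single_r2, combined_r2 = rank_regression_r2(toy, "reproducibility", factor_names)
print(single_r2, combined_r2)
```

Because the single-factor models are nested in the combined model, the
combined R-squared can never be lower on the same proteins; the
informative quantity is how much it exceeds the best single factor.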
The factors above all contribute to protein-protein reproducibility,
raising the question of whether they themselves might be sufficient to
explain variation in mRNA-protein correlation. To assess this, we
performed linear regression with these factors (abundance, variance,
unique peptides, and protein half-lives) as explanatory variables and
the mRNA-protein correlation of each of the 13 different studies as
response variables. We found that a combined model of the factors
explained ∼3%–17% of the variation in mRNA-protein correlation of the
different studies ([168]Figure S6). However, the aggregated protein
reproducibility explains a considerably higher percentage of the
variation in mRNA-protein correlation in 12 of 13 studies. The GTEx
study is the lone exception, likely a result of the independent
variables (protein abundance, variance, number of unique peptides)
being calculated from the GTEx study itself ([169]Figure S6).
Transcriptomic reproducibility also contributes to the variance in
mRNA-protein correlation
Thus far, we have primarily focused on understanding the influence of
protein quantification reproducibility on mRNA-protein correlation.
However, it is also likely that the reproducibility of mRNA
measurements is an important factor in determining mRNA-protein
correlations. To assess the impact of transcriptomic reproducibility on
mRNA-protein correlation, we compared transcriptomic profiles for 382
cancer cell lines from the CCLE ([170]Ghandi et al., 2019) with those
generated in a separate profiling effort ([171]Klijn et al., 2015). We
find that the median gene-wise Spearman correlation across studies was
0.75 ([172]STAR Methods; [173]Figure 6A). Again, this varied
significantly across transcripts, ranging from −0.05 to 0.96. As with
protein reproducibility, we find that transcriptomic reproducibility is
influenced by both mRNA abundance and variance ([174]STAR Methods;
[175]Figure S5E).
Figure 6.
Transcriptomic reproducibility contributes to the variance in
mRNA-protein correlation
(A) Histogram showing the distribution of the gene-wise correlation
between experimental transcriptomic replicates of 382 cancer cell
lines. The black line represents the median.
(B) For each of the 13 studies analyzed here, the R-squared obtained by
regressing mRNA-protein correlation on transcriptomic reproducibility
and aggregated protein reproducibility scores individually and in
combination over the same set of proteins is shown in the dot plot. The
number of proteins analyzed for each study is indicated in brackets
below the study on the y axis.
We used a linear regression model to quantify, in all thirteen
proteogenomic studies, how much of the variation in mRNA-protein
correlation could be explained by transcriptomic reproducibility. We
found that the median variance explained was 15%. In most studies
(8/13), our aggregated protein reproducibility explained a higher
proportion of the variance than the mRNA reproducibility
([178]Figure 6B).
Compared with the other studies, the CCLE study had a strikingly higher
percentage of variance explained by transcriptomic reproducibility
(40%). This is presumably because there is a large overlap in the set
of samples used to compute the transcriptomic reproducibility and the
CCLE mRNA-protein correlation, unlike the other studies. For the CCLE,
the variance explained by mRNA-mRNA reproducibility is higher than the
variance explained by protein-protein reproducibility. However, the
mRNA-mRNA reproducibility was estimated using a much higher number of
cell lines (382 versus 18 for protein-protein reproducibility), which
we reasoned could account for the greater variance explained. To test
this hypothesis, we downsampled the available transcriptomic data to
make the comparison more equal (sampling 18 cell lines with
transcriptomes at random; [179]STAR Methods). We found that, using this
approach, the protein-protein reproducibility explained more of the
mRNA-protein variability than the mRNA-mRNA reproducibility (on
average, ∼2.8 times). This suggests that protein-protein
reproducibility may influence mRNA-protein correlation more than
mRNA-mRNA reproducibility does but that 18 cell lines is not sufficient
to obtain a robust estimate of protein-protein reproducibility.
The Spearman correlation between aggregated protein reproducibility and
CCLE transcriptomic reproducibility is 0.37 across 4,795 proteins. This
suggests that there is some agreement between the reproducibility of
proteins and transcripts and that, to some extent, proteins that are
reproducibly measured are encoded by transcripts that are more
reproducibly measured. To assess if both mRNA and protein
reproducibility independently contribute to the variability of
mRNA-protein correlation across all 13 studies, we used a linear model
with the two factors as independent variables and mRNA-protein
correlation as the dependent variable. We found that in all cases, the
two factors together explained a higher proportion of variance than
either factor alone (p < 0.001, likelihood ratio test). In the case of
the CCLE study (used to calculate the mRNA reproducibility and one of
the three studies used to calculate protein reproducibility), the two
factors together explained 48% of the variance. For the 12 other
studies, the two factors together explained ∼14%–26% of the variance
([180]Figure 6B). These observations suggest that the reproducibility
of transcriptomic data and that of proteomic data contribute strongly
and somewhat independently to the variability observed in mRNA-protein
correlation.
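The nested-model comparison described above can be sketched as follows.
This is an illustration only: the data are synthetic, the column names
are hypothetical, and the use of the StatsModels `compare_lr_test`
method is an assumption about how such a likelihood ratio test could be
run, not a description of the analysis code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; names and effect sizes are hypothetical.
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "protein_repro": rng.uniform(size=n),
    "mrna_repro": rng.uniform(size=n),
})
toy["mrna_protein_corr"] = (
    0.3 * toy["protein_repro"] + 0.3 * toy["mrna_repro"]
    + rng.normal(scale=0.1, size=n)
)

# Full model with both reproducibility measures as explanatory variables.
full = smf.ols("mrna_protein_corr ~ protein_repro + mrna_repro", data=toy).fit()

lr_results = {}
for single in ["protein_repro", "mrna_repro"]:
    # Restricted model with one factor only, nested in the full model.
    restricted = smf.ols(f"mrna_protein_corr ~ {single}", data=toy).fit()
    lr_stat, p_value, df_diff = full.compare_lr_test(restricted)
    lr_results[single] = {
        "r2_single": restricted.rsquared,
        "r2_both": full.rsquared,
        "p_value": p_value,
        "df_diff": df_diff,
    }

print(pd.DataFrame(lr_results).T)
```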
Metabolic pathways with higher-than-average mRNA-protein correlations may
reflect differential reproducibility rather than differential
post-transcriptional regulation
Previous work has found that certain pathways and processes are
enriched in proteins that have higher- or lower-than-average
mRNA-protein correlations. For instance, ribosomal subunits have been
found to have consistently lower-than-average mRNA-protein correlations
across multiple studies ([181]Clark et al., 2019; [182]Mertins et al.,
2016; [183]Zhang et al., 2014, [184]2016), while members of pathways
related to amino acid metabolism have been found to have
higher-than-average mRNA-protein correlation ([185]Clark et al., 2019;
[186]Huang et al., 2021; [187]Jarnuczak et al., 2021; [188]Mertins
et al., 2016; [189]Zhang et al., 2014, [190]2016). This variation
across functional groups has been attributed to differential
post-transcriptional regulation. However, our observation that both
protein-protein measurement reproducibility and mRNA-mRNA measurement
reproducibility contribute significantly to the variation in
mRNA-protein correlation across genes suggests an alternative
explanation—some pathways may have higher- or lower-than-average
mRNA-protein correlations simply because their component proteins are
more reproducibly measured. To test this hypothesis, we first performed
pathway enrichment analysis on the mRNA-protein correlations from the
CCLE and ovarian datasets ([191]STAR Methods; [192]Figures 7 and
[193]S7). Consistent with previous studies, we observed that proteins
with high mRNA-protein correlations are enriched in gene sets involved
in environmental information processing and metabolic pathways, while
proteins with low mRNA-protein correlations are enriched in annotations
related to housekeeping protein complexes ([194]Figure 7;
[195]Tables S3 and [196]S4). To assess whether these enrichments could
simply be attributed to variable reproducibility, we next performed
pathway enrichment analysis on the CCLE and ovarian mRNA-protein
correlation data after accounting for variation in protein-protein and
mRNA-mRNA reproducibility ([197]STAR Methods). We found in both studies
that the “housekeeping” protein complexes were still identified as
being enriched among proteins with lower-than-average mRNA-protein
correlations but that the metabolic pathways were no longer enriched in
proteins with higher-than-average mRNA-protein correlations
([198]Figures 7 and [199]S7; [200]Tables S3 and [201]S4). Other
pathways with higher-than-average mRNA-protein correlations related to
environmental information processing were also no longer significant
after adjusting for reproducibility. This suggests that while large
housekeeping protein complexes such as the ribosome have
lower-than-average mRNA-protein correlation that may be attributed to
post-transcriptional mechanisms, the higher-than-average mRNA-protein
correlation previously observed for metabolic pathways may simply
reflect more reproducible measurements of their constituent proteins
and transcripts.
Figure 7.
Metabolic pathways with higher-than-average mRNA-protein correlations
may reflect differential reproducibility
Bar charts displaying the KEGG pathway enrichment analysis of the CCLE
mRNA-protein correlation before (left) and after (right) accounting for
protein-protein and mRNA-mRNA reproducibility. The −log[10] of
Benjamini-Hochberg false discovery rate (FDR)-corrected p values
calculated using Mann-Whitney U test is used to assess enrichment for
the pathway. For each bar chart, the gray line indicates the threshold
considered for significant enrichment (FDR < 0.05). If the enrichment
is below the threshold, then it is not considered significant. The bars
are colored orange if the median mRNA-protein correlation of genes
within the pathway is greater than the median mRNA-protein correlation
of genes not in the pathway; otherwise, the bars are colored blue.
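The enrichment test summarized in the Figure 7 legend can be sketched as
follows. This is a minimal illustration under stated assumptions: the
function name, the toy gene sets, and the two-sided alternative are
hypothetical choices, and the adjustment for reproducibility is not
shown.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests


def pathway_enrichment(corr, pathways, alpha=0.05):
    """Mann-Whitney U test per pathway with Benjamini-Hochberg correction.

    corr: per-gene mRNA-protein correlation (index = gene symbol).
    pathways: dict mapping a pathway name to a collection of gene symbols.
    """
    rows = []
    for name, genes in pathways.items():
        in_pathway = corr[corr.index.isin(genes)]
        not_in_pathway = corr[~corr.index.isin(genes)]
        _, p = mannwhitneyu(in_pathway, not_in_pathway, alternative="two-sided")
        rows.append({
            "pathway": name,
            "p": p,
            # True corresponds to an orange bar, False to a blue bar.
            "higher_median": in_pathway.median() > not_in_pathway.median(),
        })
    res = pd.DataFrame(rows).set_index("pathway")
    res["fdr"] = multipletests(res["p"], method="fdr_bh")[1]
    res["neg_log10_fdr"] = -np.log10(res["fdr"])
    res["significant"] = res["fdr"] < alpha
    return res


# Toy example: 60 genes with evenly spaced correlations (hypothetical).
gene_names = [f"g{i}" for i in range(60)]
toy_corr = pd.Series(np.linspace(0.0, 1.0, 60), index=gene_names)
toy_pathways = {
    "high": gene_names[40:],    # the 20 highest correlations
    "low": gene_names[:20],     # the 20 lowest correlations
    "mixed": gene_names[::3],   # spread evenly across the range
}
enrichment = pathway_enrichment(toy_corr, toy_pathways)
print(enrichment)
```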
Discussion
Here, we have demonstrated that the reproducibility of protein and
transcript measurements is a major factor in the observed
correlations between mRNA and protein abundances. After taking this
into account, we found that some pathways previously identified as
having a high mRNA-protein correlation are likely just more
reproducibly measured. We therefore suggest that conclusions about
functional groups with higher or lower mRNA-protein correlations,
especially with regard to the potential role played by
post-transcriptional regulation, should be made only after accounting
for variation in the measurement reproducibility of their constituent
proteins. To this end, we have generated an aggregate protein
reproducibility rank for each protein that can explain a significant
amount of the variance across multiple proteogenomic studies and that
may be useful for identifying those proteins that can be reliably and
reproducibly measured by mass spectrometry. Such proteins may be more
useful to assay in, e.g., diagnostic panels.
Recently, there have been a number of attempts to predict protein
abundances from transcriptomic data that have achieved modest success
([204]Barzine et al., 2020; [205]Fortelny et al., 2017; [206]Li et al.,
2019; [207]Yang et al., 2020). We found here that proteins that are
more reproducibly measured across experimental replicates are better
predicted using machine learning. This suggests that one of the factors
limiting the accuracy of machine-learning methods to predict protein
abundances is that the protein abundance measurements themselves are
not reproducible. It may therefore be worth evaluating future methods
on the subset of proteins that can be reproducibly measured.
Limitations of the study
Our emphasis here has been on understanding how variability in the
measurements of individual proteins can influence the mRNA-protein
correlations observed in published tumor proteogenomic studies. We have
shown that proteins/transcripts that are more reproducibly measured
tend to have higher mRNA-protein correlations, and we have identified a
number of factors (e.g., protein abundance) that influence variation in
measurement reproducibility. There are of course additional factors
that influence the global reproducibility of proteomes and
transcriptomes quantified from “replicates” of the same sample. These
include real biological variation (e.g., tumor heterogeneity resulting
in two samples of the same tumor having different profiles) and
technical variation (e.g., variation in sample preparation between
different runs of the same sample). We have not been able to address
how much of the variance in the measurements of individual proteins can
be attributed to these global factors. It is likely that reducing these
sources of global variation, e.g., through automated sample
preparation, will improve the overall reproducibility of protein
measurements. We note also that our analyses do not reflect the best
possible reproducibility of proteomic and transcriptomic measurements,
but rather they reflect the reproducibility observed in existing
large-scale proteogenomic datasets. Indeed, we see that more recent
proteogenomic studies have higher mRNA-protein correlations, suggesting
that methodological improvements are already reducing the sources of
noise in these approaches.
Our results from analyzing the CCLE dataset, where the replicate
correlation is highest, give what is likely the most realistic
assessment of proteomic reproducibility using modern MS pipelines. The
replicates in the CCLE study were generated by the same lab, using the
same methodology, 1 year apart. In contrast, the ovarian cancer study
contains replicates generated in different labs (introducing
significant measurement heterogeneity), while the colon cancer study
makes use of replicates generated using two entirely different MS
approaches (label free versus TMT). Consequently, these likely
represent lower-bound estimates of the reproducibility that can be
observed using modern MS proteomic pipelines. Nonetheless, they likely
reasonably approximate the non-biological heterogeneity observed
between transcriptomes and proteomes in the studies analyzed, where
mRNA and proteins are quantified separately using orthogonal
techniques.
In the case of the CCLE data, we have a small number (18) of samples
with replicate proteomic profiles available and a larger number (382)
with replicate transcriptomes. However, only 8 samples have both
replicate transcriptomes and replicate proteomes. Having a larger
number of samples with both replicate proteomes and replicate
transcriptomes would allow us to better estimate the actual correlation
between mRNA and protein across samples after correction for
measurement error, as has previously been done to estimate the true
"within sample" mRNA-protein correlation in yeast ([208]Csárdi et al.,
2015; [209]Spearman, 1904).
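For reference, the classical correction for attenuation (Spearman, 1904)
can be written as below. The symbols are introduced here for
illustration only: r_mp denotes the observed mRNA-protein correlation,
r_mm the correlation between replicate mRNA measurements, and r_pp the
correlation between replicate protein measurements.

```latex
r_{\mathrm{corrected}} = \frac{r_{mp}}{\sqrt{r_{mm}\, r_{pp}}}
```

With purely hypothetical values r_mp = 0.45, r_mm = 0.75, and
r_pp = 0.60, the corrected estimate would be 0.45 / sqrt(0.45), or
approximately 0.67. Reliable estimates of r_mm and r_pp require
replicate measurements on the same samples, which is why a larger set
of samples with both kinds of replicates would be needed.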
Here, we have shown that a number of factors measured using the GTEx
dataset, including measured protein abundance, measured protein
variance, and measured unique peptides, are associated with proteomic
reproducibility in cancer studies. Across studies, a model that
incorporates all three factors outperformed models using each variable
alone. However, this may be because all three represent imperfect
measurements of the same underlying variable—real average protein
abundance. Previous work has demonstrated that statistical modeling
that integrates multiple mRNA and protein datasets and explicitly takes
into account different sources of noise and error can be used to
provide improved estimates of mRNA-protein correlation within samples
([210]Csárdi et al., 2015). As additional studies with proteomic
replicates and transcriptomic replicates become available, it may be
possible to develop improved models that provide more reliable
estimates of protein reproducibility and the factors that influence it.
Such estimates could be improved through the incorporation of
additional estimates of average protein abundance and variation (e.g.,
from [211]Wang et al., 2019).
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
cBioPortal | [212]Cerami et al. (2012); [213]Gao et al. (2013) | [214]https://www.cbioportal.org/; RRID:[215]SCR_014555
Cancer Dependency Map (DepMap) 20Q4 | [216]Ghandi et al. (2019) | [217]https://depmap.org/portal/ccle/; [218]https://figshare.com/articles/dataset/DepMap_20Q4_Public/13237076; RRID:[219]SCR_017655
LinkedOmics | [220]Vasaikar et al. (2018) | [221]http://www.linkedomics.org/
CPTAC Python API | [222]Lindgren et al. (2021) | [223]https://pypi.org/project/cptac/
CORUM 3.0 | [224]Giurgiu et al. (2019) | [225]http://mips.helmholtz-muenchen.de/corum/; RRID:[226]SCR_002254
KEGG Pathway | [227]Kanehisa (2019); [228]Kanehisa and Goto (2000); [229]Kanehisa et al. (2021) | [230]https://www.genome.jp/kegg/pathway.html; RRID:[231]SCR_018145
Colorectal cancer transcriptomics | [232]Cancer Genome Atlas Network (2012) | [233]https://cbioportal-datahub.s3.amazonaws.com/coadread_tcga_pub.tar.gz
Colorectal cancer proteomics | [234]Zhang et al. (2014) | Published supplemental Table S4
Ovarian cancer transcriptomics | [235]Cancer Genome Atlas Research Network (2011) | [236]http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/OV/20160128/gdac.broadinstitute.org_OV.mRNA_Preprocess_Median.Level_3.2016012800.0.0.tar.gz
Ovarian cancer proteomics | [237]Zhang et al. (2016) | Published supplemental Table S2
Breast Cancer (2016) transcriptomics | [238]Ciriello et al. (2015) | [239]https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz
Breast Cancer (2016) proteomics | [240]Mertins et al. (2016) | Published supplemental Table S3
Colon Cancer | [241]Vasaikar et al. (2019) | [242]http://linkedomics.org/cptac-colon/
Clear cell renal carcinoma | [243]Clark et al. (2019) | [244]https://pypi.org/project/cptac/
Breast Cancer (2020) | [245]Krug et al. (2020) | [246]https://pypi.org/project/cptac/
Endometrial Cancer | [247]Dou et al. (2020) | [248]https://pypi.org/project/cptac/
Lung Adenocarcinoma | [249]Gillette et al. (2020) | [250]https://pypi.org/project/cptac/
Head and Neck Squamous Cell Carcinoma | [251]Huang et al. (2021) | [252]https://pypi.org/project/cptac/
Glioblastoma | [253]Wang et al. (2021) | [254]https://pypi.org/project/cptac/
NCI60 cancer cell lines | [255]Guo et al. (2019) | Published supplemental Tables S6 and S1
Cancer Cell Line Encyclopedia (CCLE) transcriptomics | [256]Ghandi et al. (2019) | [257]https://depmap.org/portal/ccle/; RRID:[258]SCR_013836
CCLE proteomics | [259]Nusinow et al. (2020) | Published supplemental Tables S2 and S3; [260]https://gygi.hms.harvard.edu/publications/ccle.html
GTEx healthy tissues | [261]Jiang et al. (2020) | Published supplemental Tables S3 and S4
RNA-seq of 675 commonly used human cancer cell lines | [262]Klijn et al. (2015) | ArrayExpress: E-MTAB-2706
Protein half-life | [263]Zecha et al. (2018) | Published supplemental Table S3
NCI CPTAC DREAM Proteogenomics challenge prediction scores of the best performing model (Team Guan) | [264]Yang et al. (2020) | [265]https://heidelberg.shinyapps.io/proteoexplorer/
Software and algorithms
All analysis code | This study | [266]https://github.com/cancergenetics/limitations_of_omics_reproducibility; [267]https://doi.org/10.5281/zenodo.6956546
Python version 3.8 | Python Software Foundation | [268]https://www.python.org/; RRID:[269]SCR_008394
Pandas 1.2.5 | [270]McKinney (2011) | [271]https://pandas.pydata.org/; RRID:[272]SCR_018214
Numpy 1.20.2 | [273]Harris et al. (2020) | [274]https://numpy.org/; RRID:[275]SCR_008633
StatsModels 0.12.2 | [276]Seabold and Perktold (2010) | [277]https://www.statsmodels.org/stable/index.html; RRID:[278]SCR_016074
SciPy 1.7.1 | [279]Virtanen et al. (2020) | [280]https://www.scipy.org/; RRID:[281]SCR_008058
Matplotlib 3.3.4 | [282]Hunter (2007) | [283]https://matplotlib.org/; RRID:[284]SCR_008624
Seaborn 0.11.0 | [285]Waskom (2021) | [286]https://seaborn.pydata.org/; RRID:[287]SCR_018132
Resource availability
Lead contact
Further information and requests for resources should be directed to
and will be fulfilled by the lead contact, Colm Ryan
([289]colm.ryan@ucd.ie).
Materials availability
This study did not generate new materials.
Method details
Data collection
The datasets analysed were downloaded from the links provided in the
[290]key resources table.
For six studies ([291]Clark et al., 2019; [292]Dou et al., 2020;
[293]Gillette et al., 2020; [294]Huang et al., 2021; [295]Krug et al.,
2020; [296]Wang et al., 2021), both the transcriptomic and proteomic
profiles were obtained from the CPTAC API ([297]Lindgren et al., 2021).
For the colorectal ([298]Zhang et al., 2014) and breast cancer
([299]Mertins et al., 2016) studies, the transcriptomic data were
downloaded from cBioPortal, while the proteomic data were obtained from
the supplemental materials. For the ovarian cancer study ([300]Zhang
et al., 2016), the transcriptomic data were downloaded from
[301]https://gdac.broadinstitute.org/ and the proteomic data from the
supplemental materials. For the colon cancer ([302]Vasaikar et al.,
2019), GTEx ([303]Jiang et al., 2020), and NCI60 ([304]Guo et al.,
2019) cancer cell line studies, both the transcriptomic and proteomic
data were obtained from the supplemental tables. For the CCLE study,
the transcriptomic data were downloaded from the Cancer Dependency Map
portal ([305]https://depmap.org/portal/ccle/) and the proteomic data
were downloaded from the supplemental materials.
Pre-processing proteomic and transcriptomic profiles
Proteomics and transcriptomics data were obtained from the studies
listed in the [306]Key resources table. The proteomics datasets
contained a considerable number of missing values, identified as NaNs
in most studies or 0s in [307]Zhang et al. (2014). Within each study,
we restricted our analyses to proteins that were measured in at least
80% of samples. The same filtering was applied to transcriptomics,
requiring transcripts to be measured in at least 80% of samples. In
some datasets, multiple protein isoforms from the same gene were
available; we aggregated these using the mean to calculate a
‘gene-level’ summary.
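The filtering and aggregation steps can be sketched as follows. This is
a minimal illustration, not the study's pipeline: the function name,
the isoform-to-gene mapping, and the choice to filter before
aggregating are assumptions, and the example values are invented.

```python
import numpy as np
import pandas as pd


def preprocess(profile, gene_of, min_fraction=0.8, zeros_are_missing=False):
    """Filter sparsely measured features and collapse isoforms to gene level.

    profile: rows are protein isoforms (or transcripts), columns are samples.
    gene_of: Series mapping each row label to a gene symbol (hypothetical input).
    """
    if zeros_are_missing:
        # Some datasets encode missing values as 0 rather than NaN.
        profile = profile.replace(0, np.nan)
    # Keep features measured in at least `min_fraction` of samples.
    fraction_measured = profile.notna().mean(axis=1)
    kept = profile.loc[fraction_measured >= min_fraction]
    # Average isoforms of the same gene; NaNs are skipped by default.
    return kept.groupby(gene_of.loc[kept.index]).mean()


# Invented example: two isoforms of gene A, a sparse gene B, and gene C.
toy_profile = pd.DataFrame(
    {
        "s1": [1.0, 3.0, np.nan, 2.0],
        "s2": [2.0, 4.0, np.nan, 2.5],
        "s3": [3.0, 5.0, 1.0, np.nan],
        "s4": [4.0, 6.0, np.nan, 3.5],
        "s5": [5.0, 7.0, np.nan, 4.0],
    },
    index=["A-1", "A-2", "B-1", "C-1"],
)
toy_gene_of = pd.Series({"A-1": "A", "A-2": "A", "B-1": "B", "C-1": "C"})
gene_level = preprocess(toy_profile, toy_gene_of)
print(gene_level)
```

In this example, gene B is dropped because it is measured in only one of
five samples, gene C is retained at exactly 80%, and the two isoforms of
gene A are averaged.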
The CCLE study repeatedly profiled two 10-plexes (18 cell lines) one
year apart in order to assess the reproducibility of the proteomic
profiling. These replicates are used to perform the assessment of the
reproducibility of protein measurements presented in [308]Figure 1. In
addition to these 18 cell lines, 3 cell lines were screened in
duplicate as part of standard 10-plex runs. As suggested in the CCLE
guide ([309]Nusinow and Gygi, 2020), for these three cell lines we
selected the profiles that correlate best with the transcriptomic data
for our analyses here.
Computation of correlation coefficient
All data were processed through the standard pipeline described above
before computing correlation. Correlation between (i) mRNA-protein,
(ii) protein-protein and (iii) mRNA-mRNA was computed using the
Spearman rank correlation. For each protein in each study, samples with
missing values were ignored when computing the correlation.
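A minimal sketch of this per-gene computation is shown below. The
function name and the `min_samples` guard are hypothetical additions for
illustration; the inputs are assumed to be gene-by-sample matrices that
have already been pre-processed.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def genewise_spearman(x, y, min_samples=3):
    """Spearman correlation per gene between two gene-by-sample matrices."""
    genes = x.index.intersection(y.index)
    samples = x.columns.intersection(y.columns)
    result = {}
    for gene in genes:
        # Pair the two measurements and ignore samples with a missing value.
        pair = pd.concat([x.loc[gene, samples], y.loc[gene, samples]], axis=1).dropna()
        if len(pair) >= min_samples:
            rho, _ = spearmanr(pair.iloc[:, 0], pair.iloc[:, 1])
            result[gene] = rho
    return pd.Series(result, dtype=float)


# Invented example: gene A is perfectly concordant, gene B perfectly discordant.
toy_samples = ["s1", "s2", "s3", "s4", "s5"]
toy_mrna = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0], [5.0, 4.0, 3.0, 2.0, 1.0]],
    index=["A", "B"], columns=toy_samples,
)
toy_protein = pd.DataFrame(
    [[10.0, 20.0, 30.0, 40.0, 50.0], [1.0, 2.0, np.nan, 4.0, 5.0]],
    index=["A", "B"], columns=toy_samples,
)
toy_rho = genewise_spearman(toy_mrna, toy_protein)
print(toy_rho)
```

The same function applies to mRNA-protein, protein-protein, and
mRNA-mRNA comparisons, since each is a correlation between two matrices
indexed by the same genes and samples.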
Assessing proteomic and transcriptomic reproducibility
The CCLE quantitative proteomics data ([310]Nusinow et al., 2020)
contained three replicates of the proteomic profiles. In the first
year, 18 cell lines (two 10-plexes) were quantified (R1). The same cell
lines were quantified twice (R2, R3) the following year. The
correlations between the replicate pairs R1-R2, R1-R3, and R2-R3 were
0.7, 0.71, and 0.88, respectively. We chose to use the R1 and R3
proteomic profiles to compute the replicate correlation, as R1-R3 has
the median correlation of the three replicate pairs.
To assess the reproducibility of transcriptomic data we considered two
studies that had quantified transcripts in tumour-derived cell lines.
One of the studies chosen was the CCLE transcriptomic study for which
we have previously assessed the mRNA-protein correlation. The CCLE
transcriptomic study ([311]Ghandi et al., 2019) profiled 1,076 cancer
cell lines and [312]Klijn et al. (2015) profiled 675 cancer cell lines
using RNA-seq. The two studies quantified transcripts in different
labs in different years but had 382 cell lines and 13,226 genes in
common. The transcriptomic reproducibility was
computed using the Spearman rank correlation coefficient for the
transcriptomic measurements across the 382 common cell lines of the
studies. The standard pipeline for pre-processing was applied before
assessing the reproducibility of the transcriptomic studies.
While the CCLE transcriptomic reproducibility was computed using 382
cell lines, the CCLE proteomic reproducibility was computed using 18
cell lines only. Fewer than 10 cell lines were common to the
transcriptomic and proteomic replicates. Therefore, to compare the
predictive
power of transcriptomic reproducibility and proteomic reproducibility
in explaining the variation in mRNA-protein correlation of the
different studies, the transcriptomic reproducibility was computed for
18 random cell lines over 100 iterations. The transcriptomic
reproducibility was then used to predict the mRNA-protein correlation
of the thirteen proteogenomic studies. For each study, the mean R^2
obtained across all 100 random cell line selections was then used to
compare the predictive power of transcriptomic reproducibility and
proteomic reproducibility over the same number of proteins.
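The downsampling procedure can be sketched as follows. This is an
illustration under stated assumptions: the function and variable names
are hypothetical, the data are synthetic, and the expression matrices
are assumed to contain no missing values so that Spearman correlation
can be computed as the Pearson correlation of ranks.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def downsampled_r2(study_a, study_b, mrna_protein_corr, n_lines=18, n_iter=100, seed=0):
    """Mean R-squared over random subsets of cell lines.

    study_a, study_b: gene-by-cell-line expression matrices from two studies.
    mrna_protein_corr: per-gene mRNA-protein correlation in one study.
    """
    rng = np.random.default_rng(seed)
    lines = study_a.columns.intersection(study_b.columns)
    genes = (study_a.index.intersection(study_b.index)
             .intersection(mrna_protein_corr.index))
    r2_values = []
    for _ in range(n_iter):
        chosen = rng.choice(lines, size=n_lines, replace=False)
        # Spearman per gene = Pearson correlation of within-gene ranks.
        ranks_a = study_a.loc[genes, chosen].rank(axis=1)
        ranks_b = study_b.loc[genes, chosen].rank(axis=1)
        repro = ranks_a.corrwith(ranks_b, axis=1)
        data = pd.DataFrame({"corr": mrna_protein_corr.loc[genes], "repro": repro})
        r2_values.append(smf.ols("corr ~ repro", data=data).fit().rsquared)
    return float(np.mean(r2_values))


# Synthetic stand-in data: 50 genes measured in 40 shared cell lines.
rng = np.random.default_rng(42)
toy_genes = [f"g{i}" for i in range(50)]
toy_lines = [f"line{i}" for i in range(40)]
signal = rng.normal(size=(50, 40))
toy_a = pd.DataFrame(signal + rng.normal(scale=0.5, size=(50, 40)),
                     index=toy_genes, columns=toy_lines)
toy_b = pd.DataFrame(signal + rng.normal(scale=0.5, size=(50, 40)),
                     index=toy_genes, columns=toy_lines)
toy_corr = pd.Series(rng.uniform(-0.2, 0.9, size=50), index=toy_genes)

mean_r2 = downsampled_r2(toy_a, toy_b, toy_corr)
print(mean_r2)
```

Averaging over many random subsets reduces the dependence of the
comparison on any single choice of 18 cell lines.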
Computation of deciles
Deciles were computed using the pandas qcut method. Each decile
contains ∼10% of the overall number of items to be stratified. In some
cases, due to ties, these deciles are not uniformly sized.
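A short sketch of the decile assignment with `pandas.qcut` is shown
below. The example values are invented, and passing
`duplicates="drop"` is an assumption about one way to handle tied bin
edges; it is not confirmed by the text.

```python
import numpy as np
import pandas as pd

# With 100 distinct values, each decile contains exactly 10 items.
distinct_values = pd.Series(np.arange(100, dtype=float))
even_deciles = pd.qcut(distinct_values, q=10, labels=False)

# With many ties (thirty proteins with one unique peptide), several
# quantile edges coincide, so the bins are no longer uniformly sized.
peptide_counts = pd.Series([1] * 30 + list(range(2, 72)))
tied_bins = pd.qcut(peptide_counts, q=10, labels=False, duplicates="drop")

print(even_deciles.value_counts().sort_index())
print(tied_bins.value_counts().sort_index())
```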
Protein complex membership
Information on protein complex membership was obtained from CORUM
([313]Giurgiu et al., 2019) (all complexes data). A protein was marked
as a protein complex subunit if it is identified in CORUM data.
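A minimal sketch of this annotation step follows. It assumes a table of
complexes in which one column lists subunit gene names separated by
semicolons; the column name, function name, and example rows are
assumptions for illustration and should be checked against the
downloaded CORUM file.

```python
import pandas as pd


def complex_membership(proteins, complexes, column="subunits(Gene name)"):
    """Return 1 for proteins listed as a subunit of any complex, else 0."""
    members = set()
    for entry in complexes[column].dropna():
        # Each entry is assumed to be a semicolon-separated list of gene names.
        members.update(name.strip() for name in str(entry).split(";") if name.strip())
    proteins = list(proteins)
    return pd.Series([int(p in members) for p in proteins], index=proteins, name="pcm")


# Invented example rows in the assumed layout.
toy_complexes = pd.DataFrame({"subunits(Gene name)": ["RPL3;RPL4", "PSMA1; PSMA2"]})
toy_membership = complex_membership(["RPL3", "TP53", "PSMA2"], toy_complexes)
print(toy_membership)
```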
Protein half-lives
The half-lives of proteins were obtained from the [314]Zecha et al.
(2018) study. The median half-life of all proteins in the list was
computed. Proteins with half-lives greater than the median were encoded
as having a ‘long’ half-life, while the others were encoded as having a
‘short’ half-life.
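The median split, together with the two-sided Mann-Whitney U test
reported for Figure 5D, can be sketched as follows. The function name
and the example values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_half_life_groups(half_life, reproducibility):
    """Split proteins at the median half-life and compare reproducibility."""
    half_life = half_life.dropna()
    # The threshold is the median over all proteins in the half-life list.
    threshold = half_life.median()
    category = pd.Series(
        np.where(half_life > threshold, "long", "short"), index=half_life.index
    )
    common = category.index.intersection(reproducibility.dropna().index)
    scores = reproducibility.loc[common]
    long_scores = scores[category.loc[common] == "long"]
    short_scores = scores[category.loc[common] == "short"]
    _, p_value = mannwhitneyu(long_scores, short_scores, alternative="two-sided")
    return category, long_scores.median(), short_scores.median(), p_value


# Invented example in which reproducibility rises with half-life.
toy_half_life = pd.Series(np.arange(1.0, 21.0), index=[f"P{i}" for i in range(20)])
toy_repro = toy_half_life / 20.0
hl_category, long_median, short_median, hl_p = compare_half_life_groups(
    toy_half_life, toy_repro
)
print(long_median, short_median, hl_p)
```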
Rank aggregation
For each of the three proteomic studies with replicates (ovarian,
colon, CCLE), ranks were assigned based on increasing correlation and
normalized by dividing by the total number of proteins in the
dataset. Only proteins that were measured in at least 2 of the 3
datasets were considered for the aggregated list. For proteins measured
in only 2 studies, we imputed the third normalized rank as 0.5. For all
proteins, we then computed the mean rank as the aggregated rank of the
protein ([315]Figure S2A).
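The aggregation steps above can be sketched as follows. This is a
minimal illustration: the function name and example correlations are
invented, and the use of average ranks for ties is an assumption.

```python
import numpy as np
import pandas as pd


def aggregate_ranks(correlations, min_studies=2):
    """Average normalized rank per protein across replicate studies.

    correlations: protein-by-study matrix of replicate correlations,
    with NaN where a protein was not measured in a study.
    """
    # Rank by increasing correlation within each study, then divide by
    # the number of proteins measured in that study.
    normalized = correlations.rank(axis=0) / correlations.notna().sum(axis=0)
    # Keep proteins measured in at least `min_studies` studies.
    kept = normalized[normalized.notna().sum(axis=1) >= min_studies]
    # Impute the missing study with a neutral rank of 0.5, then average.
    return kept.fillna(0.5).mean(axis=1)


# Invented example: P3 is missing in one study, P4 in two.
toy_correlations = pd.DataFrame(
    {
        "ovarian": [0.1, 0.5, 0.9, np.nan],
        "colon": [0.2, 0.4, np.nan, 0.8],
        "ccle": [0.3, 0.6, 0.9, np.nan],
    },
    index=["P1", "P2", "P3", "P4"],
)
aggregated = aggregate_ranks(toy_correlations)
print(aggregated)
```

Here P3 receives (1 + 0.5 + 1) / 3 because its missing colon rank is
imputed as 0.5, while P4 is excluded because it was measured in only
one study.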
We compared the aggregated list of proteins obtained through our method
of aggregation ([316]Figure S2A) with other aggregated lists that we
calculated using other algorithms: robust rank aggregation ([317]Kolde
et al., 2012), Stuart ([318]Stuart et al., 2003), BordaFuse ([319]Aslam
and Montague, 2001), and Markov Chain Aggregator (MC4) ([320]Dwork
et al., 2001). To assess the performance of different aggregation
methods, we used linear models wherein the mRNA-protein correlation of
the three studies containing replicate proteomic profiles was regressed
on the different aggregated lists of protein reproducibility. The
aggregated list using our ‘average normalized rank’ approach could best
explain the variation in mRNA-protein correlation in the colorectal
cancer and CCLE studies, while the BordaFuse method best explained the
variation in the ovarian cancer study ([321]Figure S2B), followed by
our approach. As our ‘average normalized rank’ approach overall has the
highest R-squared, we chose this method to aggregate the correlations
of proteomic replicate profiles.
Linear regression models
All linear regression was carried out using the StatsModels package in
Python.
Assessing the relationship between protein-protein reproducibility,
mRNA-mRNA reproducibility, and mRNA-protein correlation
([322]Figure 6B)
To understand the variance in mRNA-protein correlation explained by
protein-protein reproducibility and mRNA-mRNA reproducibility we used
three different linear models given by the equations -
* •
Protein-protein reproducibility only:
$c(g) = \alpha + \beta \cdot p_a(g)$
* •
mRNA-mRNA reproducibility only:
$c(g) = \alpha + \beta \cdot m(g)$
* •
Protein-protein reproducibility and mRNA-mRNA reproducibility:
$c(g) = \alpha + \beta_1 \cdot p_a(g) + \beta_2 \cdot m(g)$
where $c(g)$ is the mRNA-protein correlation for each protein,
$p_a(g)$ is the aggregated protein reproducibility rank for each
protein, $m(g)$ is the mRNA-mRNA reproducibility of the corresponding
transcript of each protein and the coefficients $\alpha$, $\beta$,
$\beta_1$, and $\beta_2$
are computed using the ordinary least squares regression method. For
all the models, mRNA-protein correlation is assessed over the same set
of proteins in each study. R^2 is used to assess the predictive power
of the explanatory variables in explaining the variation of the
response variable.
Assessing the relationship between protein-protein reproducibility and
mRNA-protein correlation ([323]Figure S1A)
To understand the variance in mRNA-protein correlation explained by
protein complex membership and protein-protein reproducibility we used
three different linear models given by the equations -
* •
Protein complex membership only:
$c(g) = \alpha + \beta \cdot \mathrm{pcm}(g)$
* •
Protein-protein reproducibility only:
$c(g) = \alpha + \beta \cdot p(g)$
* •
Protein complex membership and protein-protein reproducibility:
$c(g) = \alpha + \beta_1 \cdot \mathrm{pcm}(g) + \beta_2 \cdot p(g)$
where $c(g)$ is the mRNA-protein correlation for each protein,
$\mathrm{pcm}(g)$ is the protein complex membership for each protein,
$p(g)$ is the protein-protein correlation for each protein and the
coefficients $\alpha$, $\beta$, $\beta_1$, and $\beta_2$
are computed using the ordinary least squares regression method. The
protein complex membership is indicated as 1 if a protein is a protein
complex member, else 0. For all the models, mRNA-protein correlation is
assessed over the same set of proteins in each study. R^2 is used to
assess the predictive power of the explanatory variables in explaining
the variation of the response variable.
Assessing the ability of different aggregation approaches to rank
protein-protein reproducibility ([324]Figure S2B)
To identify the best aggregation method for protein-protein
reproducibility, we compared the variance in mRNA-protein correlation
explained by different aggregation methods using linear models given by
the following equations:
* •
Robust rank aggregation:
[MATH: c(g) = α + β ∗ p_{rra}(g) :MATH]
* •
Stuart aggregation method:
[MATH: c(g) = α + β ∗ p_{stuart}(g) :MATH]
* •
BordaFuse aggregation method:
[MATH: c(g) = α + β ∗ p_{bf}(g) :MATH]
* •
Markov chain aggregator 4:
[MATH: c(g) = α + β ∗ p_{mc4}(g) :MATH]
* •
Average normalized rank:
[MATH: c(g) = α + β ∗ p_a(g) :MATH]
where [MATH: c(g) :MATH] is the mRNA-protein correlation for each
protein, and
[MATH: p_{rra}(g), p_{stuart}(g), p_{bf}(g), p_{mc4}(g) :MATH] and
[MATH: p_a(g) :MATH] are the aggregated protein reproducibility ranks
for each protein computed using robust rank aggregation, Stuart,
BordaFuse, Markov chain aggregator 4, and average normalized ranks,
respectively. The coefficients [MATH: α :MATH] and [MATH: β :MATH] are
computed using the ordinary least squares regression method. For all
the models, mRNA-protein correlation is assessed over the same set of
proteins in each study. R^2 is used to assess the predictive power of
the explanatory variables in explaining the variation of the response
variable.
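Of the aggregation methods listed above, the average normalized rank is the simplest to write down. The sketch below shows one plausible reading of it, under stated assumptions: within each study, proteins are ranked by reproducibility, each rank is divided by the number of proteins measured in that study, and the normalized ranks are averaged over the studies in which the protein was measured. The function names, the tie-breaking rule, and the toy values are assumptions for illustration and are not taken from the study.

```python
import numpy as np


def normalized_ranks(values):
    """Rank non-missing values (1 = lowest) and divide by the number of non-missing values."""
    values = np.asarray(values, dtype=float)
    ranks = np.full(values.shape, np.nan)
    observed = ~np.isnan(values)
    # argsort of argsort yields 0-based ranks; ties are broken by position (an assumption).
    order = values[observed].argsort().argsort()
    ranks[observed] = (order + 1) / observed.sum()
    return ranks


def average_normalized_rank(study_matrix):
    """Average per-study normalized ranks per protein, skipping studies where it is missing."""
    study_matrix = np.asarray(study_matrix, dtype=float)
    per_study = np.column_stack([normalized_ranks(col) for col in study_matrix.T])
    return np.nanmean(per_study, axis=1)


# Toy reproducibility values: rows are proteins, columns are three hypothetical studies.
repro = np.array([
    [0.9, 0.8, 0.7],
    [0.2, 0.1, np.nan],  # not measured in the third study
    [0.5, 0.6, 0.4],
])
p_a = average_normalized_rank(repro)
print(p_a)
```

With this convention a higher aggregated value means a more reproducibly measured protein, and proteins missing from a study are not penalized for that study.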
Comparing the ability of aggregated rank reproducibility to predict
mRNA-protein correlation with that of reproducibility calculated in
individual studies ([325]Figure S3)
For each study, we compared four different models given by the
following equations:
* •
Ovarian protein reproducibility rank:
[MATH: c(g) = α + β ∗ p_{ovarian}(g) :MATH]
* •
CCLE protein reproducibility rank:
[MATH: c(g) = α + β ∗ p_{ccle}(g) :MATH]
* •
Colon protein reproducibility rank:
[MATH: c(g) = α + β ∗ p_{colon}(g) :MATH]
* •
Aggregated protein reproducibility rank:
[MATH: c(g) = α + β ∗ p_a(g) :MATH]
where [MATH: c(g) :MATH] is the mRNA-protein correlation for each
protein, [MATH: p_{ovarian}(g), p_{ccle}(g), p_{colon}(g) :MATH] are
the protein reproducibility ranks for each protein computed using the
ovarian, CCLE, and colon proteomic replicates individually, and
[MATH: p_a(g) :MATH] is the aggregated protein reproducibility rank
computed using all three sets of replicates collectively. The
coefficients α and β are computed using the ordinary least squares
regression method. For all the models, mRNA-protein correlation is
assessed over the same set of proteins in each study. R^2 is used to
assess the predictive power of the explanatory variables in explaining
the variation of the response variable.
Assessing the impact of protein measurement reproducibility on the
accuracy of machine learning prediction of protein abundance
([326]Figure S4C)
To understand the variation in protein prediction scores that can be
explained by protein-protein reproducibility, we compared four
different models of the prediction scores from the breast and ovarian
tumour studies, given by the following equations:
* •
Ovarian protein reproducibility rank:
[MATH: p_{scores}(g) = α + β ∗ p_{ovarian}(g) :MATH]
* •
CCLE protein reproducibility rank:
[MATH: p_{scores}(g) = α + β ∗ p_{ccle}(g) :MATH]
* •
Colon protein reproducibility rank:
[MATH: p_{scores}(g) = α + β ∗ p_{colon}(g) :MATH]
* •
Aggregated protein reproducibility rank:
[MATH: p_{scores}(g) = α + β ∗ p_a(g) :MATH]
where [MATH: p_{scores}(g) :MATH] is the prediction score, that is, the
Pearson correlation between the predicted and actual protein abundance
values obtained from the best predicting model in the NCI CPTAC
Proteogenomics DREAM challenge,
[MATH: p_{ovarian}(g), p_{ccle}(g), p_{colon}(g) :MATH] are the protein
reproducibility ranks for each protein computed using the ovarian,
CCLE, and colon proteomic replicates individually, and
[MATH: p_a(g) :MATH] is the aggregated protein reproducibility rank
computed using all three sets of replicates collectively. The
coefficients [MATH: α :MATH] and [MATH: β :MATH] are computed using the
ordinary least squares regression method. For all the models, protein
reproducibility rank is assessed over the same set of proteins in each
study. R^2 is used to assess the predictive power of the explanatory
variables in explaining the variation of the response variable.
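The prediction score defined above is a per-protein Pearson correlation between predicted and actual abundances. As a hedged illustration only, the sketch below computes such scores for a toy matrix with proteins in rows and samples in columns. The function name `prediction_scores`, the matrix layout, and the numbers are assumptions for this example; the scores used in the study come from the DREAM challenge model, not from this code.

```python
import numpy as np


def prediction_scores(actual, predicted):
    """Per-protein Pearson correlation; rows are proteins, columns are samples."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Centre each protein's values across samples.
    a = actual - actual.mean(axis=1, keepdims=True)
    b = predicted - predicted.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))


# Toy abundances for two hypothetical proteins measured in four samples.
actual = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
])
predicted = np.array([
    [1.1, 1.9, 3.2, 3.8],  # tracks the actual values closely
    [3.0, 3.5, 1.0, 2.0],  # moves against the actual values
])
pscores = prediction_scores(actual, predicted)
print(pscores)
```

Each resulting score would then serve as the response variable in the single-predictor models listed above.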
Assessing the impact of mRNA abundance and mRNA variance on the
reproducibility of transcripts ([327]Figure S5E)
To understand the variation in mRNA reproducibility explained by the
potential factors (mRNA abundance, mRNA variance), we used two
different linear models given by the following equations:
* •
mRNA abundance only:
[MATH: t(g) = α + β ∗ m_{abundance}(g) :MATH]
* •
mRNA variance only:
[MATH: t(g) = α + β ∗ m_{variance}(g) :MATH]
where [MATH: t(g) :MATH] is the transcript reproducibility correlation
for each transcript, [MATH: m_{abundance}(g) :MATH] is the mean mRNA
abundance for each transcript obtained from CCLE transcriptomic data,
[MATH: m_{variance}(g) :MATH] is the variance of the mRNA abundance for
each transcript obtained from CCLE transcriptomic data, and the
coefficients α and β are computed using the ordinary least squares
regression method.
Assessing the impact of protein abundance, protein variance, unique
peptides, protein half-lives and aggregated protein reproducibility on
mRNA-protein correlation ([328]Figure S6)
To understand the variance in mRNA-protein correlation explained by the
factors (protein abundance, protein variance, unique peptides, and
protein half-lives) influencing protein reproducibility, we used two
different linear models given by the following equations:
* •
Other factors ():
[MATH: c(g) = α + β_1 ∗ p_{abundance}(g) + β_2 ∗ p_{variance}(g)
+ β_3 ∗ p_{peptides}(g) + β_4 ∗ p_{half-lives-long}(g)
+ β_5 ∗ p_{half-lives-short}(g) :MATH]
* •
Aggregated protein reproducibility:
[MATH: c(g) = α + β ∗ p_a(g) :MATH]
where [MATH: c(g) :MATH] is the mRNA-protein correlation for each
protein, [MATH: p_{abundance}(g) :MATH] is the protein abundance for
each protein obtained from the GTEx project,
[MATH: p_{variance}(g) :MATH] is the variance of the protein abundance
for each protein obtained from the GTEx project,
[MATH: p_{peptides}(g) :MATH] is the number of unique peptides for each
protein obtained from the GTEx project,
[MATH: p_{half-lives-long}(g) :MATH] and
[MATH: p_{half-lives-short}(g) :MATH] are the half-lives of each protein
(long and short), [MATH: p_a(g) :MATH] is the aggregated protein
reproducibility rank for each protein computed using the ovarian, CCLE,
and colon proteomic replicates collectively, and the coefficients
[MATH: α, β, β_1, β_2, β_3, β_4 :MATH] and [MATH: β_5 :MATH] are
computed using the ordinary least squares regression method. For all
the models, mRNA-protein correlation is assessed over the same set of
proteins in each study. R^2 is used to assess the predictive power of
the explanatory variables in explaining the variation of the response
variable.
Rank regression
We used rank regression to assess the contribution of various factors
(protein abundance, unique peptides, and protein half-lives) to
explaining the variance in protein measurement reproducibility. We
assessed both the aggregated ranks and the reproducibility measured in
each individual study. We converted the protein reproducibility
measurements from the three studies with replicates (ovarian, colon,
CCLE) to ranks.
Protein abundance and the number of unique peptides both span a large
range, so both factors were log transformed before linear regression
was performed.
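The two steps described above (converting reproducibility to ranks, then regressing the ranks on a log-transformed factor) can be sketched as follows. This is an illustrative example under stated assumptions, not the study's code: the base of the logarithm is not given in the text, so base 10 is assumed here, and the helper names, tie handling, and toy values are hypothetical.

```python
import numpy as np


def to_ranks(values):
    """Convert values to ranks 1..n (1 = lowest); ties are broken by order of appearance."""
    values = np.asarray(values, dtype=float)
    return values.argsort(kind="stable").argsort() + 1.0


def fit_ols(y, x):
    """Return (alpha, beta, r_squared) for the model y = alpha + beta * x."""
    X = np.column_stack([np.ones_like(x), x])
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = alpha + beta * x
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return alpha, beta, r2


# Toy data for six hypothetical proteins (NOT from the paper).
reproducibility = np.array([0.15, 0.42, 0.38, 0.71, 0.80, 0.66])
abundance = np.array([1.2e3, 8.5e4, 4.0e4, 9.1e6, 3.3e7, 2.5e6])

rank = to_ranks(reproducibility)      # response: reproducibility rank
log_abundance = np.log10(abundance)   # assumption: base-10 log transform
alpha, beta, r2 = fit_ols(rank, log_abundance)

print(rank)
print(f"beta = {beta:.3f}, R^2 = {r2:.3f}")
```

The log transform keeps a handful of very abundant proteins from dominating the fit, which is the motivation given in the text for transforming factors with a large range.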
Assessing the impact of protein abundance, protein variance, unique
peptides, and protein half-lives on the reproducibility of proteins
([329]Figure S5D)
To understand the variance in protein reproducibility ranks explained
by the potential factors (protein abundance, protein variance, unique
peptides, and protein half-lives), we used five different linear models
given by the following equations:
* •
Protein abundance only:
[MATH: rank(g) = α + β ∗ p_{abundance}(g) :MATH]
* •
Protein variance only:
[MATH: rank(g) = α + β ∗ p_{variance}(g) :MATH]
* •
Unique peptides only:
[MATH: rank(g) = α + β ∗ p_{peptides}(g) :MATH]
* •
Protein half-lives encoded as long and short:
[MATH: rank(g) = α + β_1 ∗ p_{half-lives-long}(g)
+ β_2 ∗ p_{half-lives-short}(g) :MATH]
* •
Protein abundance, protein variance, unique peptides and protein
half-lives combined:
[MATH: rank(g) = α + β_1 ∗ p_{abundance}(g) + β_2 ∗ p_{variance}(g)
+ β_3 ∗ p_{peptides}(g) + β_4 ∗ p_{half-lives-long}(g)
+ β_5 ∗ p_{half-lives-short}(g) :MATH]
where [MATH: rank(g) :MATH] is the protein reproducibility rank for each
protein, [MATH: p_{abundance}(g) :MATH] is the protein abundance for
each protein obtained from the GTEx project,
[MATH: p_{variance}(g) :MATH] is the variance of the protein abundance
for each protein obtained from the GTEx project,
[MATH: p_{peptides}(g) :MATH] is the number of unique peptides for each
protein obtained from the GTEx project,
[MATH: p_{half-lives-long}(g) :MATH] and
[MATH: p_{half-lives-short}(g) :MATH] are the half-lives of each protein
(long and short), and the coefficients
[MATH: α, β, β_1, β_2, β_3, β_4 :MATH] and [MATH: β_5 :MATH] are
computed using the ordinary least squares regression method. For all
the models, protein reproducibility is assessed over the same set of
proteins in each study. R^2 is used to assess the predictive power of
the explanatory variables in explaining the variation of the response
variable.
Pathway enrichment analysis
Pathway enrichment analysis was performed using the Mann-Whitney U
test. First, the KEGG pathways ([330]Kanehisa et al., 2021) and their
associated genes for Homo sapiens were downloaded using the KEGG API
([331]https://www.kegg.jp/kegg/rest/keggapi.html). Only KEGG pathways
with more than 3 genes with measured correlations were included in the
enrichment analysis. The computed mRNA-protein correlations of the CCLE
and ovarian cancer studies were used to rank the proteins. A
Mann-Whitney U test was performed to assess the rank of each pathway in
each dataset, and the p values obtained were corrected for false
discovery rate (FDR) using the Benjamini-Hochberg method. In
[332]Figure 7 and [333]S7, we specifically included pathways that have
been previously identified as enriched in different cancer studies
([334]Clark et al., 2019; [335]Huang et al., 2021; [336]Mertins et al.,
2016; [337]Zhang et al., 2014, [338]2016). To identify enriched
pathways after accounting for experimental reproducibility, we
regressed the CCLE and ovarian mRNA-protein correlations on both
aggregated protein reproducibility ranks and mRNA-mRNA reproducibility
correlations, using the model
[MATH: c(g) = α + β_1 ∗ m(g) + β_2 ∗ p(g) :MATH],
where [MATH: c(g) :MATH] is the mRNA-protein correlation,
[MATH: m(g) :MATH] is the mRNA-mRNA reproducibility,
[MATH: p(g) :MATH] is the protein-protein reproducibility, and the
coefficients [MATH: α, β_1 :MATH] and [MATH: β_2 :MATH] are computed
using the ordinary least squares regression method. The residuals
obtained from the regression were used to rank the proteins in the
pathway enrichment analysis. The top-level categories (e.g.,
Metabolism, Genetic Information Processing) of the pathways were
obtained from KEGG and are used to annotate the pathways in
[339]Figure 7 and [340]S7.
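The procedure described above (regression residuals as reproducibility-adjusted scores, a Mann-Whitney U test per pathway, and Benjamini-Hochberg correction) can be sketched as below. This is a hedged illustration on synthetic data, not the study's code. Several details are assumptions because the text does not specify them: pathway members are compared against all remaining proteins, the test is two-sided, and the gene names, pathway names, and numeric values are hypothetical. The sketch relies on `scipy.stats.mannwhitneyu`; the Benjamini-Hochberg step is written out by hand in NumPy.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def regression_residuals(c, m, p):
    """Residuals of the OLS fit c = alpha + beta1 * m + beta2 * p."""
    c = np.asarray(c, dtype=float)
    X = np.column_stack([np.ones_like(c), m, p])
    coef, *_ = np.linalg.lstsq(X, c, rcond=None)
    return c - X @ coef


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    pvals = np.asarray(pvalues, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    scaled = pvals[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity starting from the largest p value, then cap at 1.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


def pathway_enrichment(scores, pathways, min_genes=4):
    """scores: dict gene -> score; pathways: dict pathway name -> set of genes."""
    names, pvals = [], []
    for name, members in pathways.items():
        inside = [s for g, s in scores.items() if g in members]
        outside = [s for g, s in scores.items() if g not in members]
        if len(inside) < min_genes or not outside:
            continue  # keep only pathways with more than 3 measured genes
        # Assumption: members vs. all other proteins, two-sided alternative.
        result = mannwhitneyu(inside, outside, alternative="two-sided")
        names.append(name)
        pvals.append(result.pvalue)
    return dict(zip(names, benjamini_hochberg(pvals)))


# Synthetic toy data (NOT from the paper): 30 hypothetical genes.
rng = np.random.default_rng(1)
genes = [f"g{i}" for i in range(30)]
m = rng.uniform(0.5, 1.0, size=30)  # mRNA-mRNA reproducibility
p = rng.uniform(0.0, 1.0, size=30)  # protein reproducibility rank
boost = np.where(np.arange(30) < 6, 0.3, 0.0)  # first six genes get a higher correlation
c = 0.2 * m + 0.2 * p + boost + rng.normal(0.0, 0.01, size=30)

residuals = regression_residuals(c, m, p)
scores = dict(zip(genes, residuals))
pathways = {
    "toy_pathway_A": set(genes[:6]),
    "toy_pathway_B": {"g10", "g15", "g20", "g25"},
    "too_small": {"g7", "g8"},  # dropped by the size filter
}
q_values = pathway_enrichment(scores, pathways)
print(q_values)
```

Ranking on residuals rather than raw correlations means a pathway is flagged only if its proteins correlate with their mRNAs more (or less) than their measurement reproducibility alone would predict.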
Quantification and statistical analysis
Statistical analysis is described in the [341]Method details and was
carried out using Python 3.8, pandas 1.2.5 ([342]McKinney, 2011), NumPy
1.20.2 ([343]Harris et al., 2020), SciPy 1.7.1 ([344]Virtanen et al.,
2020), and StatsModels 0.12.2 ([345]Seabold and Perktold, 2010). The
figures were created with Matplotlib 3.3.4 ([346]Hunter, 2007) and
Seaborn 0.11.1 ([347]Waskom, 2021).
Acknowledgments