Graphical abstract

   graphic file with name fx1.jpg
   [25]Open in a new tab

Highlights

     * •
       The reproducibility of protein abundance measurements varies across
       proteins
     * •
       Measurement reproducibility contributes to variation in
       mRNA-protein correlations
     * •
       Aggregated protein reproducibility score integrates multiple
       reproducibility estimates
     * •
       Pathways with high mRNA-protein correlations may just be more
       reproducibly measured

Motivation

   Because they are easier to measure in high throughput, mRNA abundances
   are often used as a proxy measurement for protein abundances. However,
   there is only a moderate correlation between the two, and it is unclear
   to what extent this moderate correlation reflects post-transcriptional
   regulation and to what extent it can be attributed to measurement
   error. Here, by analyzing samples with replicate proteomes, we quantify
   the extent to which replicate measurements of the same proteins are
   correlated. We rank proteins according to their reproducibility and
   show that more reproducibly measured proteins have higher mRNA-protein
   correlation, suggesting that measurement error limits mRNA-protein
   correlation.
     __________________________________________________________________

   Upadhya and Ryan identified substantial variation in the
   reproducibility of protein abundance measurements across proteins. They
   develop an aggregate protein reproducibility score and show that this
   score explains variation in mRNA-protein correlations across multiple
   proteogenomic studies. They show that pathways with higher-than-average
   mRNA-protein correlations may simply be more reproducibly measured.

Introduction

   Proteins are the primary actors in our cells, responsible for almost
   all biological activities. Therefore, understanding how protein
   abundances vary between healthy and disease states can provide an
   insight into how biological activities are altered in disease
   conditions. Among patients with the same disease, e.g., breast cancer,
   variation in protein abundances may explain differences in survival
   outcomes ([26]Ősz et al., 2021) and drug responses ([27]Shenoy et al.,
   2020). Consequently, significant efforts have been made recently to
   characterize proteomes across large patient cohorts ([28]Ellis et al.,
   2013). However, our ability to quantify protein abundances at scale has
   lagged behind our ability to sequence genomes and quantify mRNA
   abundances. Large-scale efforts to molecularly characterize healthy and
   disease samples from humans have therefore primarily focused on DNA
   sequence variation and transcriptomic variation.

   As transcriptomes are easier to quantify than proteomes, mRNA
   abundances are often used as a proxy for protein abundances. However,
   the relationship between mRNA abundances and protein abundances is
   complex and non-linear and varies significantly from protein to
   protein. Consistent with this, large-scale studies in humans and model
   organisms have revealed that for most genes there is only a moderate
   correlation between mRNA and protein abundances ([29]Buccitelli and
   Selbach, 2020; [30]Vogel and Marcotte, 2012). We note that correlations
   between mRNA and protein abundances can be calculated in two different
   ways: across all proteins within a given sample (i.e., in a given cell
   line, are the most abundant proteins also the most abundant
   transcripts?) or for a single protein across multiple samples (i.e., do
   the samples with the highest levels of a specific protein also have the
   highest number of transcripts coding for that protein?) ([31]Franks
   et al., 2017; [32]Liu et al., 2016; [33]Vogel and Marcotte, 2012).
   Here, we are concerned with variation across individuals, and so
   throughout when we discuss mRNA-protein correlations, we are
   calculating the correlation between the protein and transcript
   abundance for an individual protein across samples.

   Tumor samples in particular have been subject to transcriptomic and
   proteomic profiling efforts, and these have provided insight into how
   variation in mRNA abundances across individuals is associated with
   variation in protein abundances across the same individuals. These
   studies have reported an average mRNA-protein correlation in the range
   of ∼0.2–0.5 ([34]Mertins et al., 2016; [35]Zhang et al., 2014,
   [36]2016). This moderate correlation between mRNA and protein
   abundances can be attributed to both biological and technical factors.
   Major biological factors that influence mRNA-protein correlation
   include translation rates that vary across proteins and conditions,
   highly variable half-lives for both proteins and mRNAs, and
   post-translational modifications that can alter protein stability and
   degradation ([37]Buccitelli and Selbach, 2020).

   Different proteins have been observed to have very different
   mRNA-protein correlations, and pathway enrichment analyses have
   identified specific functional groups with lower- or
   higher-than-average mRNA-protein correlations. For instance, a number
   of metabolic pathways have been shown to have higher-than average
   mRNA-protein correlations ([38]Clark et al., 2019; [39]Huang et al.,
   2021; [40]Jarnuczak et al., 2021; [41]Mertins et al., 2016; [42]Zhang
   et al., 2014, [43]2016), suggesting limited post-transcriptional
   regulation of these proteins. In contrast, subunits of large protein
   complexes have been shown to have lower-than-average mRNA-protein
   correlations, suggesting significant post-transcriptional regulation
   ([44]Gonçalves et al., 2017; [45]Ryan et al., 2017; [46]Taggart et al.,
   2020; [47]Wang et al., 2017; [48]Wu et al., 2013). Another factor that
   might influence mRNA-protein correlations across samples is the
   intrinsic variability in mRNA expression. mRNAs that do not vary across
   samples, such as those whose expression is usually tightly regulated,
   will not correlate with their corresponding proteins because variation
   is essential to observe correlation. As we focus our analysis on tumor
   profiles, where extensive copy-number alterations result in significant
   variation in mRNA abundances, this issue is a smaller concern.

   Our technical ability to accurately and reproducibly quantify both
   mRNAs and proteins is potentially a major factor that influences the
   mRNA-protein correlation. If the error in our measurements is large, we
   would expect this error to reduce the correlation between mRNA and
   protein even in the absence of the biological factors outlined above. A
   number of studies have separately assessed the reproducibility of
   either mRNA ([49]’t Hoen et al., 2013; [50]Marioni et al., 2008;
   [51]SEQC/MAQC-III Consortium, 2014) or proteomic ([52]Casey et al.,
   2017; [53]Tabb et al., 2010) profiling approaches. Others have explored
   how measurement errors in mRNA or proteomic profiling can influence the
   reported correlation between mRNA and protein abundances within sample
   correlations (across all proteins within a single sample/cell line)
   rather than across samples (for individual proteins across many
   samples) ([54]Csárdi et al., 2015; [55]Li et al., 2014). Here, we
   analyze studies of tumors and cancer cell lines with replicate
   proteomic profiles in order to assess the impact of measurement
   reproducibility on mRNA-protein correlation that can be observed for
   individual proteins across samples.

Results

A standardized pipeline reveals differences in the mRNA-protein correlation
across studies

   The average mRNA-protein correlation reported for different tumor
   proteomic profiling efforts varies substantially across studies—ranging
   from 0.23 in an early proteomic study of colorectal cancer ([56]Zhang
   et al., 2014) to 0.53 in a recent study of lung adenocarcinoma
   ([57]Gillette et al., 2020) ([58]Table 1). However, it is not
   meaningful to directly compare the reported correlations because the
   methods used to quantify the mRNA-protein correlation have varied
   across studies—different studies have used different summary statistics
   (mean versus median), different correlation metrics (Pearson versus
   Spearman), and different criteria for protein inclusion (e.g., no
   missing values, at least 30% measured values, only the 10% most
   variable proteins) ([59]Table 1). To enable a more direct comparison
   across studies, we calculated the mRNA-protein correlation for thirteen
   proteomic studies using a standardized pipeline. The datasets analyzed
   comprise ten studies of tumor samples ([60]Clark et al., 2019; [61]Dou
   et al., 2020; [62]Gillette et al., 2020; [63]Huang et al., 2021;
   [64]Krug et al., 2020; [65]Mertins et al., 2016; [66]Vasaikar et al.,
   2019; [67]Wang et al., 2021; [68]Zhang et al., 2014, [69]2016), two
   studies of cancer cell lines ([70]Guo et al., 2019; [71]Nusinow et al.,
   2020), and one study of healthy tissues ([72]Jiang et al., 2020).
   Within each study, we calculated the median Spearman correlation
   between mRNA and protein for all proteins that were measured in at
   least 80% of samples ([73]STAR Methods; [74]Tables 1 and [75]S1).
   Applying the same pipeline using Pearson correlation rather than
   Spearman correlation revealed broadly similar results ([76]Table 1),
   and so throughout the remainder of the paper, we focus our analysis on
   correlation calculated using Spearman correlation as it is the metric
   most commonly used in proteogenomic studies (9 of 13 studies).

Table 1.

   Analysis of mRNA-protein correlation using a standardized pipeline
   Data Published year Reported correlation Protein inclusion criterion in
   reported correlation Computed median Spearman correlation Computed
   median Pearson correlation
   GTEx 32 healthy tissues (GTEx) 2020 0.46 <5 tissues with missing values
   for both protein and RNA measurements 0.51 0.59
   Cancer Cell Line Encyclopaedia (CCLE) 2020 0.48 quantified in at least
   one ten-plex (9 cell lines) 0.46 0.48
   NCI-60 cancer cell lines (NCI60) 2019 not reported – 0.36 0.40
   Glioblastoma (GBM) 2021 not reported – 0.50 0.51
   Head and neck squamous cell carcinoma (HNSCC) 2021 0.52 <50% missing
   values 0.54 0.56
   Lung adenocarcinoma (LUAD) 2020 0.53 <50% missing values 0.55 0.56
   Endometrial cancer (EC) 2020 0.48 contain mRNA and protein measurements
   across all patients 0.48 0.51
   Breast cancer (BrCa 2020) 2020 0.41 contain mRNA and protein
   measurements (proteins <70% missing values) 0.44 0.43
   Clear cell renal carcinoma (ccRCC) 2019 0.43 contain mRNA and protein
   measurements across all patients 0.41 0.42
   Colon cancer (colon) 2019 0.48 top 10% most variably expressed proteins
   quantified in both platforms 0.27 0.28
   Ovarian cancer (ovarian) 2016 0.45 contain mRNA and protein
   measurements across all patients 0.41 0.41
   Breast cancer (BrCa 2016) 2016 0.39 contain mRNA and protein
   measurements across all patients passing quality control checks. 0.42
   0.42
   Colon and rectal cancer (CRC 2014) 2014 0.23 protein measurement with
   average spectral count across all patients ≥1.4 0.21 0.22
   [77]Open in a new tab

   Across all studies, the median recalculated correlation was 0.43 with a
   maximum of 0.55 (lung adenocarcinoma [LUAD]; [78]Gillette et al., 2020)
   and a minimum of 0.21 (colorectal cancer [CRC]; [79]Zhang et al.,
   2014). In some instances, the recalculated correlation was similar to
   that originally reported, but in others there was a substantial
   difference. For example, the correlation recalculated for endometrial
   cancer (0.48) was the same as originally reported ([80]Dou et al.,
   2020), while the recalculated correlation for colon cancer was much
   lower than that reported by the authors (0.27 versus 0.48)
   ([81]Vasaikar et al., 2019). This is because the colon cancer study
   reported the mean mRNA-protein correlation for only the 10% most
   variable proteins rather than the full set of proteins. These highly
   variable proteins have higher than average mRNA-protein correlations.

   More recent studies appear to have higher mRNA-protein correlations,
   e.g., we observe a mean of 0.49 for studies published after 2019 versus
   0.35 for studies published in 2016 or earlier ([82]Table 1). This
   cannot simply be attributed to differences in the cancer types studied
   in different years, as the two cancer types profiled twice (colon and
   breast) see an improvement from the earlier studies ([83]Table 1). This
   would suggest that technical and experimental factors may influence the
   reported mRNA-protein correlations and that improvements in either
   technology or experimental protocols have resulted in improved
   mRNA-protein correlations over time.

The correlation across replicate proteomic profiles is only moderate

   To assess the reproducibility of mass spectrometry-based proteomic
   measurements, we analyzed three studies containing replicate proteomic
   profiles: ovarian tumor samples ([84]Zhang et al., 2016), colon tumor
   samples ([85]Vasaikar et al., 2019), and cancer cell lines of mixed
   lineages from the Cancer Cell Line Encyclopedia (CCLE) ([86]Nusinow
   et al., 2020) ([87]Figure 1A). The nature of the replicates varies
   across the different studies: for ovarian cancer, the same tumor sample
   was profiled in two different laboratories, for the cancer cell lines,
   biological replicates were performed within the same lab 1 year apart,
   while for colon cancer, the same tumor samples were profiled with two
   different mass spectrometry (MS) techniques, i.e., isotope-based
   protein quantification (TMT-10) and label-free spectral counting MS.
   Thus, there is diversity in the replicate proteomic profiles in terms
   of sample types (tumor samples and cancer cell lines), sites, and
   techniques used to quantify the proteins.

Figure 1.

   [88]Figure 1
   [89]Open in a new tab

   Protein-protein reproducibility across replicates is moderate and
   variable

   (A) Overview of the replicates available for the three different
   proteomic studies.

   (B) For each study, we calculate the Spearman correlation for
   individual proteins across the proteomic replicates. The distribution
   of the protein-protein reproducibility is shown in the histogram for
   all measured proteins. For each study, the black dashed line represents
   the median.

   Many biological factors that influence mRNA-protein correlation, such
   as post-transcriptional regulation, are not relevant in the case of
   replicate measurements of proteins, and so we expected the replicate
   proteomic profiles to be more highly correlated than mRNA and protein
   profiles. This was indeed the case for all studies. The median
   protein-protein reproducibility for the replicate proteomic profiles
   from the CCLE dataset was 0.72 ([90]Figure 1B; [91]Table S2), whereas
   the median mRNA-protein correlation was only 0.48 ([92]Table 1). The
   median protein-protein reproducibility for the replicate proteomic
   profiles of ovarian tumors was 0.57 ([93]Figure 1B), which is higher
   than the median mRNA-protein correlation of 0.41 ([94]Table 1). The
   replicate protein-protein reproducibility for the colon study (median
   0.28) was much lower than that observed for the other studies. However,
   it was still higher than the median-calculated mRNA-protein correlation
   (0.21). One reason for the colon study to have a low median
   protein-protein reproducibility is that one of the two replicate
   proteomic profiles is quantified using label-free/spectral counting MS,
   which is not as accurate as the stable isotope-based protein
   quantification methods ([95]Liu et al., 2016). Overall, we can conclude
   that although protein-protein reproducibility is consistently higher
   than mRNA-protein correlations, the protein-protein reproducibility is
   still only moderate.

Proteins with higher reproducibility have higher mRNA-protein correlation

   The moderate correlations reported between mRNA and protein abundances
   have been attributed to a variety of biological factors, including
   post-transcriptional regulation, varying translation rates, and varying
   degradation rates ([96]Buccitelli and Selbach, 2020; [97]Payne, 2015;
   [98]Vogel and Marcotte, 2012). However, our observation that some
   proteins can be quantified more reproducibly than others suggests that
   noise in quantification may also be a major factor. If this is the
   case, we would expect that proteins that can be more reproducibly
   quantified will have a higher mRNA-protein correlation. To assess this,
   for each study we used the replicate proteomic profiles to stratify the
   proteins into deciles, ranging from the 10% of proteins with the lowest
   protein-protein reproducibility to the 10% with the highest
   protein-protein reproducibility ([99]STAR Methods). We then calculated
   the mRNA-protein correlation for all of the proteins within each
   decile. We found, for all three studies, that the median mRNA-protein
   correlation increases with protein-protein reproducibility
   ([100]Figure 2). The colon cancer study shows a difference in the
   median mRNA-protein correlation of 0.33 between the first and last
   deciles of protein reproducibility. Similarly, ovarian cancer data show
   a difference of 0.35, and the CCLE data show a difference of 0.37. This
   indicates that the reproducibility of proteomic measurements has a
   major impact on the calculated mRNA-protein correlation. We used a
   linear regression model to understand how much of the variation in
   mRNA-protein correlation can be explained by variation in
   protein-protein reproducibility and found that it explains
   approximately 14%, 17%, and 23% in the ovarian, CCLE, and colon
   studies, respectively ([101]STAR Methods; [102]Figure 2 and [103]S1A).

Figure 2.

   [104]Figure 2
   [105]Open in a new tab

   Proteins with higher reproducibility have higher mRNA-protein
   correlation

   (A–C) Boxplots showing the distribution of mRNA-protein correlation for
   proteins binned according to their protein-protein reproducibility in
   the colon (A), ovarian (B), and CCLE (C) studies. The total number of
   proteins considered for each plot is indicated at the top right corner.
   The bins are deciles—each containing ∼10% of the proteins. The decile
   is indicated on the x axis along with the highest correlation between
   experimental replicates present within that decile. For each box plot,
   the black central line represents the median, the top and bottom lines
   represent the 1st and 3rd quartiles, and the whiskers extend to 1.5
   times the interquartile range past the box. Outliers are not shown. The
   median of each decile is indicated above/below the black central line
   for each box plot. The median mRNA-protein correlation across all
   proteins for each study is indicated as a dotted gray line in each
   plot. The R^2 obtained from regressing the mRNA-protein correlation on
   protein-protein reproducibility is in the bottom right corner.

   Previous work has identified protein complex membership as the factor
   most predictive of variation in mRNA-protein correlation, with subunits
   of protein complexes typically having lower-than-average mRNA-protein
   correlation ([106]Gonçalves et al., 2017; [107]Ryan et al., 2017).
   Using the same linear modeling approach as above, we found that protein
   complex membership explains approximately 3%, 8%, and 6.7% of the
   variation in the ovarian, CCLE, and colon studies, respectively
   ([108]Figure S1A). This suggests that noise in the quantification of
   protein abundances explains much more (on average ∼3 times) of the
   variance in mRNA-protein correlation than the most predictive
   previously identified factor. Combined, the protein-protein
   reproducibility and protein complex membership features explained
   approximately 17%, 23%, and 26% of the variation in mRNA-protein
   correlation in the ovarian, CCLE, and colon studies, respectively
   ([109]Figure S1A). This is significantly more than protein complex
   membership or protein-protein reproducibility alone (p < 0.001,
   likelihood ratio test), suggesting that protein complex membership and
   protein reproducibility independently contribute to the variation in
   mRNA-protein correlation. This is also evident when binning proteins
   into reproducibility deciles—although proteins that are complex
   subunits are present in every decile, they have consistently lower
   mRNA-protein correlations ([110]Figures S1B–S1D).

Proteins with high reproducibility in one study are also highly reproducible
in other studies

   In addition to providing a summary of how reproducible the protein
   measurements from each study are on average, the replicate profiles
   enable us to see which proteins are most reproducibly quantified
   overall. In the CCLE study, the median correlation between replicate
   measurements calculated across all proteins was 0.72, but this ranged
   from −0.2 to 1.0 for individual proteins. Similarly, the median for all
   proteins in the ovarian study was 0.57, but the individual correlations
   ranged from −0.6 to 1.0, and the median for the colon tumor study was
   0.28 with a range from −0.2 to 0.8. This suggests that, at least within
   individual studies, some proteins may be more reproducibly quantified
   than others.

   To understand whether the same proteins were reproducibly quantified
   across multiple studies, we analyzed pairs of studies together. We
   found that there was a moderate correlation (0.38) between the protein
   reproducibility calculated using the ovarian tumor replicates and the
   colon cancer replicates ([111]Figure 3A). Combinations of other pairs
   of studies revealed similar moderate correlations: colon and CCLE
   (0.31) and ovarian and CCLE (0.24) ([112]Figures 3B and 3C). Although
   the nature of the samples (tumor versus cell line) and the
   quantification approaches (TMT/label-free quantification) varied across
   studies, this suggests that there is some agreement in terms of which
   proteins can be reproducibly quantified. In general, proteins that are
   highly reproducible in one study tend to be highly reproducible in
   others, while proteins that show poor reproducibility in one study tend
   to show poor reproducibility in others ([113]Figure 3). For example,
   GBP1 is one of the proteins with reproducibility that is consistently
   high across all three studies ([114]Figure 3D), while RPS29 has
   consistent low reproducibility ([115]Figure 3E).

Figure 3.

   [116]Figure 3
   [117]Open in a new tab

   Proteins with high reproducibility in one study are also highly
   reproducible in other studies

   (A–C) Binned heatmaps showing the relationship between the
   protein-protein reproducibility calculated in different studies. Each
   heatmap shows the relationship between two studies, indicated on the x
   and y axes. The regions of the heatmaps are colored according to the
   number of proteins present in the region as indicated in the color bar.
   The number of proteins in common and Spearman correlation between the
   two studies, with the associated p value, are specified in the box for
   each of the plots.

   (D and E) For each study with experimental protein replicates, scatter
   plots illustrating the relationship between protein-protein
   reproducibility are shown for a protein with high reproducibility, GBP1
   (D), and a protein with low reproducibility, RPS29 (E). For each
   scatter plot, the Spearman correlation coefficient of the
   protein-protein reproducibility and the associated p value is indicated
   at the bottom.

An integrated ranking of protein reproducibility partially explains the
variable mRNA-protein correlation in 10 additional studies

   Proteogenomic studies with large numbers of replicates, such as the
   three we analyzed above, are the exception rather than the rule.
   Consequently, for most studies, we do not know how reproducible the
   proteomic measurements are. However, as noted above, proteins that are
   highly reproducibly quantified in one study are more likely to be
   highly reproducible in others. We therefore sought to aggregate the
   replicate protein correlations from all three studies (CCLE, ovarian,
   colon) into a single list containing a ranking of protein
   reproducibility ([118]STAR Methods; [119]Figure S2A; [120]Table S2). We
   evaluated a number of different aggregation approaches and found that a
   simple method using average normalized rank explained the most variance
   in mRNA-protein correlations of the three studies containing proteomic
   replicates ([121]STAR Methods; [122]Figure S2B). We used this approach
   to create a ranked order of protein reproducibility for the 5,211
   proteins that were quantified in at least two out of the three studies.
   We then used this aggregated list to assess the extent to which
   “average” protein reproducibility explains the varying mRNA-protein
   correlations observed in ten other studies ([123]Clark et al., 2019;
   [124]Dou et al., 2020; [125]Gillette et al., 2020; [126]Guo et al.,
   2019; [127]Huang et al., 2021; [128]Jiang et al., 2020; [129]Krug
   et al., 2020; [130]Mertins et al., 2016; [131]Wang et al., 2021;
   [132]Zhang et al., 2014) ([133]Figure 4). For all these studies, we
   find that proteins with more reproducible measurements tend to have
   higher mRNA-protein correlations. Although the aggregated ranks are
   based on data from cancer studies, we observe the same trend in healthy
   tissues obtained from the GTEx project ([134]Figure 4J). Similarly,
   although the aggregated ranks are generated using studies that quantify
   proteins through data-dependent acquisition (DDA) approaches, we
   observed the same trend for a study that quantified proteins using
   data-independent acquisition (DIA)-based proteomics (sequential window
   acquisition of all theoretical mass spectra [SWATH-MS]) in the NCI-60
   cancer cell lines ([135]Figure 4I). In general, the mRNA-protein
   correlation increases with protein reproducibility for samples from
   both healthy and diseased conditions and irrespective of the proteomic
   quantification approach.

Figure 4.

   [136]Figure 4
   [137]Open in a new tab

   Aggregated protein reproducibility ranks partially explains the
   variable mRNA-protein correlation in 10 additional studies

   (A–J) For studies without experimental proteomic replicates, boxplots
   showing the distributions of mRNA-protein correlation for proteins in
   each decile of the aggregated protein reproducibility ranks. (A)–(H)
   are the CPTAC tumor studies; (I) is the NCI-60 cancer cell lines study
   wherein protein quantification, used for computing the mRNA-protein
   correlation, is obtained from data-independent acquisition-based
   untargeted proteomics (SWATH-MS); and (J) is the healthy tissues study
   from the GTEx Consortium. Box plot details as in [138]Figure 2.

   To quantify the amount of variation in mRNA-protein correlation that
   could be explained by our aggregated protein reproducibility ranks, we
   used a linear regression model for the ten different studies. We found
   that the aggregated ranks explain ∼10%–20% (median 14%) of the
   variation in these studies ([139]Figure 4).

   To test if there was an advantage to using the aggregate protein
   reproducibility over protein reproducibility measured in either of the
   three individual studies (CCLE, ovarian, colon), we compared the
   variance explained by the aggregate ranks with that explained by each
   individual study. In all ten studies without proteomic replicates, the
   aggregated ranks explained the variation in mRNA-protein correlation
   better than the ranks from any individual dataset ([140]Figure S3).

   A number of efforts have been made to use machine learning to predict
   protein abundances from mRNA abundances ([141]Fortelny et al., 2017;
   [142]Li et al., 2019; [143]Yang et al., 2020). Recently, the NCI-CPTAC
   DREAM proteogenomics challenge engaged the community to predict protein
   abundances of breast and ovarian tumor profiles using their
   corresponding genomic and transcriptomic information ([144]Yang et al.,
   2020). We hypothesized that proteins whose measurements are highly
   reproducible could be predicted better using machine-learning
   algorithms. Hence, we analyzed the prediction scores from the
   best-performing model using the protein reproducibility data. We
   observed a stark difference in the prediction scores of the lowest and
   highest deciles of the protein reproducibility ([145]Figures S4A and
   S4B). While the lowest decile has a correlation of ∼0.35 between the
   measurements and predictions, the highest decile has a correlation of
   ∼0.7. The aggregated protein reproducibility ranks could explain ∼25%
   and 26% of the variation in the prediction scores of breast and ovarian
   cancer studies, respectively, again outperforming the reproducibility
   measured in any individual study ([146]Figure S4C).

Protein measurement reproducibility is influenced by abundance, variance, and
unique peptides

   To understand what causes some proteins to be more reproducibly
   measured than others, we analyzed a number of factors that we
   hypothesized might influence the reliability of their measurements.

   All of the studies analyzed here make use of “bottom-up” quantification
   approaches where proteins are first digested into peptides; these
   peptides are then quantified using a mass spectrometer, and peptide
   quantifications are converted into protein abundances computationally.
   This quantification is a stochastic process, and there is no guarantee
   that every peptide in a given sample will be detected by the mass
   spectrometer. The quantification of proteins that have low abundance,
   and hence fewer detectable peptides, is especially likely to be subject
   to substantial stochastic variation. A small number of peptides missed
   can make a big difference to the quantification of these low abundance
   proteins, while for highly abundant proteins, a few extra or missing
   peptides will make only a small difference. To assess the contribution
   of protein abundance to protein measurement reproducibility, we
   obtained the protein abundances measured in 201 tissue samples from 32
   healthy human tissues collected by the GTEx project ([147]Jiang et al.,
   2020). For each protein, we calculated the mean abundance across all
   samples and tissues. We found a clear relationship between the mean
   protein abundance and the aggregated protein reproducibility rank—more
   abundant proteins are more reproducibly measured ([148]Figure 5A). We
   performed a similar analysis for the three individual proteomic
   replicate studies and found the result to be consistent
   ([149]Figures S5A–S5C).

Figure 5.

   [150]Figure 5
   [151]Open in a new tab

   Protein reproducibility is mainly influenced by abundance, variance,
   and unique peptides and not protein half-lives

   (A–C) Boxplots showing the distribution of aggregated protein
   reproducibility ranks for proteins binned according to protein
   abundance (A), variance (B), and number of unique peptides (C). Box
   plot details as in [152]Figure 2.

   (D) Boxplot showing the distribution of aggregated protein
   reproducibility ranks for proteins with short and long protein
   half-lives.

   Proteins whose abundances do not vary significantly across individuals
   are unlikely to have high mRNA-protein correlations, as correlation
   measures are dependent on there being meaningful variation in the data.
   Furthermore, as the variation observed experimentally is likely a
   combination of both real biological variation and experimental noise,
   proteins with lower biological variation in abundance will tend to be
   more affected by measurement noise. For each protein, we computed the
   variance in protein abundance across samples from the GTEx project
   ([153]Jiang et al., 2020). We then assessed the influence of this
   variance on the reproducibility of measurements of individual proteins.
   Similar to the mean protein abundance above, we found that proteins
   with a higher variance of protein abundance are more reproducibly
   measured ([154]Figure 5B). Furthermore, the variance of protein
   abundance explains ∼20% of the variation in the aggregated protein
   reproducibility ranks. Similar trends were observed for the three
   individual proteomic replicate studies ([155]Figures S5A–S5C).

   The number of unique peptides generated per protein is also crucial for
   protein quantification by MS. To assess the impact of this, we
   identified the number of unique peptides identified per protein using
   the GTEx study. We stratified all proteins into deciles based on the
   number of unique peptides identified and found that the aggregated
   protein reproducibility increased with every decile of unique peptides
   identified ([156]Figure 5C). This pattern was also evident in the
   protein reproducibility measured in each of the three individual
   studies ([157]Figures S5A–S5C). Thus, the more unique peptides
   identified per protein, the higher the confidence of the measured
   protein levels.

   One of the biological reasons proposed for the weak mRNA-protein
   correlation is the difference in mRNA and protein half-lives
   ([158]Vogel and Marcotte, 2012). mRNAs typically have a half-life of
   2.6–7 h, while proteins have half-lives ranging from a few seconds to a
   few days ([159]Vogel and Marcotte, 2012). Recently, proteins with
   longer half-lives were found to be more predictable using machine
   learning, irrespective of the transcript half-lives ([160]Yang et al.,
   2020). This led us to assess protein half-life as a potential factor
   for the reproducibility of protein measurements. We obtained protein
   half-lives estimations from a previous publication ([161]Zecha et al.,
   2018) and divided them into two categories—long and short half-lives
   ([162]STAR Methods)—as was done in [163]Yang et al. (2020). Although
   both categories contain proteins with reproducibility scores ranging
   from 0 to 1, proteins with a long half-life have a higher median
   protein reproducibility score (p = 9.70e−25, Mann-Whitney U test,
   two-sided; [164]Figures 5D and [165]S5A–S5C).

   We note that there is some correlation between the attributes
   considered, in particular more abundant proteins tend to have more
   unique peptides identified. To understand the relative contribution of
   each factor, we performed rank regression by using the individual
   factors as the explanatory variables and the ranks of the proteomic
   reproducibility as the response variable ([166]STAR Methods). We found
   in all cases that a model including all four factors performed better
   than a model including only the best individual factor, suggesting that
   variance in reproducibility can best be explained by a combination of
   factors ([167]Figure S5D).

   The factors above all contribute to protein-protein reproducibility,
   raising the question of whether they themselves might be sufficient to
   explain variation in mRNA-protein correlation. To assess this, we
   performed linear regression with these factors (abundance, variance,
   unique peptides, and protein half-lives) as explanatory variables and
   the mRNA-protein correlation of each of the 13 different studies as
   response variables. We found that a combined model of the factors
   explained ∼3%–17% of the variation in mRNA-protein correlation of the
   different studies ([168]Figure S6). However, the aggregated protein
   reproducibility explains a considerably higher percentage of the
   variation in mRNA-protein correlation in 12 of 13 studies. The GTEx
   study is the lone exception, likely a result of the independent
   variables (protein abundance, variance, number of unique peptides)
   being calculated from the GTEx study itself ([169]Figure S6).

Transcriptomic reproducibility also contributes to the variance in
mRNA-protein correlation

   Thus far, we have primarily focused on understanding the influence of
   protein quantification reproducibility on mRNA-protein correlation.
   However, it is also likely that the reproducibility of mRNA
   measurements is an important factor in determining mRNA-protein
   correlations. To assess the impact of transcriptomic reproducibility on
   mRNA-protein correlation, we compared transcriptomic profiles for 382
   cancer cell lines from the CCLE ([170]Ghandi et al., 2019) with those
   generated in a separate profiling effort ([171]Klijn et al., 2015). We
   find that the median gene-wise Spearman correlation across studies was
   0.75 ([172]STAR Methods; [173]Figure 6A). Again, this varied
   significantly across transcripts, ranging from −0.05 to 0.96. As with
   protein reproducibility, we find that transcriptomic reproducibility is
   influenced by both mRNA abundance and variance ([174]STAR Methods;
   [175]Figure S5E).

Figure 6.

   [176]Figure 6
   [177]Open in a new tab

   Transcriptomic reproducibility contributes to the variance in
   mRNA-protein correlation

   (A) Histogram showing the distribution of the gene-wise correlation
   between experimental transcriptomic replicates of 382 cancer cell
   lines. The black line represents the median.

   (B) For each of the 13 studies analyzed here, the R-squared obtained by
   regressing mRNA-protein correlation on transcriptomic reproducibility
   and aggregated protein reproducibility scores individually and in
   combination over the same set of proteins is shown in the dot plot. The
   number of proteins analyzed for each study is indicated in brackets
   below the study on the y axis.

   We used a linear regression model to quantify, in all thirteen
   proteogenomic studies, how much of the variation in mRNA-protein
   correlation could be explained by transcriptomic reproducibility. We
   found that the median variance explained was 15%. In most studies
   (8/13), our aggregated protein reproducibility explained a higher
   proportion of the variance than the mRNA reproducibility
   ([178]Figure 6B).

   Compared with the other studies, the CCLE study had a strikingly higher
   percentage of variance explained by transcriptomic reproducibility
   (40%). This is presumably because there is a large overlap in the set
   of samples used to compute the transcriptomic reproducibility and the
   CCLE mRNA-protein correlation, unlike the other studies. For the CCLE,
   the variance explained by mRNA-mRNA reproducibility is higher than the
   variance explained by protein-protein reproducibility. However, the
   mRNA-mRNA reproducibility was estimated using a much higher number of
   cell lines (382 versus 18 for protein-protein reproducibility), which
   we reasoned could explain the increased variance explained. To test
   this hypothesis, we downsampled the available transcriptomic data to
   make the comparison more equal (sampling 18 cell lines with
   transcriptomes at random; [179]STAR Methods). We found that, using this
   approach, the protein-protein reproducibility explained more of the
   mRNA-protein variability than the mRNA-mRNA reproducibility (on
   average, ∼2.8 times). This suggests that protein-protein
   reproducibility may influence mRNA-protein correlation more than
   mRNA-mRNA reproducibility does but that 18 cell lines is not sufficient
   to obtain a robust estimate of protein-protein reproducibility.

   The Spearman correlation between aggregated protein reproducibility and
   CCLE transcriptomic reproducibility is 0.37 across 4,795 proteins. This
   suggests that there is some agreement between the reproducibility of
   proteins and transcripts and that, to some extent, proteins that are
   reproducibly measured are encoded by transcripts that are more
   reproducibly measured. To assess if both mRNA and protein
   reproducibility independently contribute to the variability of
   mRNA-protein correlation across all 13 studies, we used a linear model
   with the two factors as independent variables and mRNA-protein
   correlation as the dependent variable. We found that in all cases, the
   two factors together explained a higher proportion of variance than
   either factor alone (p < 0.001, likelihood ratio test). In the case of
   the CCLE study (used to calculate the mRNA reproducibility and one of
   the three studies used to calculate protein reproducibility), the two
   factors together explained 48% of the variance. For the 12 other
   studies, the two factors together explained ∼14%–26% of the variance
   ([180]Figure 6B). These observations suggest that the reproducibility
   in transcriptomic and proteomic data contribute strongly and somewhat
   independently to the variability observed in mRNA-protein correlation.

Metabolic pathways with higher-than-average mRNA-protein correlations may
reflect differential reproducibility rather than differential
post-transcriptional regulation

   Previous work has found that certain pathways and processes are
   enriched in proteins that have higher- or lower-than-average
   mRNA-protein correlations. For instance, ribosomal subunits have been
   found to have consistently lower-than-average mRNA-protein correlations
   across multiple studies ([181]Clark et al., 2019; [182]Mertins et al.,
   2016; [183]Zhang et al., 2014, [184]2016), while members of pathways
   related to amino acid metabolism have been found to have
   higher-than-average mRNA-protein correlation ([185]Clark et al., 2019;
   [186]Huang et al., 2021; [187]Jarnuczak et al., 2021; [188]Mertins
   et al., 2016; [189]Zhang et al., 2014, [190]2016). This variation
   across functional groups has been attributed to differential
   post-transcriptional regulation. However, our observation that both
   protein-protein measurement reproducibility and mRNA-mRNA measurement
   reproducibility contribute significantly to the variation in
   mRNA-protein correlation across genes suggests an alternative
   explanation—some pathways may have higher- or lower-than-average
   mRNA-protein correlations simply because their component proteins are
   more reproducibly measured. To test this hypothesis, we first performed
   pathway enrichment analysis on the mRNA-protein correlations from the
   CCLE and ovarian datasets ([191]STAR Methods; [192]Figures 7 and
   [193]S7). Consistent with previous studies, we observed that proteins
   with high mRNA-protein correlations are enriched in gene sets involved
   in environmental information processing and metabolic pathways, while
   proteins with low mRNA-protein correlations are enriched in annotations
   related to housekeeping protein complexes ([194]Figure 7;
   [195]Tables S3 and [196]S4). To assess whether these enrichments could
   simply be attributed to variable reproducibility, we next performed
   pathway enrichment analysis on the CCLE and ovarian mRNA-protein
   correlation data after accounting for variation in protein-protein and
   mRNA-mRNA reproducibility ([197]STAR Methods). We found in both studies
   that the “housekeeping” protein complexes were still identified as
   being enriched among proteins with lower-than-average mRNA-protein
   correlations but that the metabolic pathways were no longer enriched in
   proteins with higher-than-average mRNA-protein correlations
   ([198]Figures 7 and [199]S7; [200]Tables S3 and [201]S4). Other
   pathways with higher-than-average mRNA-protein correlations related to
   environmental information processing were also no longer significant
   after adjusting for reproducibility. This suggests that while large
   housekeeping protein complexes such as the ribosome have
   lower-than-average mRNA-protein correlation that may be attributed to
   post-transcriptional mechanisms, the higher-than-average mRNA-protein
   correlation previously observed for metabolic pathways may simply
   reflect more reproducible measurements of their constituent proteins
   and transcripts.

Figure 7.

   [202]Figure 7
   [203]Open in a new tab

   Metabolic pathways with higher-than-average mRNA-protein correlations
   may reflect differential reproducibility

   Bar charts displaying the KEGG pathway enrichment analysis of the CCLE
   mRNA-protein correlation before (left) and after (right) accounting for
   protein-protein and mRNA-mRNA reproducibility. The −log[10] of
   Benjamini-Hochberg false discovery rate (FDR)-corrected p values
   calculated using Mann-Whitney U test is used to assess enrichment for
   the pathway. For each bar chart, the gray line indicates the threshold
   considered for significant enrichment (FDR < 0.05). If the enrichment
   is below the threshold, then it is not considered significant. The bars
   are colored orange if the median mRNA-protein correlation of genes
   within the pathway is greater than the median mRNA-protein correlation
   of genes not in the pathway; otherwise, the bars are colored blue.

Discussion

   Here, we have demonstrated that the reproducibility of protein and
   transcript measurements is a very significant factor in the observed
   correlations between mRNA and protein abundances. After taking this
   into account, we found that some pathways previously identified as
   having a high mRNA-protein correlation are likely just more
   reproducibly measured. We therefore suggest that conclusions about
   functional groups with higher or lower mRNA-protein correlations,
   especially with regard to the potential role played by
   post-transcriptional regulation, should be made only after accounting
   for variation in the measurement reproducibility of their constituent
   proteins. To this end, we have generated an aggregate protein
   reproducibility rank for each protein that can explain a significant
   amount of the variance across multiple proteogenomic studies and that
   may be useful for identifying those proteins that can be reliably and
   reproducibly measured by mass spectrometry. Such proteins may be more
   useful to assay in, e.g., diagnostic panels.

   Recently, there have been a number of attempts to predict protein
   abundances from transcriptomic data that have achieved modest success
   ([204]Barzine et al., 2020; [205]Fortelny et al., 2017; [206]Li et al.,
   2019; [207]Yang et al., 2020). We found here that proteins that are
   more reproducibly measured across experimental replicates are better
   predicted using machine-learning. This suggests that one of the factors
   limiting the accuracy of machine-learning methods to predict protein
   abundances is that the protein abundance measurements themselves are
   not reproducible. It may therefore be worth evaluating future methods
   on the subset of proteins that can be reproducibly measured.

Limitations of the study

   Our emphasis here has been on understanding how variability in the
   measurements of individual proteins can influence the mRNA-protein
   correlations observed in published tumor proteogenomic studies. We have
   shown that proteins/transcripts that are more reproducibly measured
   tend to have higher mRNA-protein correlations, and we have identified a
   number of factors (e.g., protein abundance) that influence variation in
   measurement reproducibility. There are of course additional factors
   that influence the global reproducibility of proteomes and
   transcriptomes quantified from “replicates” of the same sample. These
   include real biological variation (e.g., tumor heterogeneity resulting
   in two samples of the same tumor having different profiles) and
   technical variation (e.g., variation in sample preparation between
   different runs of the same sample). We have not been able to address
   how much of the variance in the measurements of individual proteins can
   be attributed to these global factors. It is likely that reducing these
   sources of global variation, e.g., through automated sample
   preparation, will improve the overall reproducibility of protein
   measurements. We note also that our analyses do not reflect the best
   possible reproducibility of proteomic and transcriptomic measurements,
   but rather they reflect the reproducibility observed in existing
   large-scale proteogenomic datasets. Indeed, we see that more recent
   proteogenomic studies have higher mRNA-protein correlations, suggesting
   that methodological improvements are already reducing the sources of
   noise in these approaches.

   Our results from analyzing the CCLE dataset, where the replicate
   correlation is highest, give what is likely the most realistic
   assessment of proteomic reproducibility using modern MS pipelines. The
   replicates in the CCLE study were generated by the same lab, using the
   same methodology, 1 year apart. In contrast, the ovarian cancer study
   contains replicates generated in different labs (introducing
   significant measurement heterogeneity), while the colon cancer study
   makes use of replicates generated using two entirely different MS
   approaches (label free versus TMT). Consequently, these likely
   represent lower-bound estimates of the reproducibility that can be
   observed using modern MS proteomic pipelines. Nonetheless, they likely
   reasonably approximate the non-biological heterogeneity observed
   between transcriptomes and proteomes in the studies analyzed, where
   mRNA and proteins are quantified separately using orthogonal
   techniques.

   In the case of the CCLE data, we have a small number (18) of samples
   with replicate proteomic profiles available and a larger number (382)
   with replicate transcriptomes. However, only 8 samples have both
   replicate transcriptomes and replicate proteomes. Having a larger
   number of samples with both replicate proteomes and replicate
   transcriptomes would allow us to better estimate the actual correlation
   between mRNA and protein across samples after correction for
   measurement error, as has previously been done to estimate the true
   "within sample" mRNA-protein correlation in yeast ([208]Csárdi et al.,
   2015; [209]Spearman, 1904).

   Here, we have shown that a number of factors measured using the GTEx
   dataset, including measured protein abundance, measured protein
   variance, and measured unique peptides, are associated with proteomic
   reproducibility in cancer studies. Across studies, a model that
   incorporates all three factors outperformed models using each variable
   alone. However, this may be because all three represent imperfect
   measurements of the same underlying variable—real average protein
   abundance. Previous work has demonstrated that statistical modeling
   that integrates multiple mRNA and protein datasets and explicitly takes
   into account different sources of noise and error can be used to
   provide improved estimates of mRNA-protein correlation within samples
   ([210]Csárdi et al., 2015). As additional studies with proteomic
   replicates and transcriptomic replicates become available, it may be
   possible to develop improved models that provide more reliable
   estimates of protein reproducibility and the factors that influence it.
   Such estimates could be improved through the incorporation of
   additional estimates of average protein abundance and variation (e.g.,
   from [211]Wang et al., 2019).

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   cBioPortal [212]Cerami et al. (2012); [213]Gao et al. (2013)
   [214]https://www.cbioportal.org/; RRID:[215]SCR_014555
   Cancer Dependency Map (DepMap) 20Q4 [216]Ghandi et al. (2019)
   [217]https://depmap.org/portal/ccle/;
   [218]https://figshare.com/articles/dataset/DepMap_20Q4_Public/13237076;
   RRID:[219]SCR_017655
   LinkedOmics [220]Vasaikar et al. (2018)
   [221]http://www.linkedomics.org/
   CPTAC Python API [222]Lindgren et al. (2021)
   [223]https://pypi.org/project/cptac/
   CORUM 3.0 [224]Giurgiu et al. (2019)
   [225]http://mips.helmholtz-muenchen.de/corum/; RRID:[226]SCR_002254
   KEGG Pathway [227]Kanehisa (2019); [228]Kanehisa and Goto (2000);
   [229]Kanehisa et al. (2021)
   [230]https://www.genome.jp/kegg/pathway.html; RRID:[231]SCR_018145
   Colorectal cancer transcriptomics [232]Cancer Genome Atlas Network
   (2012)
   [233]https://cbioportal-datahub.s3.amazonaws.com/coadread_tcga_pub.tar.
   gz
   Colorectal cancer proteomics [234]Zhang et al. (2014) Published
   supplemental Table S4
   Ovarian cancer transcriptomics [235]Cancer Genome Atlas Research
   Network (2011)
   [236]http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/OV/20
   160128/gdac.broadinstitute.org_OV.mRNA_Preprocess_Median.Level_3.201601
   2800.0.0.tar.gz
   Ovarian cancer proteomics [237]Zhang et al. (2016) Published
   supplemental Table S2
   Breast Cancer (2016) transcriptomics [238]Ciriello et al. (2015)
   [239]https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.
   gz
   Breast Cancer (2016) proteomics [240]Mertins et al. (2016) Published
   supplemental Table S3
   Colon Cancer [241]Vasaikar et al. (2019)
   [242]http://linkedomics.org/cptac-colon/
   Clear cell renal carcinoma [243]Clark et al. (2019)
   [244]https://pypi.org/project/cptac/
   Breast Cancer (2020) [245]Krug et al. (2020)
   [246]https://pypi.org/project/cptac/
   Endometrial Cancer [247]Dou et al. (2020)
   [248]https://pypi.org/project/cptac/
   Lung Adenocarcinoma [249]Gillette et al. (2020)
   [250]https://pypi.org/project/cptac/
   Head and Neck Squamous Cell Carcinoma [251]Huang et al. (2021)
   [252]https://pypi.org/project/cptac/
   Glioblastoma [253]Wang et al. (2021)
   [254]https://pypi.org/project/cptac/
   NCI60 cancer cell lines [255]Guo et al. (2019) Published supplemental
   Tables S6 and S1
   Cancer Cell Line Encyclopedia (CCLE) transcriptomics [256]Ghandi et al.
   (2019) [257]https://depmap.org/portal/ccle/; RRID:[258]SCR_013836
   CCLE proteomics [259]Nusinow et al. (2020) Published supplemental
   Tables S2 and S3;
   [260]https://gygi.hms.harvard.edu/publications/ccle.html
   GTEx healthy tissues [261]Jiang et al. (2020) Published supplemental
   Tables S3 and S4
   RNA-seq of 675 commonly used human cancer cell lines [262]Klijn et al.
   (2015) ArrayExpress: E-MTAB-2706
   Protein half-life [263]Zecha et al. (2018) Published supplemental
   Table S3
   NCI CPTAC DREAM Proteogenomics challenge prediction scores of the best
   performing model (Team Guan) [264]Yang et al. (2020)
   [265]https://heidelberg.shinyapps.io/proteoexplorer/
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   All analysis code This study
   [266]https://github.com/cancergenetics/limitations_of_omics_reproducibi
   lity; [267]https://doi.org/10.5281/zenodo.6956546
   Python version 3.8 Python Software Foundation
   [268]https://www.python.org/; RRID:[269]SCR_008394
   Pandas 1.2.5 [270]McKinney (2011) [271]https://pandas.pydata.org/;
   RRID:[272]SCR_018214
   Numpy 1.20.2 [273]Harris et al. (2020) [274]https://numpy.org/;
   RRID:[275]SCR_008633
   StatsModels 0.12.2 [276]Seabold and Perktold (2010)
   [277]https://www.statsmodels.org/stable/index.html;
   RRID:[278]SCR_016074
   SciPy 1.7.1 [279]Virtanen et al. (2020) [280]https://www.scipy.org/;
   RRID:[281]SCR_008058
   Matplotlib 3.3.4 [282]Hunter (2007) [283]https://matplotlib.org/; RRID:
   [284]SCR_008624
   Seaborn 0.11.0 [285]Waskom (2021) [286]https://seaborn.pydata.org/;
   RRID:[287]SCR_018132
   [288]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Colm Ryan
   ([289]colm.ryan@ucd.ie).

Materials availability

   This study did not generate new materials.

Method details

Data collection

   The datasets analysed were downloaded from the links provided in the
   [290]key resources table.

   For studies ([291]Clark et al., 2019; [292]Dou et al., 2020;
   [293]Gillette et al., 2020; [294]Huang et al., 2021; [295]Krug et al.,
   2020; [296]Wang et al., 2021) both the transcriptomic and proteomic
   profiles were obtained from the CPTAC API ([297]Lindgren et al., 2021).
   For colorectal ([298]Zhang et al., 2014) and breast cancer
   ([299]Mertins et al., 2016) studies, the transcriptomic data were
   downloaded from cBioPortal while proteomic data was obtained from the
   supplemental materials. For the ovarian cancer study ([300]Zhang
   et al., 2016), the transcriptomic data were downloaded from the
   [301]https://gdac.broadinstitute.org/ and proteomic data from the
   supplemental materials. For colon cancer ([302]Vasaikar et al., 2019),
   GTEX ([303]Jiang et al., 2020) and NCI60 ([304]Guo et al., 2019) cancer
   cell lines studies, both the transcriptomic and proteomic data were
   obtained from the supplemental tables. For CCLE study, the
   transcriptomic data was downloaded from the cancer dependency map
   portal ([305]https://depmap.org/portal/ccle/) and proteomic data was
   downloaded from the supplemental materials.

Pre-processing proteomic and transcriptomic profiles

   Proteomics and transcriptomics data were obtained from the studies
   listed in the [306]Key resources table. The proteomics datasets
   contained a considerable number of missing values, identified as NaNs
   in most studies or 0s in ([307]Zhang et al., 2014). Within each study
   we restricted our analyses to proteins that were measured in at least
   80% of samples. The same filtering was applied to transcriptomics,
   requiring transcripts to be measured in 80% of samples. In some
   datasets, multiple protein isoforms from the same gene were available,
   we aggregated these using the mean to calculate a ‘gene level’ summary.

   The CCLE study repeatedly profiled two 10-plexes (18 cell lines) one
   year apart in order to assess the reproducibility of the proteomic
   profiling. These replicates are used to perform the assessment of the
   reproducibility of protein measurements presented in [308]Figure 1. In
   addition to these 18 cell lines, 3 cell lines were screened in
   duplicate as part of standard 10-plex runs. As suggested in the CCLE
   guide ([309]Nusinow and Gygi, 2020) for these three cell lines we
   selected the profiles which correlate best with the transcriptomic data
   for our analyses here.

Computation of correlation coefficient

   All data was processed through the standard pipeline described above
   before computing correlation. Correlation between (i) mRNA-protein,
   (ii) protein-protein and (iii) mRNA-mRNA was computed using the
   Spearman rank correlation. For each protein in each study, samples with
   missing values were ignored when computing the correlation.

Assessing proteomic and transcriptomic reproducibility

   The quantitative proteomics of the CCLE ([310]Nusinow et al., 2020)
   data contained three replicates of the proteomic profiles. In the first
   year, 18 cell lines (two 10-plexes) were quantified (R1). The same cell
   lines were quantified twice (R2, R3) the following year. The
   correlation between replicates: R1-R2, R1-R3 and R2-R3 were 0.7, 0.71
   and 0.88 respectively. We chose to use the R1 and R3 proteomic profiles
   to compute the replicate correlation as R1-R3 has the median
   correlation out of the three replicate pairs.

   To assess the reproducibility of transcriptomic data we considered two
   studies that had quantified transcripts in tumour-derived cell lines.
   One of the studies chosen was the CCLE transcriptomic study for which
   we have previously assessed the mRNA-protein correlation. The CCLE
   transcriptomic study ([311]Ghandi et al., 2019) had profiled 1076 and
   ([312]Klijn et al., 2015) had profiled 675 cancer cell lines using
   RNA-Seq. These two studies had quantified the transcripts in different
   labs in different years. However, the two studies had 382 cell lines
   and 13,226 genes in common. The transcriptomic reproducibility was
   computed using the Spearman rank correlation coefficient for the
   transcriptomic measurements across the 382 common cell lines of the
   studies. The standard pipeline for pre-processing was applied before
   assessing the reproducibility of the transcriptomic studies.

   While the CCLE transcriptomic reproducibility was computed using 382
   cell lines, the CCLE proteomic reproducibility was computed using 18
   cell lines only. The common cell lines between the transcriptomic and
   proteomic replicates were <10. Therefore, to compare the predictive
   power of transcriptomic reproducibility and proteomic reproducibility
   in explaining the variation in mRNA-protein correlation of the
   different studies, the transcriptomic reproducibility was computed for
   18 random cell lines over 100 iterations. The transcriptomic
   reproducibility was then used to predict the mRNA-protein correlation
   of the thirteen proteogenomic studies. For each study, the mean R^2
   obtained across all 100 random cell line selections was then used to
   compare the predictive power of transcriptomic reproducibility and
   proteomic reproducibility over the same number of proteins.

Computation of deciles

   Deciles were computed using the pandas qcut method. Each decile
   contains ∼10% of the overall number of items to be stratified. In some
   cases, due to ties, these deciles are not uniformly sized.

Protein complex membership

   Information on protein complex membership was obtained from CORUM
   ([313]Giurgiu et al., 2019) (all complexes data). A protein was marked
   as a protein complex subunit if it is identified in CORUM data.

Protein half-lives

   The half-lives of proteins were obtained from ([314]Zecha et al., 2018)
   study. The median half-life of all proteins from the list was computed.
   Proteins with half-lives > median were encoded to have ‘long’ half-life
   while the others were encoded to have ‘short’ half-life.

Rank aggregation

   For each of the three proteomic studies with replicates (ovarian,
   colon, CCLE) ranks were assigned based on increasing correlation and
   normalized by dividing over the total number of proteins in the
   dataset. Only proteins that were measured in 2 out of the 3 datasets
   were considered for the aggregated list. For proteins measured in only
   2 studies, we imputed the third normalised rank as 0.5. For all
   proteins, we then computed the mean rank as the aggregated rank of the
   protein ([315]Figure S2A).

   We compared the aggregated list of proteins obtained through our method
   of aggregation ([316]Figure S2A) with other aggregated lists which we
   calculated using other algorithms - robust rank aggregation ([317]Kolde
   et al., 2012), Stuart ([318]Stuart et al., 2003), BordaFuse ([319]Aslam
   and Montague, 2001) and, Markov Chain Aggregator (MC4) ([320]Dwork
   et al., 2001). To assess the performance of different aggregation
   methods, we used linear models wherein the mRNA-protein correlation of
   the three studies containing replicate proteomic profiles was regressed
   on the different aggregated lists of protein reproducibility. The
   aggregated list using our ‘average normalized rank’ approach could best
   explain the variation in mRNA-protein correlation in the colorectal
   cancer and CCLE studies, while the BordaFuse method best explained the
   variation in the ovarian cancer study ([321]Figure S2B), followed by
   our approach. As our ‘average normalized rank’ approach overall has the
   highest R-squared, we chose this method to aggregate the correlations
   of proteomic replicate profiles.

Linear regression models

   All linear regression was carried out using the statsmodel package in
   Python.

   Assessing the relationship between protein-protein reproducibility,
   mRNA-mRNA reproducibility, and mRNA-protein correlation
   ([322]Figure 6B)

   To understand the variance in mRNA-protein correlation explained by
   protein-protein reproducibility and mRNA-mRNA reproducibility we used
   three different linear models given by the equations -
     * •
       Protein-protein reproducibility only:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mi>a</mi></msub><mspa
       ce
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mspace width="0.25em"></mspace></mrow> :MATH]
     * •
       mRNA-mRNA reproducibility only:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>m</mi><mrow><mo>(</mo><mi>g</mi><m
       o>)</mo></mrow></mrow> :MATH]
     * •
       Protein-protein reproducibility and mRNA-mRNA reproducibility:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>1</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>p</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>2</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>m</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       </mrow> :MATH]

   where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation for each protein,
   [MATH: <mrow><msub><mi>p</mi><mi>a</mi></msub><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the aggregated protein reproducibility rank for each protein,
   [MATH: <mrow><mi>m</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-mRNA reproducibility of the corresponding transcript of
   each protein and the coefficients
   [MATH: <mrow><mi>α</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>β</mi><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>β</mi><mn>2</mn></msub></mrow> :MATH]
   are computed using the ordinary least squares regression method. For
   all the models, mRNA-protein correlation is assessed over the same set
   of proteins in each study. R^2 is used to assess the predictive power
   of the explanatory variables in explaining the variation of the
   response variable.

   Assessing the relationship between protein-protein reproducibility and
   mRNA-protein correlation ([323]Figure S1A)

   To understand the variance in mRNA-protein correlation explained by
   protein complex membership and protein-protein reproducibility we used
   three different linear models given by the equations -
     * •
       Protein complex membership only:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>p</mi><mi>c</mi><mi>m</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mspace width="0.25em"></mspace></mrow> :MATH]
     * •
       protein-protein reproducibility only:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>p</mi><mrow><mo>(</mo><mi>g</mi><m
       o>)</mo></mrow></mrow> :MATH]
     * •
       Protein complex membership and protein-protein reproducibility:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>1</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>p</mi><mi>c</mi><mi>m</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>2</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mi>p</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       </mrow> :MATH]

   where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation for each protein,
   [MATH: <mrow><mi>p</mi><mi>c</mi><mi>m</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the protein complex membership for each protein,
   [MATH:
   <mrow><mi>p</mi><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
   :MATH]
   is the protein-protein correlation for each protein and the
   coefficients
   [MATH: <mrow><mi>α</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>β</mi><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>β</mi><mn>2</mn></msub></mrow> :MATH]
   are computed using the ordinary least squares regression method. The
   protein complex membership is indicated as 1 if a protein is a protein
   complex member, else 0. For all the models, mRNA-protein correlation is
   assessed over the same set of proteins in each study. R^2 is used to
   assess the predictive power of the explanatory variables in explaining
   the variation of the response variable.

   Assessing the ability of different aggregation approaches to rank
   protein-protein reproducibility ([324]Figure S2B)

   To identify the best aggregation method for protein-protein
   reproducibility, we compared the variance in mRNA-protein correlation
   explained by different aggregation methods using linear models given by
   the equations -
     * •
       Robust rank aggregation:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>r</mi><mi>r<
       /mi><mi>a</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></m
       row></mrow> :MATH]
     * •
       Stuart aggregation method:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>s</mi><mi>t<
       /mi><mi>u</mi><mi>a</mi><mi>r</mi><mi>t</mi></mrow></msub><mrow><mo
       >(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]
     * •
       BordaFuse aggregation method:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>b</mi><mi>f<
       /mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow
       > :MATH]
     * •
       Markov chain aggregator 4:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>m</mi><mi>c<
       /mi><mn>4</mn></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></m
       row></mrow> :MATH]
     * •
       Average normalized rank:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mi>a</mi></msub><mrow
       ><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]

   where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation for each protein,
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>r</mi><mi>r</mi><mi>a</mi></mrow></msub
   ><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>s</mi><mi>t</mi><mi>u
   </mi><mi>a</mi><mi>r</mi><mi>t</mi></mrow></msub><mrow><mo>(</mo><mi>g<
   /mi><mo>)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>b</mi><mi>f</mi></mro
   w></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>m</mi><mi>c</mi><mn>4
   </mn></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
   :MATH]
   and
   [MATH:
   <mrow><msub><mi>p</mi><mi>a</mi></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   are the aggregated protein reproducibility ranks computed using robust
   rank aggregation, Stuart, BordaFuse, Markov chain aggregator 4 and
   average normalized ranks respectively for each protein. The
   coefficients
   [MATH: <mrow><mi>α</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>β</mi></mrow> :MATH]
   are computed using the ordinary least squares regression method. For
   all the models, mRNA-protein correlation is assessed over the same set
   of proteins in each study. R^2 is used to assess the predictive power
   of the explanatory variables in explaining the variation of the
   response variable.

   Comparing the ability if aggregated rank reproducibility to predict
   mRNA-protein correlation compared to reproducibility calculated in
   individual studies ([325]Figure S3)

   For each study, we compared four different models given by the
   equations -
     * •
       Ovarian protein reproducibility rank:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>o</mi><mi>v<
       /mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi></mrow></msub
       ><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
       width="0.25em"></mspace></mrow> :MATH]
     * •
       CCLE protein reproducibility rank:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>c</mi><mi>c<
       /mi><mi>l</mi><mi>e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo
       >)</mo></mrow></mrow> :MATH]
     * •
       Colon protein reproducibility rank:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>c</mi><mi>o<
       /mi><mi>l</mi><mi>o</mi><mi>n</mi></mrow></msub><mrow><mo>(</mo><mi
       >g</mi><mo>)</mo></mrow></mrow> :MATH]
     * •
       Aggregated protein reproducibility rank:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mi>a</mi></msub><mrow
       ><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]

   where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation for each protein,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>o</mi><mi>v</mi><mi>a
   </mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi></mrow></msub><mrow><mo>(<
   /mo><mi>g</mi><mo>)</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>c</mi><mi>c</mi><mi>l
   </mi><mi>e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow
   ><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>c</mi><mi>o</mi><mi>l
   </mi><mi>o</mi><mi>n</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>p</mi><mi>a</mi></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   are the aggregated protein reproducibility computed using the ovarian,
   CCLE and colon proteomic replicates individually and collectively
   respectively for each protein. The coefficients α and β are computed
   using the ordinary least squares regression method. For all the models,
   mRNA-protein correlation is assessed over the same set of proteins in
   each study. R^2 is used to assess the predictive power of the
   explanatory variables in explaining the variation of the response
   variable.

   Assessing the impact of protein measurement reproducibility on the
   accuracy of machine learning prediction of protein abundance
   ([326]Figure S4C)

   To understand the variation in protein prediction scores that can be
   explained by protein-protein reproducibility, we compared four
   different models on prediction scores of breast and ovarian tumour
   studies given by the equations -
     * •
       Ovarian protein reproducibility rank:
       [MATH:
       <mrow><msub><mi>p</mi><mrow><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi
       ><mi>e</mi><mi>s</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
       /mo></mrow><mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>o</mi><mi>v<
       /mi><mi>a</mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi></mrow></msub
       ><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]
     * •
       CCLE protein reproducibility rank:
       [MATH:
       <mrow><msub><mi>p</mi><mrow><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi
       ><mi>e</mi><mi>s</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
       /mo></mrow><mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mspace
       width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>c</mi><mi>c</mi><
       mi>l</mi><mi>e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</m
       o></mrow></mrow> :MATH]
     * •
       Colon protein reproducibility rank:
       [MATH:
       <mrow><msub><mi>p</mi><mrow><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi
       ><mi>e</mi><mi>s</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
       /mo></mrow><mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>c</mi><mi>o<
       /mi><mi>l</mi><mi>o</mi><mi>n</mi></mrow></msub><mrow><mo>(</mo><mi
       >g</mi><mo>)</mo></mrow></mrow> :MATH]
     * •
       Aggregated protein reproducibility rank:
       [MATH:
       <mrow><msub><mi>p</mi><mrow><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi
       ><mi>e</mi><mi>s</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
       /mo></mrow><mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><mspace
       width="0.25em"></mspace><msub><mi>p</mi><mi>a</mi></msub><mrow><mo>
       (</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]

   where
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi
   >e</mi><mi>s</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mr
   ow></mrow> :MATH]
   is the prediction score that is the Pearson correlation between the
   predicted and actual protein abundance value obtained from the best
   predicting model in NCI CPTAC Proteogenomics DREAM challenge,
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>o</mi><mi>v</mi><mi>a</mi><mi>r</mi><mi
   >i</mi><mi>a</mi><mi>n</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>
   )</mo></mrow><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>c</mi><mi>c</mi><mi>l
   </mi><mi>e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow
   ><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>c</mi><mi>o</mi><mi>l
   </mi><mi>o</mi><mi>n</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>p</mi><mi>a</mi></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   are the aggregated protein reproducibility computed using the ovarian,
   CCLE and colon proteomic replicates individually and collectively
   respectively for each protein. The coefficients
   [MATH: <mrow><mi>α</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>β</mi></mrow> :MATH]
   are computed using the ordinary least squares regression method. For
   all the models, protein reproducibility rank is assessed over the same
   set of proteins in each study. R^2 is used to assess the predictive
   power of the explanatory variables in explaining the variation of the
   response variable.

   Assessing the impact of mRNA abundance, mRNA variance on the
   reproducibility of transcripts ([327]Figure S5E)

   To understand the variation in mRNA reproducibility explained by the
   potential factors (mRNA abundance, mRNA variance), we used two
   different linear models given by the equations -
     * •
       mRNA abundance only:
       [MATH: <mrow><mi>t</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>m</mi><mrow><mi>a</mi><mi>b<
       /mi><mi>u</mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi
       >e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></m
       row> :MATH]
     * •
       mRNA variance only:
       [MATH: <mrow><mi>t</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>m</mi><mrow><mi>v</mi><mi>a<
       /mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></m
       row></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
       :MATH]

   where
   [MATH: <mrow><mi>t</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><msp
   ace width="0.25em"></mspace></mrow> :MATH]
   is the transcript reproducibility correlation for each transcript,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>m</mi><mrow><mi>a</mi><mi>b</mi><mi>u
   </mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow
   ></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
   width="0.25em"></mspace></mrow> :MATH]
   is the mRNA mean abundance for each transcript obtained from CCLE
   transcriptomic data,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>m</mi><mrow><mi>v</mi><mi>a</mi><mi>r
   </mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow></msub><m
   row><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]
   is the variance of the mRNA abundance for each transcript obtained from
   CCLE transcriptomic data and the coefficients α and β are computed
   using the ordinary least squares regression method.

   Assessing the impact of protein abundance, protein variance, unique
   peptides, protein half-lives and aggregated protein reproducibility on
   mRNA-protein correlation ([328]Figure S6)

   To understand the variance in mRNA-protein correlation explained by the
   factors (protein abundance, protein variance, unique peptides, and
   protein half-lives) influencing protein reproducibility, we used two
   different linear models given by the equations -
     * •
       Other factors ():

   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="badbreak">=</mo><mi>α</mi><mo
   linebreak="goodbreak">+</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub><mo
   linebreak="goodbreak">∗</mo><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>a</mi><mi>b</mi><mi>u
   </mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow
   ></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak">+</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo
   linebreak="goodbreak">∗</mo><msub><mi>p</mi><mrow><mi>v</mi><mi>a</mi><
   mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow></msu
   b><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak">+</mo><msub><mi>β</mi><mn>3</mn></msub><mo
   linebreak="goodbreak">∗</mo><msub><mi>p</mi><mrow><mi>p</mi><mi>e</mi><
   mi>p</mi><mi>t</mi><mi>i</mi><mi>d</mi><mi>e</mi><mi>s</mi></mrow></msu
   b><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak">+</mo><msub><mi>β</mi><mn>4</mn></msub><mo
   linebreak="goodbreak">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><
   mi>l</mi><mi>f</mi><mo
   linebreak="badbreak">−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>
   s</mi><mo
   linebreak="badbreak">−</mo><mi>l</mi><mi>o</mi><mi>n</mi><mi>g</mi></mr
   ow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak">+</mo><msub><mi>β</mi><mn>5</mn></msub><mo
   linebreak="goodbreak">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><
   mi>l</mi><mi>f</mi><mo
   linebreak="badbreak">−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>
   s</mi><mo
   linebreak="badbreak">−</mo><mi>s</mi><mi>h</mi><mi>o</mi><mi>r</mi><mi>
   t</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
   :MATH]
     * •
       Aggregated protein reproducibility:
       [MATH: <mrow><mi>c</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mi>a</mi></msub><mrow
       ><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]

   where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation for each protein,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>a</mi><mi>b</mi><mi>u
   </mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow
   ></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
   width="0.25em"></mspace></mrow> :MATH]
   is the protein abundance for each protein obtained from the GTEx
   project,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>v</mi><mi>a</mi><mi>r
   </mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow></msub><m
   row><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]
   is the variance of the protein abundance for each protein obtained from
   the GTEx project,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>p</mi><mi>e</mi><mi>p
   </mi><mi>t</mi><mi>i</mi><mi>d</mi><mi>e</mi><mi>s</mi></mrow></msub><m
   row><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
   width="0.25em"></mspace></mrow> :MATH]
   is the number of unique peptides for each protein obtained from the
   GTEx project,
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><mi>l</mi><mi>f</mi><mo
   >−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>s</mi><mo>−</mo><mi>
   l</mi><mi>o</mi><mi>n</mi><mi>g</mi></mrow></msub><mrow><mo>(</mo><mi>g
   </mi><mo>)</mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><mi>l</mi><mi>f</mi><mo
   >−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>s</mi><mo>−</mo><mi>
   s</mi><mi>h</mi><mi>o</mi><mi>r</mi><mi>t</mi></mrow></msub><mrow><mo>(
   </mo><mi>g</mi><mo>)</mo></mrow><mspace width="0.25em"></mspace></mrow>
   :MATH]
   are the half-lives of each protein (long and short),
   [MATH:
   <mrow><msub><mi>p</mi><mi>a</mi></msub><mrow><mo>(</mo><mi>g</mi><mo>)<
   /mo></mrow></mrow> :MATH]
   are the aggregated protein reproducibility computed using the ovarian,
   CCLE and colon proteomic replicates individually and collectively
   respectively for each protein and the coefficients:
   [MATH: <mrow><mspace
   width="0.25em"></mspace><mi>α</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>β</mi><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub><mo>,</mo><msp
   ace
   width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo>,</mo><msp
   ace
   width="0.25em"></mspace><msub><mi>β</mi><mn>3</mn></msub><mo>,</mo><msp
   ace width="0.25em"></mspace><msub><mi>β</mi><mn>4</mn></msub></mrow>
   :MATH]
   and
   [MATH: <mrow><msub><mi>β</mi><mn>5</mn></msub></mrow> :MATH]
   are computed using the ordinary least squares regression method. For
   all the models, mRNA-protein correlation is assessed over the same set
   of proteins in each study. R^2 is used to assess the predictive power
   of the explanatory variables in explaining the variation of the
   response variable.

Rank regression

   We used rank regression to assess the contribution of various factors
   (protein abundance, unique peptides, and protein half-lives) to
   explaining the variance in protein measurement reproducibility. We
   assessed both the aggregated ranks and the reproducibility measured in
   each individual study. We converted the protein reproducibility
   measurements from the three studies with replicates (ovarian, colon,
   CCLE) to ranks.

   The potential factors such as protein abundance and unique peptides had
   a large range, therefore both the factors were log transformed and
   linear regression was performed.

   Assessing the impact of protein abundance, protein variance, unique
   peptides, protein half-lives on the reproducibility of proteins
   ([329]Figure S5D)

   To understand the variance in protein reproducibility ranks explained
   by the potential factors (protein abundance, protein variance, unique
   peptides, and protein half-lives), we used four different linear models
   given by the equations -
     * •
       Protein abundance only:
       [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>a</mi><mi>b<
       /mi><mi>u</mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi
       >e</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></m
       row> :MATH]
     * •
       Protein variance only:
       [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>v</mi><mi>a<
       /mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></m
       row></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
       :MATH]
     * •
       Unique peptides only:
       [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mi>β</mi><mspace
       width="0.25em"></mspace><mo linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>p</mi><mi>e<
       /mi><mi>p</mi><mi>t</mi><mi>i</mi><mi>d</mi><mi>e</mi><mi>s</mi></m
       row></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow>
       :MATH]
     * •
       Protein half-lives encoded as long and short:
       [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>1</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a<
       /mi><mi>l</mi><mi>f</mi><mo>−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi
       >e</mi><mi>s</mi><mo>−</mo><mi>l</mi><mi>o</mi><mi>n</mi><mi>g</mi>
       </mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
       linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>2</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a<
       /mi><mi>l</mi><mi>f</mi><mo>−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi
       >e</mi><mi>s</mi><mo>−</mo><mi>s</mi><mi>h</mi><mi>o</mi><mi>r</mi>
       <mi>t</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       </mrow> :MATH]
     * •
       Protein abundance, unique peptides and protein half-lives combined:
       [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
       width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       <mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>α</mi><mo linebreak="goodbreak"
       linebreakstyle="after">+</mo><mspace
       width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub><mo
       linebreak="goodbreak" linebreakstyle="after">∗</mo><mspace
       width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>a</mi><mi>b</mi><
       mi>u</mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</m
       i></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
       linebreak="goodbreak" linebreakstyle="after">+</mo><mspace
       width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>v</mi><mi>a<
       /mi><mi>r</mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></m
       row></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
       linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>3</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>p</mi><mi>e<
       /mi><mi>p</mi><mi>t</mi><mi>i</mi><mi>d</mi><mi>e</mi><mi>s</mi></m
       row></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
       linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>4</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a<
       /mi><mi>l</mi><mi>f</mi><mo>−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi
       >e</mi><mi>s</mi><mo>−</mo><mi>l</mi><mi>o</mi><mi>n</mi><mi>g</mi>
       </mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
       linebreak="goodbreak"
       linebreakstyle="after">+</mo><msub><mi>β</mi><mn>5</mn></msub><mo
       linebreak="goodbreak"
       linebreakstyle="after">∗</mo><msub><mi>p</mi><mrow><mi>h</mi><mi>a<
       /mi><mi>l</mi><mi>f</mi><mo>−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi
       >e</mi><mi>s</mi><mo>−</mo><mi>s</mi><mi>h</mi><mi>o</mi><mi>r</mi>
       <mi>t</mi></mrow></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow>
       </mrow> :MATH]

   where
   [MATH: <mrow><mi>r</mi><mi>a</mi><mi>n</mi><mi>k</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><msp
   ace width="0.25em"></mspace></mrow> :MATH]
   is the protein reproducibility rank for each protein,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>a</mi><mi>b</mi><mi>u
   </mi><mi>n</mi><mi>d</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow
   ></msub><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
   width="0.25em"></mspace></mrow> :MATH]
   is the protein abundance for each protein obtained from the GTEx
   project,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>v</mi><mi>a</mi><mi>r
   </mi><mi>i</mi><mi>a</mi><mi>n</mi><mi>c</mi><mi>e</mi></mrow></msub><m
   row><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mrow> :MATH]
   is the variance of the protein abundance for each protein obtained from
   the GTEx project,
   [MATH: <mrow><mspace
   width="0.25em"></mspace><msub><mi>p</mi><mrow><mi>p</mi><mi>e</mi><mi>p
   </mi><mi>t</mi><mi>i</mi><mi>d</mi><mi>e</mi><mi>s</mi></mrow></msub><m
   row><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mspace
   width="0.25em"></mspace></mrow> :MATH]
   is the number of unique peptides for each protein obtained from the
   GTEx project,
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><mi>l</mi><mi>f</mi><mo
   >−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>s</mi><mo>−</mo><mi>
   l</mi><mi>o</mi><mi>n</mi><mi>g</mi></mrow></msub><mrow><mo>(</mo><mi>g
   </mi><mo>)</mo></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>p</mi><mrow><mi>h</mi><mi>a</mi><mi>l</mi><mi>f</mi><mo
   >−</mo><mi>l</mi><mi>i</mi><mi>v</mi><mi>e</mi><mi>s</mi><mo>−</mo><mi>
   s</mi><mi>h</mi><mi>o</mi><mi>r</mi><mi>t</mi></mrow></msub><mrow><mo>(
   </mo><mi>g</mi><mo>)</mo></mrow><mspace width="0.25em"></mspace></mrow>
   :MATH]
   are the half-lives of each protein (long and short) and the
   coefficients
   [MATH: <mrow><mi>α</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>β</mi><mo>,</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>1</mn></msub><mo>,</mo><msp
   ace
   width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo>,</mo><msp
   ace
   width="0.25em"></mspace><msub><mi>β</mi><mn>3</mn></msub><mo>,</mo><msp
   ace width="0.25em"></mspace><msub><mi>β</mi><mn>4</mn></msub></mrow>
   :MATH]
   and
   [MATH: <mrow><msub><mi>β</mi><mn>5</mn></msub></mrow> :MATH]
   are computed using the ordinary least squares regression method. For
   all the models, protein reproducibility is assessed over the same set
   of proteins in each study. R^2 is used to assess the predictive power
   of the explanatory variables in explaining the variation of the
   response variable.

Pathway enrichment analysis

   Pathway enrichment analysis was performed using the Mann-Whitney U
   test. Firstly, the KEGG pathways ([330]Kanehisa et al., 2021) and their
   associated genes for Homo sapiens were downloaded using the KEGG API
   ([331]https://www.kegg.jp/kegg/rest/keggapi.html). Only KEGG pathways
   with more than 3 genes with measured correlations were included for the
   enrichment analysis. The computed mRNA-protein correlations of CCLE and
   ovarian cancer studies were used to rank the proteins. A Mann-Whitney U
   test was performed to assess the rank of each pathway in each dataset
   and p-values obtained were corrected for false discovery rate (FDR)
   using the Benjamini-Hochberg method. For the figures presented in
   [332]Figure 7 and [333]S7 we specifically included pathways which have
   been previously identified as enriched in different cancer studies
   ([334]Clark et al., 2019; [335]Huang et al., 2021; [336]Mertins et al.,
   2016; [337]Zhang et al., 2014, [338]2016). To identify enriched
   pathways after accounting for experimental reproducibility, we
   regressed the CCLE and ovarian mRNA-protein correlation on both
   aggregated protein reproducibility ranks and mRNA-mRNA reproducibility
   correlations, which are based on the equations
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mi>α</mi><mo
   linebreak="goodbreak"
   linebreakstyle="after">+</mo><msub><mi>β</mi><mn>1</mn></msub><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mi>m</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow><msp
   ace width="0.25em"></mspace><mo linebreak="goodbreak"
   linebreakstyle="after">+</mo><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo
   linebreak="goodbreak" linebreakstyle="after">∗</mo><mi>p</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   , where
   [MATH: <mrow><mi>c</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-protein correlation,
   [MATH: <mrow><mi>m</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the mRNA-mRNA reproducibility and
   [MATH: <mrow><mi>p</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mi>g</mi><mo>)</mo></mrow></mr
   ow> :MATH]
   is the protein-protein reproducibility and the coefficients
   [MATH:
   <mrow><mi>α</mi><mo>,</mo><msub><mi>β</mi><mn>1</mn></msub></mrow>
   :MATH]
   and
   [MATH: <mrow><msub><mi>β</mi><mn>2</mn></msub></mrow> :MATH]
   are computed based on the ordinary least squares regression method. The
   residuals obtained from the regression were used to rank the proteins
   in pathway enrichment analysis. The top level categories (e.g.,
   Metabolism, Genetic Information Processing) of the pathways were
   obtained from KEGG and are used to annotate the pathways in
   [339]Figure 7 and [340]S7.

Quantification and statistical analysis

   Statistical analysis is described in the [341]Method details and was
   carried out using Python 3.8, Pandas 1.2.5 ([342]McKinney, 2011), numpy
   1.20.2 ([343]Harris et al., 2020), SciPy 1.7.1 ([344]Virtanen et al.,
   2020) and StatsModels 0.12.2 ([345]Seabold and Perktold, 2010). The
   figures were created with Matplotlib 3.3.4 ([346]Hunter, 2007) and
   Seaborn 0.11.1 ([347]Waskom, 2021).

Acknowledgments