Abstract

   Introducing novel biomarkers for accurately detecting and
   differentiating rheumatoid arthritis (RA) and osteoarthritis (OA) using
   clinical samples is essential. In the current study, we searched for a
   novel data-driven gene signature of synovial tissues to differentiate
   RA from OA patients. Fifty-three RA, 41 OA, and 25 normal
   microarray-based transcriptome samples were utilized. The area under
   the curve random forests (RF) variable importance measurement was
   applied to seek the most influential differential genes between RA and
   OA. Five algorithms including RF, k-nearest neighbors (kNN), support
   vector machines (SVM), naïve-Bayes, and a tree-based method were
   employed for the classification. We found a 16-gene signature that
   could effectively differentiate RA from OA, including TMOD1, POP7,
   SGCA, KLRD1, ALOX5, RAB22A, ANK3, PTPN3, GZMK, CLU, GZMB, FBXL7,
   TNFRSF4, IL32, MXRA7, and CD8A. The externally validated accuracy of
   the RF model was 0.96 (sensitivity = 1.00, specificity = 0.90).
   Likewise, the accuracy of kNN, SVM, naïve-Bayes, and decision tree was
   0.96, 0.96, 0.96, and 0.91, respectively. Functional meta-analysis
   exhibited the differential pathological processes of RA and OA;
   suggested promising targets for further mechanistic and therapeutic
   studies. In conclusion, the proposed genetic signature combined with
   sophisticated classification methods may improve the diagnosis and
   management of RA patients.

   Keywords: rheumatoid arthritis, osteoarthritis, diagnostic biomarker,
   machine learning, meta-analysis, pathway analysis

1. Introduction

   In accordance with the dramatically increased incidence in the older
   population, osteoarthritis (OA) and rheumatoid arthritis (RA) are
   currently among the most common causes of musculoskeletal-related
   chronic disability [[44]1,[45]2]. Depending on the case definition and
   joint sites under study, the prevalence of RA was at 0.5–1.1% while
   that of OA was much more common, ranging from 5% of the hip and 33% of
   the knee to 60% of the hands in adults 65 years of age or older
   [[46]3,[47]4]. RA is a chronic autoimmune disease that exhibits
   persistent synovial and systematic inflammation along with the
   existence of autoantibodies [[48]5]. On the other hand, OA has been
   characterized as a non-inflammatory degenerative joint disease although
   synovial inflammation is a debatably important feature [[49]6,[50]7].
   OA and RA are pathophysiologically different but share similar and
   overlapping features in terms of underlying mechanisms [[51]8,[52]9].
   In addition, early-stage RA appears to be remarkably similar to other
   forms of arthritis, especially OA; thus, more investigations are needed
   to introduce better approaches for differentiating RA and OA [[53]10].

   Although laboratory tests, prominently blood test, have been believed
   to provide the most important information for diagnosis of arthritis,
   30% of RA patients had negative rheumatoid factor or anticyclic
   citrullinated peptide antibodies while 40% got a normal result for
   erythrocyte sedimentation rate or C-reactive protein [[54]11]. The
   values indicate that complementary and alternative methods are needed
   to increase the diagnostic accuracy of RA. Among them, synovial tissue
   has been used for the diagnosis and study of arthritis for decades and
   its substantial impact was recently reviewed [[55]12,[56]13]. Several
   studies have suggested specific immunohistochemical features of
   synovial tissue that separate RA (or OA) from other joint diseases or
   the early stage from the late stage of disease [[57]14,[58]15,[59]16].
   However, a recent study indicated that only a minority of patients
   could have a definitive diagnosis despite the high success rate of
   gathering synovial tissue [[60]17]. This reduces the potential of using
   synovial tissue not only for abnormal detection but also for proper
   guidance of treatments. Moreover, investigations exploring small
   subsets of genes that can be used to classify RA versus OA, RA versus
   healthy persons and OA versus healthy persons using synovial tissues
   are lacking.

   Herein, we searched for genome-wide transcriptional profiles from the
   synovial tissues of RA, OA, and normal synovial samples and looked for
   genetic signatures that were suitable for accurately differentiating RA
   from OA as well as RA from normal tissues and OA from normal tissues.
   The results from this investigation are expected to be helpful in some
   particular populations, such as elder patients with inflammatory
   presentation of OA and patients without typical biomarkers. In
   addition, we applied gene expression meta-analysis and functional
   analysis to explore the different biological processes involved in the
   pathophysiology of RA and OA. The bioinformatics-based suggestive
   mechanisms would be helpful for further mechanistic and therapeutic
   studies. In conclusion, our findings may assist the blood-based
   laboratory tests to improve the accuracy of diagnosis and success rate
   of clinical interventions.

2. Materials and Methods

2.1. Data Sets

   The available Affymetrix and Illumina microarrays on gene expression in
   RA, OA, and healthy control (N) synovial tissues were curated from Gene
   Expression Omnibus with the search terms “Rheumatoid arthritis”,
   “Osteoarthritis”, “Synovial tissue*”, and “Synovial membrane”. As a
   result, the following data sets were included: [61]GSE1919 (3 RA, 3 OA,
   and 5 N), [62]GSE39340 (10 RA and 7 OA), [63]GSE36700 (7 RA and 5 OA),
   [64]GSE55457 (13 RA, 10 OA, and 10 N), [65]GSE55584 (10 RA and 6 OA),
   [66]GSE55235 (10 RA, 10 OA, and 10 N). Synovial tissues mixed from
   different patients in the [67]GSE1919 data set were excluded before
   data processing and analysis. In total, the data sets on RA and OA
   included 94 patients. The comparisons between OA and normal controls
   and between RA and normal controls consisted of 48 and 51 samples,
   respectively. Information on the included data sets is presented in
   [68]Table S1. We then divided the six data sets into two groups: set A
   (for variable selection purpose) consisted of [69]GSE1919,
   [70]GSE39340, and [71]GSE36700; set B (for model fitting purpose)
   consisted of [72]GSE55457, [73]GSE55584, and [74]GSE55235.

2.2. Microarray Data Preprocessing

   Robust multiarray average normalization in affy package version 1.54.0
   was conducted for Affymetrix-based datasets, and robust spline
   normalization in lumi package version 2.28.0 was applied for Illumina
   beadchip arrays [[75]18,[76]19]. The probes were mapped according to
   Entrez Gene ID and official gene nomenclature. Subsequently, the
   individual datasets were merged after removing batch effects by
   empirical the Bayes cross-study normalization method [[77]20].
   Following cross-study normalization, only common genes among different
   platforms, e.g., Affymetrix and Illumina, and datasets remained (8597
   genes).

2.3. Variable Importance Measurement and Selection

   In this study, the area under the curve permutation random forest
   variable importance measurement (AUC-RF VIM) was applied. AUC-RF VIM
   was conducted using party package version 1.2-3 [[78]21]. Only the
   candidates that had a variable importance value higher than our
   arbitrary criterion were selected for the classification tasks.

2.4. Exploratory Data Analysis and Visualization

   We performed the Wilcoxon rank-sum test to compare the differences of
   individual predictors between two groups. Additionally, principal
   component analysis (PCA), a data dimensionality reduction technique,
   was utilized to explore the underlying trends and patterns of data as
   well as to detect possible outliers of two comparative groups [[79]22].
   Using Metaboanalyst 3.0, a heatmap of the centered and scaled data was
   also applied to visualize the differences in gene expression between
   two groups [[80]23]. Violin plots with overlay box plots of the
   predictors were constructed using violin plots with ggpubR version
   0.1.5 [[81]24]. PCA and visualization were performed using FactoMineR
   version 1.35 and factoextra version 1.0.4 [[82]25,[83]26].

2.5. Supervised Machine Learning

   RA and OA were classified using classification and regression training
   (caret) package version 6.0.77 [[84]27]. Random forests (RF), an
   aggregation or ensemble of numerous decision trees, each of which
   utilizes resampled data and branches out based on a randomly chosen
   subset of features, was applied as the main classifier in the current
   study [[85]28]. Four other commonly used machine learning
   classification techniques that belong to different categories of
   algorithms were also employed: k-nearest neighbors (kNN), naïve Bayes,
   support vector machines with polynomial kernel (SVM), and a tree-based
   method (we call it C5.0, named after the R function). Each classifier
   has one or several tuning parameters that need to be optimized. Grid
   search or random search with the tune length of 10 were applied when
   applicable [[86]29]. First, we set the number of trees consisting of
   random forests to 500, which is known to suffice and thus is commonly
   used in practice. Additionally, the number of randomly chosen variables
   at each split was tuned from one to 16 in RF. The naïve Bayes
   classifier was optimized with a Gaussian kernel of data-driven
   bandwidth. The number of closest training examples (k) in kNN was tuned
   to define the value of k that resulted in optimal performance of the
   model. It is well known that the SVM classifier can be found by solving
   a quadratic programming via the dual of the primal problem where the
   cost constant
   [MATH: <mrow><mrow><mi
   mathvariant="normal">C</mi><mo>></mo><mn>0</mn></mrow></mrow> :MATH]
   penalizes the margins. In our analysis, the SVM classifier relied on
   the polynomial kernel represented by
   [MATH: <mrow><mrow><mi
   mathvariant="normal">K</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">x</mi><mo>,</mo><mi
   mathvariant="normal">y</mi></mrow><mo>)</mo></mrow><mo>=</mo><mi>γ</mi>
   <msup><mrow><mrow><mo>(</mo><mrow><msup><mi>x</mi><mi>T</mi></msup><mi>
   y</mi><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow><mi>d</mi></ms
   up><mo>,</mo><mo> </mo><mi>x</mi><mo>,</mo><mi>y</mi><mo> </mo><mo>∈</m
   o><msup><mo>ℝ</mo><mi mathvariant="normal">p</mi></msup></mrow></mrow>
   :MATH]

   where
   [MATH:
   <mrow><mrow><mi>γ</mi><mo>></mo><mn>0</mn><mo>,</mo><mtext> </mtext><mi
   >d</mi><mo>≥</mo><mn>1</mn></mrow></mrow> :MATH]
   are the scale parameter and degree of the polynomial, respectively. The
   tuning parameters
   [MATH: <mrow><mrow><mi
   mathvariant="normal">C</mi><mo>,</mo><mtext> </mtext><mi>γ</mi><mo>,</m
   o><mi>d</mi></mrow></mrow> :MATH]
   were tuned using the random search method. The C5.0 function in R
   supports different tree-based methods; it is, for example, a single
   tree if ‘trials’ is set to one, or a boosting, analogous to AdaBoost
   with some tweaks, otherwise. The model can be further improved by
   considering whether to use a rule-based tree or a classification tree
   and whether to filter unimportant variables in advance or not. The
   aforementioned options were all determined using the random search
   method. Performance indicators of the model were evaluated in the test
   set using accuracy, specificity, and sensitivity. A seed number was
   used prior to the data splitting and model training to obtain
   reproducible results. A receiver operating characteristic (ROC) curve
   of the performance of the optimal RF model on the test set was
   visualized using pROC package version 1.10.0 [[87]30]. Visualization of
   the top 10 important genes of the optimal RF model was performed using
   GraphPad Prism 6 (GraphPad Software Inc., San Diego, CA, USA). In
   addition, a quick assessment of the classification of RA and normal
   tissues as well as OA and normal tissues was conducted using
   ArrayMining [[88]31].

2.6. Random Forests Classification Model Explanation

   We applied local interpretable model-agnostic explanations (R package
   version 0.3.0) in order to explain the prediction decisions of the
   optimal RF classifier regarding the most important features [[89]32]. A
   forward selection algorithm with six features was utilized to select
   the most relevant features. The labels of two were established to
   observe the explanation of RA and OA. Other parameters were kept as
   default.

2.7. Gene Expression Meta-Analysis

   Gene expression meta-analyses between RA and OA, RA and normal tissues,
   and OA and normal tissues were performed by combined effects size. The
   statistical heterogeneity was first estimated by Cochran’s Q test. A
   fixed effects model was used when estimated Q values have approximately
   followed a chi-square distribution; otherwise, a random effects model
   was applied. The analysis was conducted in accordance with the
   published protocol of Networkanalyst [[90]20].

2.8. Pathway Enrichment Analysis and Protein Association Network

   Gene Ontology and gene set enrichment analysis was conducted using the
   STRING database with default settings [[91]33]. All differentially
   expressed genes that had combined effects size greater than 1.5 were
   input to the software for the enrichment analyses using GO biological
   processes and Kyoto Encyclopedia of Genes and Genomes annotation. The
   protein–protein interaction network derived from the STRING
   knowledgebase was also visualized.

2.9. Statistical Significance

   Where applicable, a p-value of 0.05 or a false discovery rate FDR of
   0.05 was used for null-hypothesis testing. The R statistic version
   3.4.2 was used as the environment for data processing, machine learning
   classification and visualization unless specifically indicated
   [[92]34].

2.10. Ethics and Data Availability

   Ethical application has been waived by Seoul National University
   Institutional Review Board (SNU 17-08-047). The datasets analyzed
   during the current study are available in the GEO repository.
   [93]GSE1919, [94]GSE39340, [95]GSE36700, [96]GSE55457, [97]GSE55584,
   [98]GSE55235.

3. Results

3.1. Variable Importance Measurement and Selection

   Error rate-based (ER)-RF variable importance measurement (VIM) has been
   recommended for feature selection in high-dimensional data [[99]35].
   Nevertheless, a previous investigation demonstrated that area under the
   curve (AUC)-RF VIM outperforms ER-RF VIM, especially in unbalanced
   class problems [[100]36]. Hence, we applied AUC-RF for variable
   importance measurement to estimate the score of every gene with the
   corresponding conditions. The workflow of variable selection is shown
   in [101]Figure 1a. The measurement was conducted using cross-study
   normalized data set A derived from three different data sets:
   [102]GSE1919, [103]GSE39340, and [104]GSE36700 ([105]Table S1). These
   available data sets came from various populations (one Asian and two
   European) and several microarray platforms (Affymetrix U95A, Illumina V
   4.0, and Affymetrix 133 Plus 2.0) with a total number of 35 samples
   were used for variable importance measurement. A larger set of data was
   used for modeling and testing (described below). The batch effects
   removed data sets were visualized using principal component analysis
   (PCA) ([106]Figure S1). As a result, 381 genes exhibited an importance
   score to the classification of RA and OA greater than zero. Because the
   package does not provide a handy criterion for selecting the
   candidates, we finally selected 16 genes with a score of 1.5 × 10^−3 or
   higher as potential variables for the classification of RA and OA in
   unseen cross-study normalized data set B ([107]Figure 1b). The
   selection was arbitrary to some extent. However, we intended to select
   the smallest genetic signature that could effectively discriminate RA
   from OA. A redundant list of genetic signatures—e.g., hundreds of
   genes—would make it unimaginable for finding the rationales of machine
   learning-based classification as well as clinical interpretation.
   Detailed information about the proposed genetic biomarkers can be found
   in [108]Table 1.

Figure 1.

   [109]Figure 1
   [110]Open in a new tab

   Workflow of variable selection and machine learning classification. (a)
   Variable selection workflow. (b) Classification analysis workflow. RF:
   random forests; SVM: support vector machines; kNN: k nearest neighbors.
   [111]GSE55457, [112]GSE55584, and [113]GSE55235 are from the same study
   and were used for modeling and validation of the data-driven proposed
   biomarkers.

Table 1.

   The 16-gene signature derived from AUC-RF based variable selection.
   Entry ID Approved Symbol Approved Name Chromosomal Location Combined
   Effects Size ^1
   7111 TMOD1 Tropomodulin 1 9q22.33 −2.16
   10248 POP7 POP7 homolog, ribonuclease P/MRP subunit 7q22.1 −0.87
   6442 SGCA Sarcoglycan alpha 17q21.33 −2.74
   3824 KLRD1 Killer cell lectin like receptor D1 12p13 1.35
   240 ALOX5 Arachidonate 5-lipoxygenase 10q11.21 1.36
   57403 RAB22A RAB22A, member RAS oncogene family 20q13.32 −1.20
   288 ANK3 Ankyrin 3 10q21.2 −1.92
   5774 PTPN3 Protein tyrosine phosphatase, non-receptor type 3 9q31 −1.31
   3003 GZMK Granzyme K 5q11.2 2.92
   1191 CLU Clusterin 8p21.1 −2.17
   3002 GZMB Granzyme B 14q12 2.61
   23194 FBXL7 F-box and leucine rich repeat protein 7 5p15.1 −0.93
   7293 TNFRSF4 TNF receptor superfamily member 4 1p36.33 1.26
   9235 IL32 Interleukin 32 16p13.3 2.73
   439921 MXRA7 Matrix remodeling associated 7 17q25.1 −2.19
   925 CD8A CD8a molecule 2p11.2 2.83
   [114]Open in a new tab

   ^1: Adopted from the the section of functional meta-analysis, which was
   independent to the biomarker selection.

3.2. RA and OA can be Accurately Differentiated Using a 16-Gene Signature

   To validate our 16-gene signature, we gathered three data sets having a
   roughly double number of samples that came from the same study in set B
   ([115]GSE55457, [116]GSE55584, and [117]GSE55235) [[118]9]. It is of
   importance to note that the authors of the previous study built their
   own model and obtained good performance in these data sets, and we
   aimed to show that our signature, derived from a powerful statistical
   learning method, could also achieve comparative results. In set B,
   descriptive statistics and univariate data exploratory analysis using
   the Wilcoxon rank-sum test were performed. There were 59 samples in
   total, of which OA and RA accounted for 26 samples (44.1%) and 33
   samples (55.9%), respectively. The gene expression of all predictors
   was significantly different (false discovery rate (FDR) < 0.05). The
   mean, median, and other values as well as the violin plots of gene
   expression for the 16 predictors can be found in Data S1. Since
   univariate analysis could not assess the correlations among predictors,
   PCA was then conducted to obtain better insights into the data. Our
   signature was actually well adapted to the two merged data sets as
   shown in PCA analysis. The first PC explained more than 50% variance of
   the data, and five major PCs cumulatively explained up to 81.9% of the
   variance in the data. The two classes were relatively well separated in
   2D score plot PCA (PC1 + PC2 = 62.3%). Variables having a strong
   positive or negative effect on PC1 can be considered essential for
   classification. The three attributes with the highest absolute values
   were TMOD1, CLU, and MXRA7 among the positive factors and GZMK, GZMB,
   and CD8A among negative factors. Moreover, as will be shown later in
   this section, CLU and MXRA7 were characterized as important covariates
   in prediction. Despite these clear findings, however, there was an
   overlapping region between the two groups, which might make the
   classification of those cases greatly complicated ([119]Figure 2a).
   Most of the samples in the two groups distributed within 95% confidence
   regions for each group, but there were three potential outliers, one
   from RA and two from OA. Heatmap analysis was also conducted without
   re-organization to observe the contrasting appearance between two
   groups ([120]Figure 2b).

Figure 2.

   [121]Figure 2
   [122]Open in a new tab

   PCA and heatmap analysis of the data set with the 16-gene signature.
   (a) The sum of the two principal component is 62.3%. (b) Seven genes
   are upregulated, and nine genes are downregulated in RA compared to OA.
   PCA: Principal component analysis; RA: Rheumatoid arthritis; OA:
   osteoarthritis.

   There were two relatively clear expression patterns in the heatmap. The
   gene expression levels of ALOX5, TNFRSF4, GZMB, IL32, KLRD1, GZMK, and
   CD8A were generally higher in RA than in OA. In contrast, the gene
   expression levels of RAB22A, PTPN3, FBXL7, MXRA7, POP7, ANK3, SGCA,
   TMOD1, and CLU were generally higher in OA than in RA. For class
   assignment analysis, data set B was split into training (60%) and test
   (40%) sets. Data partition was conducted and the balance of class
   distribution remained between two comparative groups. The training
   process to estimate the optimal RF model was conducted using grid
   search; the mtry parameter ranged from one to 16, and the 10-fold
   cross-validation 5-time repeats method was used. The area under the
   receiver operating characteristic curve (AUCROC) was used to determine
   the optimal model. Eventually, the model that had an mtry of 12
   exhibited an excellent performance (AUCROC = 1.00) in the training set.
   Thereafter, we applied the optimal RF model to the external test set.
   The performance of the RF model was excellent; only one sample was
   misclassified ([123]Figure 3a). RF variable importance measurement
   revealed that IL32, SGCA, MXRA7, and CLU were the four most important
   features of the prediction model ([124]Figure 3b).

Figure 3.

   [125]Figure 3
   [126]Open in a new tab

   Properties of the random forests model. (a) The ROC curve, accuracy,
   sensitivity, and specificity of the external the test set of the
   optimal random forest model. (b) The top 10 most important features of
   the optimal random forests model on the training set.

   Local interpretable model-agnostic explanations (LIME) was applied in
   three representative cases (one correctly predicted OA sample, one
   correctly predicted RA sample, and one OA sample that was mistakenly
   predicted as RA) to demonstrate the rules that were applied to classify
   new observations in the test set of the established model. The first
   case was confidently predicted to be OA (probability of 0.900) with the
   following rules regarding gene expression levels: 8.44 < IL32 ≤ 8.74,
   12.20 < MXRA7 ≤ 12.40, 7.08 < SGCA ≤ 7.64, 13.00 < CLU, GZMB ≤ 6.44,
   and 7.82 < PTPN3 ≤ 7.92 ([127]Figure 4a). The second case, on the other
   hand, was predicted with certainty to be RA (probability of 0.998) with
   the following rules: 9.34 < IL32, SGCA ≤ 6.76, MXRA7 ≤ 12.00, 12.10 <
   CLU ≤ 12.50, 6.77 < GZMB ≤ 7.58, POP7 ≤ 8.62 ([128]Figure 4b). From the
   two typical cases, we show that the higher expression of IL32 and GZMB
   as well as the lower expression of SGCA, MXRA, and CLU were preferred
   for the classification of RA, and vice versa. In addition, the low
   expression of POP7 was a minor supportive factor for the indication of
   RA. However, the contributions of minor supportive or contradictory
   factors were negligible since the roles of top four major factors were
   prominent in all cases. It is important to mention again that IL32 and
   GZMB expression levels were generally higher while MXRA, SGCA, and CLU
   expression levels were generally lower in RA than in OA ([129]Figure
   2b). The third example had high expression of IL32 (9.14) and low
   expression of MXRA7 (11.60). The misclassification of this case from OA
   to RA likely occurred because the expression of IL32 of the sample was
   higher than the upper limit of IL32 in OA (8.74) and the expression of
   MXRA7 was lower than the lower limit of MXRA7 in OA (12.2). This
   violation was too severe because IL32 and MXRA7 were two of the top
   three most important features in the prediction model. Hence, the high
   expression level of SGCA and CLU as well as the low expression level of
   RAB22A, which supported the indication of OA, were not enough for the
   classifier to correctly differentiate the sample ([130]Figure 4c). The
   classification of other samples in the test set could be explained
   using a similar approach as described.

Figure 4.

   [131]Figure 4
   [132]Open in a new tab

   The use of LIME to explain the model′s predictions. (a) Corrected
   classification of an OA sample. (b) Corrected classification of an RA
   sample. (c) Noncorrected classification of a sample from OA to RA. RA:
   Rheumatoid arthritis; OA: osteoarthritis.

   A genetic signature may be more clinically relevant if it can be
   applied by different machine learning classification methods.
   Therefore, we introduced our 16-gene signature to four other commonly
   used classifiers, kNN, naïve Bayes, C5.0, and SVM, to evaluate the
   robustness and flexibility of the signature. Every model was trained
   using 10-fold cross-validation with 5-time repeats and AUCROC was used
   to determine the optimal model. In the kNN model, the data were treated
   by centering and scaling. The training model with a k value of nine was
   detected to be optimal (AUCROC = 0.96). The performance in the test set
   was also excellent (accuracy = 0.96, sensitivity = 0.92, specificity =
   1.00). The naïve Bayes model without kernel achieved the best
   performance (AUCROC = 0.98) which was similar to that of the kNN model
   in the test set (accuracy = 0.96, sensitivity = 0.92, specificity =
   1.00). The training process of the SVM with polynomial kernel produced
   the optimal model (AUCROC = 0.98) with the following tuned parameters:
   degree = 3,
   [MATH: <mrow><mi mathvariant="sans-serif">γ</mi></mrow> :MATH]
   = 0.77,
   [MATH: <mrow><mi mathvariant="normal">C</mi></mrow> :MATH]
   = 0.16. Likewise, this model exhibited excellent performance in the
   test set (accuracy = 0.96, sensitivity = 1.00, specificity = 0.90).
   Finally, the C5.0 model was optimized and validated. In the training
   set, the model with the following parameters was optimal (AUCROC =
   0.93): model = tree, winnow = false, trials = 13. In the test set, it
   had the worst performance among the tested classifiers although the
   accuracy was higher than 0.90 (accuracy = 0.91, sensitivity = 1.00,
   specificity = 0.80). Collectively, we achieved similarly outstanding
   performance regarding the differentiating of RA from OA to the
   rule-based method of Woetzel [[133]9]. Confusion matrices of the
   abovementioned classifiers on the test sets are shown in [134]Figure
   5a.

Figure 5.

   [135]Figure 5
   [136]Open in a new tab

   Summarization of the classification analyses. (a) Confusion matrices of
   the five classifiers on the test set of RA versus OA. (b) Prediction
   performances on the test sets of RA versus normalcy and OA versus
   normalcy. RA: Rheumatoid arthritis; OA: osteoarthritis.

   For a quick investigation to evaluate the possibilities of
   differentiating RA from normal synovial tissues and OA from normal
   synovial tissues, we applied the class assignment analysis module using
   ArrayMining [[137]31]. The training and test sets were defined using a
   60:40 ratios. We achieved encouraging accuracies, sensitivities, and
   specificities on the test sets, as shown in [138]Figure 5b. However,
   the sample sizes of both experiments were small, and the results should
   be validated by future studies.

3.3. Meta-Analysis of Gene Expression and Functional Analysis

   To seek the differences in biological processes of RA, OA, and
   normalcy, we further conducted meta-analysis of all available synovial
   tissue gene expression data followed by pathway enrichment analysis.
   Data from synovial tissues of RA and OA, RA and normalcy, and OA and
   normalcy were introduced to the gene expression meta-analysis to find
   the differentially expressed (DE) genes with high confidence. Cochran’s
   Q test suggested the statistical heterogeneity among the input data
   sets; thus, a random effects model was finally selected for the
   analysis.

   In RA versus OA, 1026 genes were significantly upregulated, and 1421 DE
   genes were significantly downregulated in RA compared to OA. Among
   them, 489 DE genes showed a combined effects size of at least 1.5 or
   higher (236 genes with a combined effect size (cES) of 1.5 or higher in
   the upregulated group and 253 genes with a cES of −1.5 or lower in the
   downregulated group). The three most significantly upregulated genes
   (CD3D, AIM2, and IL2RG) in RA are related to the immune processes. For
   instance, CD3D is one of the genes involved in T cell activation and
   signaling. Other associated genes, such as CD3G, CD8B, and LCK, were
   also found to be highly upregulated [[139]37]. Pathway enrichment
   analysis was conducted separately using the DE genes that had a cES of
   1.5 or higher for upregulated genes and of −1.5 of lower for
   downregulated genes in RA compared to OA. [140]Figure S2 shows the two
   protein–protein interaction networks of RA-OA for upregulated genes and
   downregulated genes, respectively. Particularly, the major enriched
   Gene Ontology (GO) biological processes in upregulated genes belonged
   to various immune responses, including the activation of NF-kB.
   Similarly, the Kyoto Encyclopedia of Genes and Genomes (KEGG) enriched
   pathways include, but not limited to, natural killer cell mediated
   cytotoxicity, cytokine-cytokine receptor interaction, chemokine
   signaling pathway, and T cell receptor signaling pathway, among others.
   On the other hand, the enriched GO terms of the downregulated genes
   reflect the downregulation of cellular development and differentiation.
   However, there were only two enriched KEGG pathways: glycerolipid
   metabolism and transcriptional misregulation in cancer. Significantly
   enriched pathways from the GO and KEGG pathway enrichment analysis are
   summarized in [141]Table 2 and [142]Table S2.

Table 2.

   Representative Gene Ontology (GO) biological processes and Kyoto
   Encyclopedia of Genes and Genomes (KEGG) enriched pathways of
   differentially expressed (DE) genes between RA and OA.
   ID Annotation False Discovery Rate RA versus OA
   GO.0006955 Immune response 8.86 × 10^−50 Upregulation
   GO.0050776 Regulation of immune response 1.44 × 10^−41 Upregulation
   GO.0002376 Immune system process 2.64 × 10^−41 Upregulation
   GO.0006952 Defense response 6.94 × 10^−39 Upregulation
   GO.0002684 Positive regulation of immune system process 4.52 × 10^−36
   Upregulation
   GO.0048731 System development 8.90 × 10^−12 Downregulation
   GO.0007275 Multicellular organismal development 3.68 × 10^−11
   Downregulation
   GO.0044767 Single-organism developmental process 2.74 × 10^−9
   Downregulation
   GO.0048856 Anatomical structure development 1.29 × 10^−8 Downregulation
   GO.0051239 Regulation of multicellular organismal process 5.14 × 10^−8
   Downregulation
   KEGG.4650 Natural killer cell mediated cytotoxicity 1.27 × 10^−16
   Upregulation
   KEGG.5340 Primary immunodeficiency 1.27 × 10^−16 Upregulation
   KEGG.4060 Cytokine-cytokine receptor interaction 3.14 × 10^−16
   Upregulation
   KEGG.4064 NF-kappa B signaling pathway 1.26 × 10^−14 Upregulation
   KEGG.4062 Chemokine signaling pathway 1.33 × 10^−11 Upregulation
   KEGG.561 Glycerolipid metabolism 0.0142 Downregulation
   KEGG.5202 Transcriptional misregulation in cancer 0.0276 Downregulation
   [143]Open in a new tab

   In RA versus normal tissues, 973 genes were significantly upregulated,
   and 690 DE genes were significantly downregulated in RA compared to
   normal synovial tissues. Among them, 539 genes showed a combined
   effects size of at least 1.5 or higher (383 genes with a cES of at
   least 1.5 in the upregulated group and 156 genes with a cES of −1.5 or
   lower in the downregulated group). The top three enriched GO terms
   among the upregulated genes were immune response (GO.0006955), immune
   system process (GO.0002376), and defense response (GO.0006952), while
   the top three enriched GO terms among the downregulated genes were
   regulation of protein metabolic process (GO.0051246), cellular response
   to organic substance (GO.0071310), and positive regulation of
   multicellular organismal process (GO.0051240). Similarly, 879 DE genes
   were upregulated, and 524 DE genes were downregulated in OA compared to
   normal synovial tissues. Among them, 412 DE genes showed a combined
   effects size of at least 1.5 or higher (274 genes with a cES of at
   least 1.5 in the upregulated group and 138 genes with a cES of −1.5 or
   lower in the downregulated group). The top three enriched GO terms
   among the upregulated genes were response to stress (GO.0006950), death
   (GO.0016265), and regulation of biological quality (GO.0065008), while
   the top three enriched GO terms among the downregulated genes were
   positive regulation of metabolic process (GO0009893), positive
   regulation of macromolecule metabolic process (GO.0010604), and
   positive regulation of cellular metabolic process (GO.0031325). The
   protein–protein interaction networks of RA-Normalcy and OA-Normalcy
   were simpler than that of RA-OA. The enriched pathways from GO and KEGG
   analysis of these two networks can be found in [144]Table S3.

4. Discussion

   There are standard approaches for the differential diagnosis of RA and
   OA. Nevertheless, their performance may not be satisfactory in some
   particular cases. For instance, anti-citrullinated protein antibodies
   (ACPAs) have been applied for early diagnosis of RA. Nevertheless, the
   test using ACPAs is only positive in approximately 50% of patients
   [[145]38]. Other biomarkers, therefore, should be developed to assist
   the diagnosis in ACPA-negative patients. Recently, serum connective
   tissue growth factor was reported as a potential diagnostic biomarker
   for RA [[146]39]. In addition, 99mTc-3PRGD2 scintigraphy has been
   recently suggested for the early detection of RA in rats and humans
   [[147]40]. It is also of importance to note that although blood-based
   tests are prominently used, the diagnostic value of synovial biopsy has
   been acknowledged [[148]41,[149]42]. Thus, it could help assist the
   diagnosis RA, OA, as well as other arthritis conditions. However, the
   potential of the transcriptomic signature of synovial tissue in
   differentiating RA and OA remains to be explored. High-throughput gene
   expression has been recognized as an important feature to facilitate
   personalized medicine [[150]43]. High-throughput gene expression has
   been commonly used to gain deeper insights into the cellular processes
   of a specific biological system [[151]44]. Indeed, gene expression
   profiling has been contributing to the advancement of many fields,
   including the management of cancers and other immunological disorders
   [[152]45]. Specifically, the genome-wide alterations in gene expression
   in RA and OA can be detected in advance [[153]3,[154]46]. Recently, a
   method using a transcriptome-based rule set was developed to
   differentiate RA from OA and normal controls; the overall assessment
   parameters of classification were approximately 90% or higher [[155]9].
   Similarly, a 12-gene signature derived from CD4^+ T cells of early
   inflammatory arthritis patients has been shown to accurately predict
   subsequent development of RA from non-RA among undifferentiated
   arthritis with a sensitivity of 68% and specificity of 70% [[156]47].
   These results are encouraging and worth validating on a larger scale to
   assess their reproducibility. Moreover, a wide range of supervised
   machine learning techniques can be employed toward the final goal,
   which is to classify a particular sample. However, small sample sizes
   and noisy data of the microarray-based high-throughput gene expression
   have become limitations since the generalization of the results is not
   guaranteed and the cross-study validation often failed to replicate the
   results [[157]48]. Potential solutions, such as cross-platform
   normalization or cross-study normalization for multi-data integration,
   have been introduced and have achieved considerable success
   [[158]49,[159]50]. In addition, conducting feature selection prior to
   classification can help reduce highly noisy attributes while
   simultaneously improving the prediction accuracy for the class
   assignment analysis [[160]51,[161]52]. Finally, but importantly, the
   current limitations in the progress of diagnosing and determining the
   prognosis of RA may come from the substantial portion of
   mechanism-based elucidation. An unbiased, data-driven approach
   emphasizing the roles of a small subset of genes that are distinctly
   different between comparative conditions may additionally provide
   potential solutions for early diagnosis. In our study, 16 genes were
   selected as the biomarker signature for differentiating between RA and
   OA using an unbiased, data-driven method. The validity of the proposed
   signature was then challenged using independent data sets with various
   machine learning classification algorithms. The classification
   performances derived from five commonly used classifiers in the
   external test set suggested that RA and OA could be accurately
   differentiated with an accuracy higher than 90% using our 16-gene panel
   and machine learning approaches, even by ‘weak’ classifiers such as
   kNN. Accordingly, these genes are worth considering as novel biomarkers
   for the early differential diagnosis of RA and OA. Among others—CD8A,
   GZMB, and KLRD1, for instance—are strongly associated with one another
   in the STRING protein–protein interaction network and are associated
   with lymphocyte-mediated immunity.

   Although the 16-gene panel was derived using a data-driven approach,
   several individual biomarkers showed a clear biological connection with
   the arthritis conditions ([162]Table S4). However, the role of TMOD1,
   SGCA, RAB22A, ANK3, PTPN3, FBXL7, CLU, POP7, and MXRA7 in inflammatory
   diseases, especially in arthritis, has not been studied thoroughly.
   Further studies are needed to clarify their biological function.
   Moreover, the recurrent inflammation of the synovial tissues suggests
   the complexity of the immune responses in the pathobiology of RA, as
   also being mentioned in a recent meta-analysis [[163]53]. Many other
   biological pathways are involved in various heterogeneous processes at
   different stages of the disease [[164]54]. Unlike other forms of
   arthritis, destruction is the nature of RA. Hence, early diagnosis and
   management are especially important. Pain, structural damage,
   functional loss, comorbidity, and other complications of RA lead to the
   permanent disability of patients as well as to increased socioeconomic
   cost [[165]55]. On the other hand, the main pathological condition of
   OA is the degradation of cartilage, which eventually results in joint
   dysfunction [[166]56]. Our functional analysis showed general
   agreement; thus, confirmed the previously reported differences in the
   mechanisms of the two diseases and suggested novel pathways that are
   prominently enriched in RA compared to OA [[167]57,[168]58,[169]59].
   From this data, further mechanistic studies are needed to obtain better
   insights into the pathophysiology of RA as well as other forms of
   arthritis in order to improve patient management.

   Our work has some limitations that should be explicitly stated. All
   analyses were conducted using tissues that might not represent the best
   approach for RA and OA differential diagnosis. However, synovial
   tissues are usually informatively rich with respect to the underlying
   pathological processes of the diseases; therefore, our findings may
   give additional information for the improvement of the patient
   management and mechanistic studies. Second, only transcriptome
   information of patients was utilized for the differential analysis so
   the effects of confounding factors remain unknown. Additionally, the
   lack of assessment using synovial fluids or serology prevented us from
   conducting deeper data mining. However, we demonstrated that
   sophisticated algorithms are powerful in introducing and validating
   potential biomarker candidates. We also illustrated an application of a
   human-friendly rule set to interpret the ‘black-box’ RF model. Further
   investigations that take into account the patients’ bio-parameters,
   omics data, and statistical learning methods will greatly improve our
   understanding and clinical practice to provide better care for the
   patients.

5. Conclusions

   The differential diagnosis of RA from other forms of arthritis,
   especially OA, using synovial biopsy is a novel approach. Our study
   introduced and successfully validated a 16-gene signature derived from
   gene expression profiling data of synovial tissues that could be
   employed for the classification of RA and OA. The signature may also
   help more specifically treat patients with RA condition. The novelty of
   this investigation is based on the utility of powerful statistical
   learning methods for variable selection and modeling. Moreover, we
   applied an explainer algorithm to interpret the rationale of the
   black-box predicting algorithms. Further clinical studies are warranted
   to validate this biomarker for the diagnosis of rheumatoid arthritis
   patients.

Acknowledgments