Abstract

The grade of a cancer is a measure of its malignancy level, and the stage of a cancer refers to its size and the extent to which it has spread. Here we present a computational method for predicting gene signatures and blood/urine protein markers for breast cancer grades and stages based on RNA-seq data, which were retrieved from the TCGA breast cancer dataset and cover 111 pairs of cancer tissues and matching adjacent noncancerous tissues with pathologist-assigned stages and grades. By applying differential expression analysis and an SVM-based classification approach, we found 324 genes and 227 genes whose expression levels in cancer are consistently up-regulated relative to their matching controls in a grade-dependent and a stage-dependent manner, respectively. Using these genes, we predicted a 9-gene panel as a gene signature for distinguishing poorly differentiated from moderately and well differentiated breast cancers, and a 19-gene panel as a gene signature for discriminating between moderately and well differentiated breast cancers. Similarly, a 30-gene panel and a 21-gene panel are predicted as gene signatures for distinguishing advanced stage (stages III-IV) from early stage (stages I-II) cancer samples and for distinguishing stage II from stage I samples, respectively. We expect that these gene panels can be used as gene-expression signatures for cancer grade and stage classification. In addition, of the 324 grade-dependent genes, 188 and 66 encode proteins that are predicted to be blood-secretory and urine-excretory, respectively; and of the 227 stage-dependent genes, 123 and 51 encode proteins predicted to be blood-secretory and urine-excretory, respectively. We anticipate that some combinations of these blood and urine proteins could serve as markers for monitoring breast cancer at specific grades and stages through blood and urine tests.

Introduction

Breast cancer is a major threat to women's health, accounting for 22.9% of cancer cases in women [1].
According to the World Cancer Report [1], 458,503 breast cancer-associated deaths were reported worldwide in 2008, representing 13.7% of cancer-related deaths in women. It is generally understood that breast cancers, and probably other cancer types as well, of different stages and grades require different treatment plans. For example, breast-conserving surgery plus radiation therapy is effective for most patients with early stage breast cancers [2], while systemic therapy, such as hormone therapy or chemotherapy, is generally needed for advanced stage patients in addition to cancer-removal surgery and radiation. In addition, cancer grades are strongly associated with prognosis [3]. Specifically, better-differentiated cancers tend to have a more favorable prognosis. Clearly, correct classification of the grade and stage of a cancer has significant implications for determining a patient's treatment plan.

The stage of a cancer reflects the size of the tumor and the extent of its invasion. It has traditionally been determined by cancer pathologists based on tumor size, nodal spread and metastasis [4]. In the recent past, molecular-level information has been incorporated into the decision process of cancer staging, using markers such as alpha-fetoprotein and lactate dehydrogenase for the staging of germ cell tumors [5]. A widely used staging system classifies cancer tissues into four stages, namely I, II, III and IV, with a higher stage representing a more advanced cancer.

Cancer grade is a measure of malignancy and aggressiveness that is independent of stage. Unlike staging, cancer grading has been done predominantly through visual inspection of cell morphology and tissue structure [3], and generally does not use molecular-level information. Compared to stage determination, it is a less developed area in cancer classification.
Currently there is no universal grading system for all cancer types. Instead, the research communities of a few cancer types have each developed their own grading systems, such as the one for breast cancer developed by Bloom and Richardson [6], the Gleason system for prostate cancer [7] and the Fuhrman method for kidney cancer [8]. While there are some differences in the detailed classification criteria, these grading systems generally classify cancer tissues into four grades: well differentiated (WD), moderately differentiated (MD), poorly differentiated (PD) and undifferentiated (UD).

A number of computational studies have been published on cancer staging and grading prediction based on transcriptomic data. For example, Cui et al. have reported a 198-gene and a 10-gene panel for grading and staging prediction of gastric cancers, respectively [9]. For breast cancer, a grade index based on the expression of 97 genes in cancer tissues was previously developed to classify patients with grade 2 tumors into two subgroups with high versus low risks of recurrence [10]. However, markers developed in this way have had only limited applications, since tissue-based gene-expression data are generally not available for most patients [11, 12]. Hence, it is essential to extend tissue-based gene markers to markers that can be measured using blood or urine samples of patients [13, 14]. The challenge is to predict reliably which of the proteins overexpressed in cancer tissues can be secreted into blood and further excreted into urine.

In this study, we conducted a computational analysis of tissue-based gene-expression data to identify possible gene signatures and blood/urine protein markers for breast cancer grading and staging prediction.
To the best of our knowledge, the unique contributions of this study are: (1) RNA-seq-based gene-expression signatures for breast cancer grading and staging prediction; and (2) predicted potential marker proteins for cancer staging and grading that can be measured using blood and urine samples. Clearly, this work represents only a pilot study for prediction of blood and urine marker proteins for breast cancer grading and staging. We expect that follow-up studies will demonstrate the feasibility of the predicted signature genes and protein markers.

Results

A. Identification of gene signatures for breast cancer

(1) Identification of gene groups whose expressions distinguish breast cancer from other cancers

Gene-expression data for 111 pairs of breast cancer and adjacent control tissue samples were retrieved from the TCGA database [15], where each gene-expression dataset covers 20,501 human genes measured using RNA-seq. A total of 5,562 genes differentially expressed between cancer and matching control tissues were identified using the following criterion: the expression level of a gene in cancer shows at least a 2-fold change relative to the matching control tissues, with a q-value < 0.05 to control the false discovery rate (FDR) (see Material and Methods). Among the 5,562 genes, 2,078 were up-regulated, and 853 of these were found to be up-regulated in fewer than three of the 12 other cancer types that were examined in our study as references, hence making them as good