Abstract

Myopathy refers to a large group of heterogeneous, rare muscle diseases. Bulk RNA-sequencing has been utilized for the diagnosis and research of these diseases for many years. However, the existing valuable sequencing data often lack integration and clinical interpretation. In this study, we integrated bulk RNA-sequencing data from 1221 human skeletal muscles (292 with myopathies, 929 controls) from both databases and our local samples. By applying a method similar to single-cell analysis, we revealed a general spectrum of muscle diseases, ranging from healthy to mild disease, moderate muscle wasting, and severe muscle disease. This spectrum was further partly validated in three specific myopathies (97 muscles) through clinical features including trinucleotide repeat expansion, magnetic resonance imaging fat fraction, pathology, and clinical severity scores. This spectrum helped us identify 234 genuinely healthy muscles as unprecedented controls, providing a new perspective for deciphering the hallmark genes and pathways among different myopathies. The newly identified featured genes of general myopathy, inclusion body myositis, and titinopathy were highly expressed in our local muscles, as validated by quantitative polymerase chain reaction.

Subject terms: Neuromuscular disease, Molecular medicine

This study analysed bulk RNA-sequencing data from 1221 human skeletal muscles and validated the progressive spectrum of myopathy diseases.

Introduction

Myopathy is a general term that refers to a large group of diseases primarily affecting the skeletal muscles. These conditions can be categorized into inherited or acquired forms, according to their aetiology. Myopathies exhibit heterogeneous phenotypes, including weakness, abnormal gait, muscle pain, difficulty swallowing, contractures, and systemic impairments, among others^1–3.
One common feature of myopathies is a wide spectrum of severity, similar to other progressive diseases, which includes asymptomatic, mild-to-moderate, and severe stages^4–7. However, this spectrum is usually based more on clinical observation than on well-established objective findings.

Bulk RNA-sequencing of skeletal muscle has been utilized in the diagnosis and research of muscle diseases (e.g., for ectopic splicing and molecular mechanism investigation)^8–10. It offers a sensitive perspective for understanding the ongoing molecular activities in the muscle. Numerous studies have deposited their transcriptional data in various online databases, and these data are of significant value for integration, especially considering the rarity of most myopathies. However, these isolated datasets are prone to various biases arising from small sample sizes, the selection of control materials, sequencing methods, differing read lengths, etc.^11. Furthermore, with the emergence of more advanced and sophisticated technologies developed for myopathies in the past decade, deep phenotyping (including quantitative MRI, muscle biopsy evaluation, and CTG expansion size in myotonic dystrophy) provides a multi-dimensional evaluation to depict the muscle deterioration process. Correlating these muscle-specific features with their genetic data can assist in characterizing and deciphering myopathies.

In this study, we integrated transcriptional data from 1221 human skeletal muscles obtained from both online databases and our local patients, ultimately identifying a general spectrum separating normal and myopathy-affected muscles. In contrast to the traditional approach of focusing primarily on diseased samples, we reversed the perspective, emphasizing the control samples to characterize this spectrum, which we validated using clinical features from different sources.
We offered a novel perspective by using genuinely healthy muscles as an unprecedented control reference, aiming to identify both common pathways and specific features of the studied myopathies.

Methods

Data source and participant selection

This is a retrospective integrative analysis (Fig. 1). The data sources include 803 muscles from the GTEx Consortium (dbGaP Accession phs000424.v8.p2)^12, 291 muscles from the GEO database (GSE115650^13, GSE140261^14, GSE175861^15, GSE184951^16, GSE201255^17, GSE202745^18), and 127 muscles from Helsinki (39 of which have also been reported as GSE151757^19). The use of local muscles was approved by the ethics committee of HUS (Helsingin Uudenmaan Sairaanhoitopiiri; approval number 195/13/03/00/11), and informed consent was obtained from each subject. All ethical regulations relevant to human research participants were followed. The inclusion and exclusion criteria for participant selection were as follows: (1) only human skeletal muscle tissue was included (no cell lines or organoids); (2) bulk RNA-sequencing was performed using high-throughput techniques (no chip arrays or single-cell data); (3) datasets were preserved in raw count format (those shared in transformed count format were excluded).

Fig. 1. The workflow.
Human skeletal muscle bulk RNA-seq data from three sources (GTEx database, GEO database, and Helsinki) were integrated into a combined dataset (1221 muscles × 9231 genes). A spectrum order can be observed in this integrated dataset: Healthy → Mild disease → Moderate muscle wasting → Severe muscle disease. Different clinical features were mapped to the transcriptional data to validate this spectrum order. Tissue deconvolution was performed using skeletal muscle single-cell datasets as references, allowing us to infer the cell type composition in