Abstract Interpreting transcriptomic data presents significant challenges, particularly in non-targeted approaches. While modern functional enrichment methods are well-suited for experimental designs involving two conditions, they are less applicable to data series. In this context, we developed Cluefish, a free and open-source, semi-automated R workflow designed for untargeted, comprehensive biological interpretation of transcriptomic data series. Cluefish applies over-representation analysis on pre-clustered protein–protein interaction networks, using clusters as anchors to identify smaller, more specific biological functions. Innovative features, including cluster merging and recovery of isolated genes through shared biological contexts, enable a more complete exploration of the data. We applied Cluefish to an in-house dataset with zebrafish exposed to a dose-gradient of dibutyl phthalate and to two published toxicology datasets featuring different organisms. Combined with DRomics, a tool for dose–response analysis—Cluefish identified gene clusters deregulated at low doses and linked to biological functions overlooked by the standard approach. Notably, it revealed that retinoid signaling disruption may be the most sensitive pathway affected by dibutyl phthalate during zebrafish development, potentially leading to morphological changes. The Cluefish workflow aims to provide valuable clues for biological hypothesis generation and experimental validation. It is freely available at [34]https://github.com/ellfran-7/cluefish. Graphical Abstract Graphical Abstract. [35]Graphical Abstract [36]Open in a new tab Introduction Advances in omics approaches has made the comprehensive measurement of DNA, RNA, proteins, and metabolites in biological samples a standard practice [[37]1]. This has led to a massive increase in generated data, providing a goldmine for uncovering novel biological functions, genotype–phenotype relationships, and elucidating molecular mechanisms [[38]2]. However, the sheer volume presents significant challenges in both analysis and interpretation [[39]3, [40]4]. In the context of transcriptomics, analyses yield extensive transcript lists that require impractical manual literature reviews. To address this, functional enrichment analysis, also known as pathway enrichment analysis, has become the standard approach. This method condenses large gene lists into more manageable and interpretable sets of biological functions or pathways [[41]5]. Functional enrichment relies on predefined gene sets representing molecular functions, biological processes, or pathways, as outlined by databases like the Gene Ontology (GO) [[42]6], pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [[43]7] and Wikipathways (WP) [[44]8, [45]9], or experimentally derived gene sets available in the MSigDB [[46]10, [47]11]. Functional enrichment methods have evolved over four generations. The first generation consists of over-representation analysis (ORA) methods, which test whether a predefined gene set contains a disproportionately large number of deregulated genes compared to a uniform random sample of genes among all the genes detected in the experiment. In this case, the functional gene set is considered as enriched. The second generation includes Functional Class Scoring (FCS) methods, which evaluate whether genes within a set are concentrated at the top or bottom of a ranked list of all the genes from the experiment, based on expression changes such as fold-change [[48]11]. FCS improves on ORA by considering coordinated changes in functionally related gene sets, though it still treats genes as independent [[49]12]. The third generation introduced Pathway Topology (PT)-based methods, which leverage pathway topology information from databases such as KEGG and Reactome [[50]13]. These methods utilize knowledge about gene positions, roles and interactions within pathways [[51]14, [52]15]. While PT-based methods improve on ORA and FCS by incorporating gene dependencies and interactions, they still view pathways as isolated entities and thus overlook pathway crosstalk [[53]16]. The most recent and fourth generation, Network Topology (NT)-based approaches, goes a step further by systematically considering pathway crosstalk within a network [[54]17]. NT-based methods utilize biological interaction networks, such as those based on protein–protein interactions (PPI) from databases like FunCoup and STRING [[55]18, [56]19]. These approaches examine changes in network structure and correlations to provide a more integrated understanding of pathway interactions, accounting for crosstalk and overlap between pathways—elements neglected by previous generations [[57]20]. However, despite their potential, NT-based methods are limited. This is because biological networks are often incomplete, with many missing or unreliable interactions, and these methods tend to focus solely on the network itself, which restricts the effectiveness and completeness of the analyses [[58]17]. As a result, earlier generations of functional enrichment methods still remain largely used in practice. While these earlier methods are commonly applied to transcriptomic data with two or three conditions, dealing with data involving many ordered conditions—referred to as “data series”—poses substantial complexity, even before functional enrichment. Data series capture measurements collected over successive points in time, space, or under varying conditions (e.g. different doses of a contaminant). In dose–response (DR) designs, which explore the relationship between exposure and effect, differential gene expression (DEG) analysis, although often used, is not well-suited. DEG relies on the comparison of the response at each dose to the control (or sometimes on pairwise comparisons across all conditions), multiplying the number of tests, applying each to a subset of the data, and generating multiple fold-change values for each transcript. In contrast, DR omics analysis replaces DEG by leveraging the entire DR relationship for each transcript [[59]21]. Tools, like the DRomics R package [[60]22], identify significantly deregulated transcripts and model their DR relationships. These models are used to characterize the response (whether monotonic or biphasic) and to calculate a benchmark dose (BMD) for each transcript. A BMD is a dose or concentration that produces a predetermined change in the response rate of an adverse effect, reflecting the sensitivity of a transcript to a given stressor [[61]22]. However, FCS, and PT-based enrichment, which rely on fold-change values, are not adapted for functional enrichment in a DR context [[62]12]. In contrast, ORA is currently the only suitable method available to complement DR modeling, as it exclusively uses deregulated gene lists. Alternative approaches exist for analyzing data series, such as weighted gene co-expression network analysis (WGCNA) [[63]23], which clusters genes based on expression correlation patterns across conditions. Other methods like maSigPro [[64]24], RNFuzzyApp [[65]25], and TRAP [[66]26], tailored for time-series RNA-seq data, offer comprehensive analytical frameworks that include, among other utilities, identification of genes with significant temporal expression changes and clustering of expression patterns. Nevertheless, to derive biological meaning from these identified gene sets and clusters, functional enrichment methods, predominantly ORA, remain necessary downstream steps, whether or not they are included in the workflow [[67]27–31]. However, ORA comes with several other limitations. The enrichment results, often derived from Fisher’s exact tests, tends to highlight broad biological functions. Indeed, when using the entire list of deregulated genes, the large test denominator makes it more challenging to achieve the statistical significance needed to identify smaller or lesser-known functions. This tendency is intrinsically tied to the granularity of gene sets in certain databases (e.g. GO). Gene sets may be overly specific, while others are too general, potentially compromising the relevance of the analysis [[68]32, [69]33]. Additionally, the enrichment results represent only a fragmented portion of the deregulated gene list, as the interpretation is based solely on the biological functions and the genes participating in their enrichment. Consequently, relying solely on ORA, as done in the standard approach, is not sufficient. Additional methods are needed in combination to address the shortcomings of the standard approach. Our main goal was to create a workflow that allows to include a greater part of the deregulated gene list in the interpretation, capturing a more complete picture of the transcriptomic data. We also aimed to improve functional enrichment detection sensitivity to identify more specific and smaller biological functions while minimizing redundancy. Furthermore, we wanted to ensure the workflow is applicable to a wide range of both model and non-model organisms. To meet these goals, we constructed Cluefish (Clustering, Enrichment, and Fishing), a semi-automated workflow designed for comprehensive and untargeted exploration of transcriptomic data series. We developed Cluefish using a DR transcriptomic dataset from zebrafish embryos exposed to varying doses of dibutyl phthalate (DBP), a plasticiser found in human biological samples [[70]34] and environmental surface waters [[71]35]. This main dataset allowed us to fully test and apply the workflow, conducting an in-depth biological interpretation. To demonstrate Cluefish’s versatility across different organisms with varying annotation richness in knowledgebases, we extended our validation to two secondary externally published DR transcriptomic datasets: Sprague Dawley rat (Rattus norvegicus) liver exposed to a dose-gradient of perfluorooctanoic acid (PFOA) [[72]36], and Poplar (Populus canadensis) roots exposed to a dose-gradient of phenanthrene (PHE) [[73]37]. Cluefish is open-source and available under the CeCILL license, offering a powerful tool for improving biological interpretation and aiding in the development of targeted interventions. Materials and methods Zebrafish dataset Fish Zebrafish (Danio rerio) embryos (AB strain) were obtained from the Luxembourg Centre for Systems Biomedicine (LCSB) aquatic platform (Luxembourg University, Belval, Luxembourg). They were collected within 1 h after spawn, placed in standardized fish water at 26°C [NF EN ISO 7346-2 (1998), pre-oxygenated by bubbling for 12h] and transferred to the LIEC laboratory facilities (Metz, France). DBP solutions preparation and DBP quantification Commercial DBP (purity 99%, CAS no. 84-74-2) was purchased from Sigma-Aldrich (St. Louis, MO, USA). Solvent-free DBP stock solution was made in fish water at an effective concentration of 5.4 mg/l. Stock solution and fish water effective DBP concentration quantification were performed 1 day before the beginning of exposure (three technical replicates) on an Ultimate 3000 Rs chromatography system with UV detection at 230 nm (DAD) (Dionex, CA, USA). Quantification was performed on an injected volume of 10 μl, using an Acquity UPLC HSS T3 2.5 μm, 2.1 × 100 mm column (Waters, Germany) as stationary phase, and 60% acetonitrile and 40% H[2]O as mobile phase, under a flow of 8 ml/min at 25°C. Fish water DBP quantification were below the limit of detection (LOD) of 3 μg/l. The stock solutions were stored in the dark at 26°C for the all experiment and were stable for at least the duration of exposure under these conditions ([74]Supplementary Table S1). Tests solutions of DBP were made daily in fish water by serial dilution of the DBP stock solution in oxygenated fish water during the whole experiment to reach a concentration gradient [0–5–10–50–100 μg/l] built from environmental concentrations (median to high) up to the NOEC of DBP for chronic exposure in fish (100 μg/l), according to the European CHemicals Agency (ECHA). To avoid any external contamination of test solutions by phthalates or other plasticisers, only glassware was used for fish water, stock and test DBP solutions preparation, handling, and storage. Zebrafish embryos exposure and sample collection Thirty zebrafish embryos (256 cells stage [[75]38]) were randomly distributed in glass beakers with 60 ml of test solution, with five replicate beakers of 30 embryos for each exposure condition. Embryos were kept in a climate-controlled chamber (Cooled incubator MIR-554-PE, PHC Europe B.V.) at 26 (±0.5)°C, under 14:10 light/dark photoperiod (20 μE/m²/s). Exposure was semi-dynamic, with 1/3 of exposure solution volume being replaced every day by freshly prepared test solutions. After 120h exposure to DBP, pools of 10 embryos (“protruding mouth” stage [[76]38]) were collected as one sample, flash-frozen in liquid nitrogen, and stored at −80°C before RNA extraction. Remaining embryos were fixed in 4% paraformaldehyde (PFA) in PBS at 4°C for 1 week and washed several times with PBS before morphological assessment. Transcriptome sequencing Total RNA of zebrafish embryos (pools of 10 embryos, 5 replicates per condition) was extracted using the RNeasy mini extraction kit (Qiagen). After quality and integrity assessment of extracted RNA by NanoDrop (Thermo Fisher Scientific, Waltham, MA) and by the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, CA, USA), the three higher quality RNA samples per condition (A260/A280 > 1.7, A260/A230 > 2, RIN > 8.8) were selected for high throughput mRNA sequencing. mRNA libraries were prepared after poly-A enrichment from 250 ng total RNA using NEBNext^® Poly(A) mRNA Magnetic Isolation Module (New England Biolabs) followed by NEBNext^® Ultra^™II Directional RNA library Prep kit (New England Biolabs). Library size was checked by the DNA 1000 Assay Kit of the Bioanalyzer 2100 system. Libraries were sequenced on a NextSeq500 Illumina sequencer with 1 × 75 bp single reads, providing 40 to 57 million reads for each sample. Library preparation, quality check and sequencing were performed by Helixio, France. mRNAseq data processing Raw mRNAseq data processing was performed on the Galaxy french server provided by the “Institut Français de Bioinformatique” (IFB). Raw reads were filtered according to size (>55 bp) and quality parameters (mean Phred score > 20 on 5′ and 3′ bases and on a 4 bp sliding window) using Trimmomatic [[77]39], leading to keep 87% to 92% of total reads according to the sample. Cleaned reads were mapped to the zebrafish reference transcriptome GCRz11 and quantified using Salmon quant (v. 1.5.1) [[78]40]. Each transcript was annotated with the Ensembl transcript stable identifier (ID; e.g. ENSDART00000121728.3). Embryos morphological traits Lateral images of the fixed embryos were taken using a digital microscope (VMX-7000, KEYENCE, France). Body length and left eye area were measured using the “VHX-6000_950F Measurement Data Tabulation Tool” image processing software (KEYENCE, France). External datasets To further evaluate the generalisability of Cluefish, datasets of two external transcriptomic studies were included. These studies were selected based on the inclusion of DR experimental designs and the contrasting richness of organism gene annotation available in public knowledgebases between them. From Gwinn et al. [[79]36], we selected the dataset exploring male Sprague Dawley rat (Rattus norvegicus) liver tissue subjected to PFOA, as their analysis reported a high number of deregulated transcripts and many enriched biological functions. The experiment involved exposing male rats, aged 8–10 weeks, to a concentration gradient of PFOA [0–0.156–0.3125 – 0.625 – 1.25 – 2.5 – 5 – 10 – 20 mg/kg] for up to 5 days. Each condition included four biological replicates. RNA was extracted from liver tissue and analyzed via TempO-Seq. As described in the paper [[80]36], filtered reads were aligned to the rat S1500 + probe sequences. Each transcript was labelled with a unique combination of NCBI gene symbol and probe IDs specific to TempO-Seq assays (e.g. NFIL3_9304). The count matrix was retrieved from the Gene Expression Omnibus (GEO) under the accession number [81]GSE147072. From Gréau et al. 2024 [[82]37], we selected the dataset exploring root tissue exposed to PHE as their analysis reported the highest number of deregulated transcripts. Poplar plants (Populus canadensis) were exposed for 4 weeks to a PHE concentration gradient [0–100–200–400–700–1000–1500–2000 mg kg^−1]. Each condition included eight biological replicates. RNA was extracted from root tissue and analyzed via RNAseq. As described in the paper [[83]37], filtered reads were mapped and counted on the Populus trichocarpa genome (V4, Phytozome). Each transcript was annotated with the Ensembl gene stable ID (e.g. Potri.001G000400). The count matrix was retrieved from GEO under the accession number [84]GSE263776. The three datasets, the main dataset and two external ones, will hereafter be referred to as the zebrafish, rat liver, and poplar root datasets. Dose–response analysis The DRomics R package was used to identify significantly deregulated transcripts and model DR curves. The default parameters in DRomics were applied to all three datasets [[85]41]. Transcript count datasets were imported, normalized, and transformed into a logarithmic scale using the rlog method within DRomics [[86]42]. Significantly deregulated transcripts were selected using quadratic trend tests with a false discovery rate (FDR) of 0.05 [[87]43]. DR models (linear, Hill, exponential, Gauss-probit, or log-Gauss-probit) were fitted using non-linear regression and selected based on the lowest second-order Akaike information criterion [[88]44]. The modeled DR curves were characterized by their trends (bell-shaped, decreasing, increasing, and U-shaped) and yielded a benchmark dose (BMD-1SD) [[89]22], calculated from a benchmark response (BMR-1SD) using the following equation: Inline graphic , where y0 is the theoretical level at the control given by the DR model and SD is the residual standard deviation of the modeled DR curve [[90]22, [91]45]. The 95% confidence intervals (CIs) on the BMD-1SD values were computed by non-parametric bootstrap (5000 iterations). Modeled DR curves were discarded if the point or the interval estimates of the BMD could not be successfully calculated. For the zebrafish embryo morphological traits, body length and left eye area were likewise analyzed using the DRomics workflow for continuous anchoring (individual) data. The Cluefish workflow Cluefish is composed of 11 main data analysis steps, followed by optional data visualization steps (Fig. [92]1). The workflow was applied to the in-house zebrafish dataset and the two external published datasets. Default Cluefish parameters and detailed steps for the zebrafish case study are described below, whereas details on the choices made for the two external datasets are provided in the [93]Supplementary Text S1. Each step is as follows: Figure 1. [94]Figure 1. [95]Open in a new tab Schematic of Cluefish workflow. Steps performed in R are shown in single-line frames, with dashed frames for optional steps, while the step carried out in Cytoscape is shown within a double lined frame. In Step 5, deregulated genes, which code for proteins in STRING, are represented as nodes. Their interactions are illustrated by lines. Clusters are represented by full circles, while isolated genes are categorized as lonely genes, forming the lonely cluster, represented by a dashed circle. The schematic guides the user through the various steps but does not explicitly depict the output-to-input relationships, as some steps require multiple previous outputs. The full detailed workflow is provided in the vignette ([96]https://ellfran-7.github.io/cluefish/articles/cluefish.html) *The final step can only be performed if the list of deregulated transcripts loaded in Step 2 is derived from the outputs of DRomics. Abbreviations: bg: background, drg: deregulated. Step 1: download transcription (co-)factor annotations * Transcription factors (TFs) and transcriptional co-factors (CoTFs) form multi-protein complexes that mediate cellular signaling and facilitate gene transcription [[97]46, [98]47]. This combinatorial function establishes a crucial link between signaling pathways and gene regulation [[99]48]. To incorporate this regulatory information, lists of TF and CoTF genes for the specified organism are obtained from the Animal Transcription Factor Database 4.0 (AnimalTFDB4) [[100]49]. If these files are not available for a particular organism, this step can be skipped. The zebrafish TF and CoTF gene lists were successfully downloaded. Step 2: load background and deregulated transcript lists * The background transcript list includes all transcript IDs detected in the experiment, whereas the deregulated transcript list contains only the IDs of significantly deregulated transcripts. The deregulated transcript list was derived from the DRomics pipeline in our case studies. Step 3: retrieve gene identifiers * Transcript IDs are converted to gene IDs and other pertinent annotations by linking with the Ensembl database via the biomaRt R package [[101]50, [102]51]. To ensure compatibility with databases and tools used later in Cluefish, specifically the STRING database and g:Profiler [[103]52], the results must include IDs supporting the organism in both platforms. In most cases, the Ensembl gene ID meets this requirement for well-referenced organisms, significantly reducing the need for repeated ID conversions throughout the workflow. In this step, all detected transcript IDs are mapped to their corresponding Ensembl gene IDs, forming the foundation for the deregulated gene list. In our case, the zebrafish transcript IDs were converted to their Ensembl stable gene ID and external gene name, corresponding to readable gene names. Step 4: determine deregulated gene regulatory status * The next step involves determining the regulatory status of the deregulated genes, using the TF and CoTF annotation files acquired earlier. If these files are not available for a particular organism, this step can be skipped. Since the files were successfully downloaded for zebrafish, it was possible to assess the regulatory status of the deregulated genes. Step 5: construct and retrieve the clustered protein–protein interaction (PPI) network * The preceding output is then manually imported into the Cytoscape application [[104]53]. Using the StringApp plugin [[105]54], the deregulated transcripts’ STRING-compatible IDs are entered into the Protein Query search bar. This query process is customisable, offering inclusion of specific interaction types, setting limits for interaction confidence scores, and supplementary interactions. The default value in STRING is set at 0.4 for the confidence score limit to construct the PPI network, representing the probability that the interaction really exists given the available evidence. In our zebrafish case study, a confidence score of 0.9 was utilized to only consider highly confident interactions, and we included all interaction types. The resulting PPI network was constructed from both known and predicted protein–protein associations from the STRING database. To identify clusters, the Markov Chain Clustering (MCL) algorithm is applied using the clustermaker2 extension [[106]55, [107]56]. The granularity of clusters is controlled by the inflation parameter, with lower values [[108]1, [109]2] yielding fewer, larger clusters, and higher values [[110]5, [111]6] resulting in more, smaller clusters. Very large clusters may lack coherent biological context, while very small clusters may not represent a biological complex or may only be involved in specific cases. In our zebrafish case study, this parameter was set at 4 which is the default value in clustermaker2. This setting balances the creation of few large clusters and many small clusters. The node table, containing clustered elements, is manually exported as a CSV file. Step 6: filter clusters * Returning to the R environment, clusters are filtered based on their gene set sizes using what is called the lower cluster size filter, under which clusters are removed. By default, in Cluefish, this filter is set at 4. The choice of lower cluster size limit may depend on the study, organism, and earlier decisions during PPI network creation and clustering. Higher size limits result in fewer large clusters, whereas smaller limits retain both large and small clusters, removing only very small clusters. In our zebrafish case study, the default value was used. Step 7: perform cluster-wise functional enrichment * To characterize each cluster, ORA can be conducted using the gprofiler2 R package [[112]57], which interfaces with g:Profiler. This tool was chosen firstly for its extensive support of both model and non-model organisms, totaling 984 and the ability to use custom annotation data. The main tool of interest in g:Profiler is g:GOst, which can perform ORA on multiple gene lists and search against a wide range of functional databases, including GO, KEGG, and Wikipathways (WP). Specifically, Fisher’s one-tailed tests can be applied to measure the likelihood that the overlap between a query gene set and biological function gene sets is due to chance. Another important aspect of g:Profiler is its access to previous versions of the software and data, ensuring reproducible analyses. Furthermore, it offers a filtering method for highlighting GO driver terms by considering the underlying topology of annotations, as described in Kolberg et al. 2023 [[113]52]. By default, in Cluefish, the deregulated gene list is concurrently analyzed against GO Biological Processes (GO:BP), KEGG, and WP pathways, using the background gene list comprising all annotated protein-coding genes from the experiment. The FDR multiple testing correction method is applied, with a significance limit set at 0.05. To address the inherent complexity and redundancy within GO term hierarchies, only highlighted GO:BP terms are retained. In our zebrafish case study, the default settings were chosen. In addition to these, we implemented two new filters. * The first one, called the biological function size filter, are the minimum and maximum gene set sizes of biological functions, i.e. the number of genes associated with each biological function found within the background gene list. By default, in Cluefish, lower and upper biological function size limits are set at 5 and 500 respectively. Very large gene sets, which are dependent on the organism, are often associated with broad biological processes. These can introduce noise and obscure more specific functions. Conversely, very small gene sets hold little statistical support, as minor changs in gene membership can substantially affect FDR values, leading to unstable enrichment. Additionally, very small gene sets can be overly specific and may not capture the full biological context of a pathway. These lower and upper limits can be adjusted based on the study, the organism, and the type of transcriptomic data (e.g. whole transcriptome or tissue-specific). In our zebrafish case study, the default values were used. * The second filter, called the enrichment gene count filter, is the minimum number of genes participating in the enrichment of a biological function. By default, in Cluefish, the enrichment gene count filter is set at 3 in order to avoid identifying biological functions enriched by too few genes, particularly in smaller clusters. Such clusters can lead to a higher rate of false positives, and enrichments based on very few genes may not provide substantial biological insight into the cluster itself. This limit should be proportionate to the cluster sizes, particularly the smallest clusters. In our zebrafish case study, the default value was used. Step 8: merge clusters * After characterizing the clusters, the next step is to identify and merge clusters that share the same enriched biological functions from at least one specified database (e.g. GO, KEGG etc.) in the ORA. By default, in Cluefish, the merging process is conducted separately for each chosen database, in descending order of the total number of unique genes present in both the database and the background list. This step combines clusters whose proteins were insufficiently interactive in STRING to form a single cluster but are shown through functional enrichment to participate in the same biological processes. Merged clusters thus form larger, contextually unified groups representing distinct biological functions. In our zebrafish case study, the default settings were chosen, with clusters first merged based on shared GO Biological Process (GO:BP) functions first, followed by KEGG, and finally WP. Step 9: fish lonely genes * Genes not assigned to any cluster are referred to as « lonely genes ». These genes are not typically explored because they do not contribute to functional enrichment, and thus, they do not help in characterizing the data. Genes become lonely in one of two cases: either they are initially not sufficiently interactive with another protein complex to participate in or form a cluster, or they were initially part of a cluster but failed to pass the cluster size filtering step, thus falling into the lonely cluster. To achieve a more comprehensive analysis and expand clusters, lonely genes are incorporated into existing clusters based on shared functional annotations. Specifically, if a biological function annotation of a lonely gene matches a biological function enriched in a cluster, the lonely gene is integrated into the corresponding cluster. The friendliness metric represents the number of clusters a lonely gene can be incorporated into. Step 10: perform functional enrichment of the lonely cluster genes * The lonely cluster, now consisting of the remaining lonely genes, can undergo a simple functional enrichment analysis to gain further insight into the biological context of the genes not associated with any existing clusters. The customizations used in the earlier cluster-wise enrichment analysis are fully applicable here as well. Therefore, the parameters were set to match those chosen in the previous cluster-wise functional enrichment analysis for the zebrafish dataset. Step 11: generate outputs for exploration * To fully explore the Cluefish results, DRomics results are required in order to include DR modeling metrics. While DR analysis is not required for previous steps of Cluefish, it becomes essential for generating the following outputs. A summary table, which captures key details from each step and tool used in the Cluefish workflow, can be generated and exported to CSV for manual exploration. Additionally, plots of fitted DR curves for each cluster’s gene set can be generated and exported to PDF, offering a visual exploration of cluster content. Three report templates created using Quarto (Allaire, J.J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C., 2024. Quarto. Zenodo. [114]https://doi.org/10.5281/zenodo.5960048) are incorporated into the Cluefish compendium, which may be useful to users. Examples can be found in [115]Supplementary Files 3–[116]5. These templates summarize Cluefish results interactively, integrate them with DRomics findings, and enhance data usability for comprehensive biological interpretation. Although these reports can be generated automatically, they may require adjustments to suit various types of cases. The standard workflow To illustrate Cluefish’s capacity to improve biological interpretation of DR transcriptomic data, we compared it with the standard approach. This involved applying a simple functional enrichment analysis to the entire deregulated gene list, using the same parameters as those chosen for the cluster-wise enrichment step in Cluefish. For the zebrafish and rat liver datasets, three databases were used: GO:BP, KEGG, and WP. For the poplar root dataset, only GO:BP and KEGG were used, as WP lacked sufficient coverage. Both approaches were compared in terms of number of enriched biological functions and number of deregulated genes involved in them. Implementation For gene ID retrieval and functional enrichment analyses across all three datasets, we used Ensembl genome assemblies (release 111): GRCz11 (GCA_000002035.4) for zebrafish, mRatBN7.2 (GCA_015227675.2) for rat liver, and Pop_tri_v4 (GCA_000002775.4) for poplar root. Details of the versions of tools and databases used for each dataset are provided in [117]Supplementary Table S2. Cluefish is primarily implemented in the R statistical computing language [[118]58], with a short manual processing step halfway in the Cytoscape application. Dependencies in R, including binary dependencies, are automatically bundled and installed using renv (v. 1.0.7) [[119]59]. Cluefish is structured as a research compendium, which is an organized collection of files and directories encompassing all necessary data, code, and documentation required to reproduce the research [[120]60]. Git is utilized for version control (v. 2.45.2.windows.1.). The workflow requires two main data inputs: the background and deregulated transcript lists. If these inputs come from DRomics results, the entire Cluefish workflow can be used. However, if the inputs are from a different selection method, Cluefish can be applied up to Step 10. This is because the outputs for visual exploration depend on DRomics figures (e.g. DR curves) and the associated computed metrics. Cluefish is freely available at [121]https://github.com/ellfran-7/cluefish with an associated vignette serving as a guide through the Cluefish workflow with some additional functionalities available, which has not been included in the workflow description for the sake of simplicity ([122]https://ellfran-7.github.io/cluefish/articles/cluefish.html). Results Dose–response analysis A total of 41 396, 2654, and 34 699 transcripts were detected in at least one sample across the zebrafish, rat liver, and poplar root datasets, respectively. One control sample from the zebrafish dataset was excluded from the analysis due to its divergence in the principal component analysis (PCA) plot, indicating it was an outlier ([123]Supplementary Fig. S1). Thus, the analysis proceeded with the remaining 14 samples from the original 15, resulting in 39 890 transcripts forming the background transcript list for the zebrafish dataset. The main outputs of the DR analysis with DRomics are summarized in Tables [124]1 and [125]2 for all three datasets. Across all datasets, most deregulated features exhibited monotonic trends along the dose-gradient, with 93%, 79%, and 60% showing either increasing or decreasing trends for the zebrafish, rat liver, and poplar root datasets, respectively, and a minority of bell or U-shaped trends. Table 1. Number of transcripts across the three datasets (zebrafish, rat liver, and poplar root) showing the total experimental background, deregulated transcripts identified by DRomics, and deregulated transcripts with computed BMD and CIs Zebrafish Rat liver Poplar root Total background transcripts 39 890 2654 34 699 Deregulated transcripts 2449 873 6463 With BMD and CI 2433 843 6352 [126]Open in a new tab Table 2. Distribution of transcript trend types across the three datasets (zebrafish, rat liver, and poplar root). The table presents the number and percentage of transcripts exhibiting monotonic (increasing or decreasing) and biphasic (bell-shaped or U-shaped) trends among deregulated transcripts with computed BMD and CIs Zebrafish Rat liver Poplar root Increasing (inc) 1132 (46%) 314 (37%) 2156 (34%) Decreasing (dec) 1145 (47%) 353 (42%) 1683 (26%) Bell-shaped (bell) 64 (3%) 71 (8%) 1375 (22%) U-shaped (U) 92 (4%) 105 (13%) 1138 (18%) [127]Open in a new tab Comparison of both approaches Before performing functional enrichment analysis, transcript IDs were converted to Ensembl gene IDs for the zebrafish and rat liver datasets, as described in Step 3. The poplar root dataset was already annotated with Ensembl gene IDs. In the zebrafish dataset, the background and deregulated transcript lists (with computed BMD and CI) were converted to 27729 and 2365 unique Ensembl stable gene IDs, respectively. For the rat liver dataset, these lists were mapped to 2367 and 756 unique Ensembl stable gene IDs, respectively. Simple functional enrichment analysis revealed that the deregulated gene lists significantly enriched multiple biological functions across the GO:BP and KEGG databases in all datasets, and in the WikiPathways (WP) database for the zebrafish and rat liver datasets only. For the zebrafish dataset, the standard approach identified enrichement in 12 GO:BP, 6 KEGG and two WP biological functions. A total of 320 genes contributed to these enrichments, while 2045 genes did not ([128]Supplementary Table S3). Cluefish identified all 6 KEGG and 2 WP pathways found through the standard approach and additionally detected 45 KEGG and 8 WP pathways (Fig. [129]2A). In contrast, only 5 GO:BP terms were shared between both approaches. Cluefish identified 36 unique GO:BP terms, whereas the standard approach identified seven. These seven GO:BP functions were higher-level parent terms, representing more general and broader biological concepts. They were redundant with the lower-level child terms found by Cluefish, although they did not share the same GO term names. Cluefish led to the consideration of an additional 733 genes into enriched biological functions, accounting for 30% of the deregulated gene list (Fig. [130]2B). Of the 320 genes found by the standard approach, 293 were also recovered by Cluefish, leaving 27 unique to the standard approach. Figure 2. [131]Figure 2. [132]Open in a new tab Significantly enriched biological functions (left panel) and involved deregulated genes (right panel) for zebrafish (A and B), rat liver (C and D), and poplar root (E and F) datasets. Left panels (A, C, andE) illustrate the overlap of enriched functions by database and approach. Right panels (B, D, andF) display the overlap of deregulated genes involved in these enriched functions, representing the set of genes considered for biological interpretation for both approaches. Rounded squares indicate the total number of deregulated genes with computed BMD and CIs. For the rat liver dataset, the standard approach identified enrichment in one GO:BP, four KEGG, and one WP biological function, involving 69 genes, whereas 687 genes did not participate in any of these enrichments ([133]Supplementary Table S4). Cluefish recovered all functions detected through the standard approach, while also adding 16 GO:BP, 64 KEGG, and 7 WP functions (Fig. [134]2C). Cluefish incorporated an additional 282 genes into enriched biological functions, accounting for 37% of the deregulated gene list (Fig. [135]2D). Of the 69 genes participating in enrichment found by the standard approach, 64 were also recovered by Cluefish, leaving 5 genes unique to the standard approach. Full details on outputs and details of key steps throughout the Cluefish application of the rat liver dataset can be found in [136]Supplementary Tables S5–[137]S10. For the poplar root dataset, the standard approach identified enrichment in 13 GO:BP and 1 KEGG pathway, involving 466 genes, while 5886 were not associated with any enriched function ([138]Supplementary Table S11). Cluefish recovered the same KEGG pathways detected by the standard approach and identified 9 additional ones (Fig. [139]2E). Only 4 GO:BP terms were shared between both approaches. Cluefish identified 13 unique GO:BP terms, whereas the standard approach identified 9 unique GO:BP terms. Of the 9 GO:BP terms uniquely identified by the standard approach, some were redundant with terms found by Cluefish, represented under different names and at different levels of the GO hierarchy. For example, while both approaches identified the “translation” (GO:0006412) term, the standard approach additionally identified the “Ribosome” (KEGG:03010) and “ribosome assembly” (GO:0042255) term. These biological terms substantially overlapped with “translation.” Specifically, 93 out of 101 deregulated genes (92%) annotated for “Ribosome” and 26 out of 31 deregulated genes (84%) annotated for “ribosome assembly” were also among the 198 genes annotated for “translation” ([140]Supplementary Table S12). Cluefish led to the consideration of an additional 1778 genes into enriched functions, accounting for 28% of the deregulated gene list (Fig. [141]2F). Of the 466 genes considered by the standard approach, 403 were also considered by Cluefish, leaving 63 genes uniquely considered by the standard approach. Full details on outputs and aspects of key steps throughout the Cluefish application of the poplar root dataset can be found in [142]Supplementary Tables S5, [143]S6, and [144]S13–[145]S16. Exploration of the Cluefish application to the zebrafish dataset Overview of the results From the AnimalTFDB4 database, 3328 Ensembl genes corresponding to known zebrafish TFs and CoTFs were retrieved. Among the 2365 deregulated genes, 286 were identified as coding for these transcriptional regulators. After constructing the PPI network and performing MCL clustering, 204 clusters were formed ([146]Supplementary Table S5). Of these, only 53 clusters passed the size filter, encompassing 399 genes. Cluster-wise ORA identified 41 driver GO:BP, 51 KEGG, and 10 WP-enriched functions. See the [147]Supplementary Table S6 for the cluster-wise ORA results both before and after filter, and the [148]Supplementary Table S17 for the initial ORA results prior to filtering. Several clusters were merged based on shared biological function, reducing the total number of clusters to 44. Following lonely fishing, the number of genes in the newly expanded clusters increased to 904 (37% of the deregulated gene list). The majority of fished lonely genes had a friendliness value of one, two, or three, with the highest value reaching eight. The final number of clusters was 45 (44 expanded clusters + Lonely cluster) ([149]Supplementary Table S18). The size of the lonely cluster was greatly reduced, containing 1461 genes. The functional enrichment of the lonely cluster identified six driver GO:BP, four KEGG, and two WP biological functions ([150]Supplementary Table S19). These functions did not overlap with any from the cluster-wise enrichment. With this added information, over 1026 deregulated genes enriched at-least one biological function, corresponding to 43% of the deregulated gene list. The sensitivity of the 44 expanded clusters can be illustrated using summarized BMD-1SD values for the transcripts within each cluster (Fig. [151]3A). The first quartile represents the value below which 25% of the BMD of the transcripts fall below. According to this summary, the most sensitive clusters #43, #37, #10, and #15 with BMD-1SD first quantile values of ∼1.95, 2.19, 2.23, and 2.62 μg/l, respectively. In contrast, clusters #51, #27, #35, and #38 were the least sensitive, with first quartile values of 33.6, 31.7, 31.2, and 29.8 μg/l, respectively ([152]Supplementary Table S20). Figure 3. [153]Figure 3. [154]Open in a new tab Results of the Cluefish workflow applied to the zebrafish dataset. (A) Summarized BMD 25th quantile values for each cluster, with points ordered from most to least sensitive and sized according to the number of transcripts. (B–E) DR fitted curves and computed BMD points for transcripts in (B) cluster 43, (C) cluster 37, (D) cluster 10, and (E) cluster 15, with colors indicating the trend for each curve shape: dec (decreasing), inc (increasing), and U (U-shaped). Grey dotted vertical lines represent the experimental doses. In-depth exploration of findings In ecotoxicology, the primary focus is on examining the impact of low doses of contaminants in the environment. Prioritizing the exploration of the most sensitive clusters is a reasonable approach in this context. Since the cluster #37 was not characterized, it was excluded from subsequent analysis (Fig. [155]3C). We chose to focus on clusters #43, #10, and #15, as they exhibited the lowest 25th BMD quantile values. Cluster #43 comprised 9 distinct transcripts and genes, with some showing increasing and some showing decreasing trends (Fig. [156]3B). Most of the transcripts BMDs ranged between 1.23 and 7.27 μg/l. This cluster solely enriched the “Retinol metabolism” (KEGG:00830) KEGG pathway, with no GO:BP or WP function enrichments. Cluster #10 was composed of 16 distinct transcripts, with the majority showing decreasing trends, also with similar shaped DR curves (Fig. [157]3D). Ten BMD values ranged between 0.84 and 5.29 μg/l, while the remaining six ranged between 23.1 and 49.4 μg/l. This cluster enriched sphingolipid metabolism functions in GO:BP and KEGG, specifically “sphingolipid metabolic process” (GO:0006665) and “Sphingolipid metabolism” (KEGG:00600), respectively. Cluster #15 comprised 31 distinct transcripts, mostly showing decreasing trends, also with similar DR curve shapes (Fig. [158]3E). In this cluster, 18 BMD values ranged between 0.49 and 7.37 μg/l, while the remaining 13 BMD values ranged between 19.0 and 57.1 μg/l. This cluster enriched one GO:BP (lipid biosynthetic process; GO:0008610), one KEGG (steroid biosynthesis, KEGG:00100), and one WP (cholesterol biosynthesis, WP:WP1387) biological function. None of the three clusters contained known transcriptional regulators. Investigations towards biological interpretation Discovering the retinol metabolism pathway as not only deregulated but also highly sensitive prompted us to investigate the involvement of Hox genes, which are among the primary canonical targets of retinoic acids (RAs) [[159]61–63]. All the Hox genes were located within the lonely cluster, where we identified 13 Hox gene transcripts, all showing increasing trends (Fig. [160]4A). Eleven of these transcripts had BMD values between 2.51 and 6.80 μg/l and very similar DR curves, while the remaining two had values of 28.9 and 40.6 μg/l ([161]Supplementary Table S21). All of these genes act as transcriptional regulators. Figure 4. [162]Figure 4. [163]Open in a new tab DR curves and computed BMD points for deregulated zebrafish transcripts and individual endpoints. (A) Hox gene transcripts. (B) Transcripts associated with eye development and function, with transcripts’ genes from the crystallin family highlighted in pink and other eye-related genes in light grey. (C) Body length measurements. (D) Eye surface area measurements. Grey dotted vertical lines represent the experimental doses. Further exploration of the lonely cluster revealed that the biological functions enriched by the highest number of deregulated genes were “lens development in camera-type eye” (GO:0002088) and “visual perception” (GO:0007601), each enriched by 26 genes. In total, 58 deregulated genes were related to eye development or function (Fig. [164]4B). Among these, 27 crystallin gene transcripts were found, all following decreasing trends, with BMD values ranging from 18.6 to 52.6 μg/l ([165]Supplementary Table S22). Our transcriptomic findings led us to investigate individual endpoints. At the individual level, zebrafish embryos exhibited a dose-dependent reduction in body length and eye surface area, with BMD values of 11.5 and 55.1 μg/l, respectively (Fig. [166]4C and D). See [167]Supplementary Table S23 for the raw left eye area and body length measurements. Discussion Building the road towards the biological interpretation of transcriptomic data series is a challenging task. To address this, we developed Cluefish, a free and open-source, semi-automated R workflow that offers a comprehensive and untargeted exploration of such data. When used alongside DRomics, Cluefish offers a more complete analysis of DR transcriptomic data. This workflow aims to address the limitations of the standard approach towards biological interpretation by leveraging the power of ORA on pre-clustered networks. These clusters act as anchors for ORA, improving enrichment detection sensitivity. This allows us to identify smaller, more specific biological processes while simultaneously forming exploratory gene groups. Cluefish integrates valuable information from various biological function/pathway databases (e.g. GO, KEGG, and WP), the AnimalTFDB4 database of TFs and co-factors for 183 animal genomes, and the STRING database of known and predicted PPIs for 12 535 organisms. Despite drawing from multiple complementary databases, these databases collectively provide a broad coverage of organisms, ensuring that Cluefish is applicable to a wide range of both model and non-model organisms. Innovative features such as cluster merging and the fishing of lonely genes based on shared biological context ensure a more comprehensive analysis. The standard approach identified a limited, overly broad set of enriched functions for the three studied datasets. This limitation stems from the statistical method and the coarse granularity of functional gene sets. As a result, both the qualitative and quantitative interpretation of the data suffer. Additionally, focusing solely on enriched functions is risky, as it may overlook valuable insights from deregulated genes that do not contribute to functional enrichment. Indeed, with the standard approach, a substantial proportion of deregulated genes failed to contribute to any functional enrichment: over 84% in zebrafish, 81% in rat liver, and 93% in poplar root datasets. By analyzing biologically homogeneous gene subsets, Cluefish minimizes the influence of unrelated genes and ensures a more focused analysis. Moreover, by reintegrating lonely genes back into these clusters through shared biological context, Cluefish enhances the biological coherence of each cluster and increases the number of genes available for biological interpretation. Some genes, however, remain lonely and persist in the lonely cluster. These genes should not be dismissed as noise, as their exclusion from PPI-based clusters likely reflects either [[168]1] limited interactions or annotation data for the organism in public databases, or [[169]2] their involvement in non-physical associations, such as transcriptional regulation or functional similarity not captured through physical interaction networks. By explicitly characterizing the biological functions represented in the lonely cluster, Cluefish ensures that no potentially valuable data is overlooked. The effectiveness of the Cluefish workflow is illustrated in our zebrafish case study by the identification of specific clusters, such as clusters #43, #10, and #15, which contain transcripts with high sensitivity values. Importantly, the biological functions tied to these clusters, including retinol and lipid metabolism, were missed by the standard approach, as they were not identified as enriched. Cluster #43 includes genes involved in retinol (vitamin A) metabolism. Our findings suggest a mechanistic hypothesis centred on the disruption of RA signaling. Disrupted retinol metabolism may lead to altered levels of biologically active retinoids, such as 9-cis RA, all-trans RA (ATRA), and 11-cis Retinal, causing significant changes in retinoid signaling in zebrafish embryos. This hypothesis is supported by the deregulation of several ATRA target genes abundance. Hox genes, particularly those located in the anterior part of the embryo, are canonical targets of RA signaling [[170]64, [171]65]. Among the 48 Hox genes in the zebrafish genome [[172]66], we detected 47 in our dataset, with 13 upregulated and homogeneous in DR curve shape, all located in the anterior positions 1–9 of the Hox clusters. Other canonical ATRA target genes are cyp26a1, cyp26b1, and cyp26c1. Cyp26 enzymes are key regulators of ATRA levels control in cells, by transforming ATRA into inactive metabolites. Cyp26 are thus responsible for the prevention of activation of RA receptors by RA in specific cells [[173]65]. Cyp26 coding transcripts are not identified as deregulated in our dataset ([174]Supplementary Fig. S2A–C)). However, cyp26a1 was selected as significantly deregulated along the dose-gradient but no DR model was successfully fitted to the DR curve of this transcript. Visual inspection of expression data for cyp26a1 suggests however a bell-shape response of transcript abundance. Altogether, these responses in abundance of transcripts of canonical ATRA response genes suggest an increased ATRA signaling at the lowest DBP exposure concentrations. RA plays a central role in early patterning and definition of the spinal cord during zebrafish development [[175]67]. Disruption of this pathway has been linked to teratogenic effects, including spinal deformities such as bent spine and malformed tail, highlighting its crucial role in embryogenesis [[176]68, [177]69]. The dose-dependent reduction in body length observed in zebrafish embryos further strengthens this connection. In addition, the crucial role of RA in ocular development [[178]70] can be linked to the identification of 59 deregulated genes related to eye development. The downregulation of crystallin genes—responsible for maintaining lens transparency and refractive power [[179]71, [180]72]—points to potential defects in lens function. The dose-dependent reduction in eye surface area aligns with this, suggesting compromised eye development linked to disrupted retinoid signaling. The sensitive downregulation of genes involved in lipid metabolism and transport (clusters #10 and #15) could also be attributed to the deregulation of retinol metabolism under DBP exposure. 9-cis RA serves as a ligand for the retinoid X receptors (RXRs) that acts as co-receptors for several nuclear receptors involved in lipid metabolism genes regulation [[181]73]. This may otherwise be due to the sensitive downregulation of one of the genes coding for RXRs, rxraa ([182]Supplementary Fig. S3). While it is currently unclear whether rxr genes expression is under direct regulation of retinoids, both this rxraa downregulation at the lowest DBP concentration and retinoid abundance deregulation could directly or indirectly participate to the downregulation of lipid metabolism and transport genes. This hypothesis of DBP mode of action based on retinoid signaling disruption is consistent with findings from previous studies [[183]69, [184]74, [185]75]. This has been previously suggested through screening in mouse stem cell lines [[186]75] and observed in transcriptomic data from adult male fish testes [[187]76] as well as in rodent models [[188]77]. However, most studies on DBP focus on its interaction with the PPAR signaling pathway [[189]78–80]. By applying the Cluefish workflow, our DR transcriptomic data suggests that retinoid signaling, and function disruption may be the most sensitive pathway affected by DBP during zebrafish development, potentially leading to significant morphological changes. RA signaling, while still understudied in the context of endocrine disruption, has been increasingly recognized as vulnerable to pollutants in both toxicological and ecotoxicological settings [[190]81, [191]82]. Given the critical role of RA signaling in embryonic development, which is conserved across both vertebrates and invertebrates, the potential impact of DBP on this pathway deserves closer attention [[192]83]. These findings illustrate how Cluefish can reveal connections that would have been missed by the standard approach. By investigating clusters, we identified links between them and guided us to delve deeper into the lonely cluster, where valuable insights resided, and target individual endpoints. Cluefish enabled us to formulate hypotheses supported by multiple concordant elements that not only make biological sense but also hold a major scientific interest. These insights primarily arise from exploring three of the four most sensitive clusters. While a more comprehensive analysis could uncover additional roads for biological interpretation, a complete biological interpretation is beyond the scope of this paper. An important consideration in biological interpretation approaches for transcriptomic and other omics data is the organism under study. Gene annotation richness varies substantially across species, typically reflecting whether the organism is an established model system with extensive reference databases or a less-studied species with limited annotation. A critical aspect of validating Cluefish’s capacity to enhance biological interpretation was testing its performance across diverse organisms. To assess Cluefish’s versatility and generalizability, we applied it to two externally published DR transcriptomic datasets: rat liver exposed to PFOA and poplar roots exposed to PHE. In the rat liver dataset, cluster #3 emerged as the most sensitive to PFOA exposure. This cluster was characterized by enrichment in cellular response to lipid (GO:0071396), superoxide metabolic process (GO:0006801), positive regulation of tumor necrosis factor production (GO:0032760), and activation of innate immune response (GO:0002218). These findings suggest that even the lowest doses of PFOA may trigger inflammation, oxidative stress, and lipid metabolism disruption. While the original study did not include detailed biological interpretation, our results align with existing literature on PFOA toxicity mechanisms [[193]84–88]. For the poplar root dataset, the standard approach identified several redundant biological functions with substantial gene overlap. In contrast, Cluefish detected more specific processes within similar branches of the GO term hierarchy. In the original study, authors reported enrichment in oxidative stress response, cellular redox homeostasis, defense response, and biotic stimuli response pathways. Cluefish successfully identified these same, or similar, functions but also provided additional resolution. Notably, the authors hypothesized that PHE induces ethylene signaling pathway responses, yet this biological function was not identified as enriched in their analysis nor the standard approach in this study, illustrating the challenges of developing complex hypotheses without prior knowledge of potential mechanisms. Cluefish, however, revealed biological functions directly supporting this hypothesis, including ethylene-activated signaling pathway (GO:0006979), hormone-mediated signaling pathway (GO:0009755), and response to toxic substance (GO:0097237). This demonstrates how Cluefish can reduce the need for extensive manual mining of data functional annotation and allow for a more automatic exploration independent of researcher prior knowledge. The preliminary insights gained from these external datasets demonstrate Cluefish’s potential, though a more comprehensive analysis would be necessary to plan targeted experimental validations and fully elucidate the molecular mechanisms underlying toxicant effects. The Cluefish workflow involves several filters, and selecting appropriate values can be challenging. One key parameter is the lower cluster size filter, which excludes clusters based on gene set size. Users should evaluate the total number of clusters, the number of genes contributing to enrichment, and the number of lonely genes successfully fished. The goal is to find a balance between these metrics and identify the point at which the lonely gene fishing process begins to plateau. Other filters depend on factors such as the organism (from poorly to well-annotated genomes), the type of transcriptomic data (e.g. whole transcriptome versus tissue-specific), and choices made in previous steps. We recommend exploring a range of parameter settings to better understand their effects and to fine-tune the workflow for each specific dataset. Clustering is essential in RNA-seq data analysis, grouping genes or samples by shared traits like expression patterns or network connections [[194]89, [195]90]. A widely used approach is weighted correlation network analysis (WGCNA) [[196]91], which identifies co-regulated genes by constructing co-expression networks from their expression profiles across samples [[197]92]. The underlying hypothesis behind WGCNA is that coregulated genes are functionally related. However, due to the complexity and multi-faceted interconnectivity of organismal pathways, this can result in clusters of co-regulated but functionally unrelated genes. In our approach, DRomics first identifies a list of deregulated transcripts. Then, using STRING within Cytoscape, we group genes based on known and predicted interactions, directly forming clusters of functionally relevant genes without relying on gene expression patterns. This approach avoids clustering functionally unrelated genes and allows us to characterize the clusters and their genes using the results of DR modeling, within a non-targeted approach. PPI network construction and clustering are often applied post-enrichment to visualize relationships between gene sets, using tools like GSCluster [[198]93] and NeVOmics [[199]94]. Other tools like EnrichmentMap [[200]95] and ClueGO [[201]96], both extensions of Cytoscape, help reduce redundancy in biological functions via PPI networks by accounting for both the overlap and interactions between gene products. Recently, clustering PPI networks before enrichment has gained traction, as seen in tools like PathFindR [[202]97] and ANUBIX [[203]98]. This approach improves enrichment sensitivity, uncovering hidden biological mechanisms [[204]99]. Though both tools share the pre-clustering concept with Cluefish, their application to lesser-referenced organisms is limited. ANUBIX uses constrained randomization to evaluate subnetwork relevance, making it optimal for highly comprehensive PPI networks. PathFindR, primarily designed for human datasets, is applicable to non-human organisms but enrichment is limited to KEGG, Reactome [[205]100], and MSigDB databases, with the latter two being restricted to very few model organisms. Cluefish, by contrast, seeks to overcome these limitations by incorporating a functional enrichment tool that supports a broader range of organisms. Nevertheless, challenges still remain for lesser-referenced species. While Cluefish improves traditional ORA, it has some limitations. Cluefish relies on STRING’s PPI data, which can be limiting for transcriptional regulators. For instance, Hox genes, despite being functionally linked, often cluster separately due to a lack of direct interactions [[206]101]. This also affects other transcriptional regulators, many of which fall into the lonely cluster. Additionally, Cluefish remains semi-automated due to a manual Cytoscape step. Full automation could be achieved by integrating this step into R using the RCy3 package [[207]102], but querying large gene lists and clustering complex networks is still computationally challenging. We plan to implement full automation as soon as it becomes feasible. Despite these limitations, enhancing the methods and databases used in Cluefish could be highly beneficial. Firstly, a more comprehensive set of databases would greatly increase the workflow’s versatility and applicability. For instance, expanding PPI data to include lesser-studied organisms would broaden its scope, while extending TFlinks [[208]103]—a database of TF-target gene relationships—to cover a wider range of model organisms would significantly enhance biological interpretation in a mechanistic context. Secondly, a major challenge is the complexity and redundancy of biological functions, particularly in GO terms [[209]104, [210]105]. To reduce redundancy, we applied Alexa et al.’s (2006) method [[211]106], which selects representative terms using the hierarchical structure of the GO tree. This seems to be a more effective approach than semantic similarity methods based solely on gene overlap [[212]106]. KEGG pathways also face redundancy issues, and while some methods address this [[213]97, [214]107], simplifying biological functions remains challenging. Finally, ORA assumes that both genes and pathways are independent, meaning that neither topology nor pathway overlap is considered, which limits its scope. Newer enrichment methods (e.g. topology-based) show promise but are not yet compatible with DR modeling metrics, which Cluefish uses. These methods also require highly complete datasets, limiting their application to well-referenced organisms like humans and mice. To enhance Cluefish’s functionality, tools that handle metrics beyond expression values, like BMD and trend metrics, are needed. To the best of our knowledge, Cluefish is the first workflow to integrate DR modeling with the functional enrichment of pre-clustered networks. This comprehensive approach allows for the creation of biologically meaningful clusters, characterized by sensitivity and trend derived from DR analysis. With innovative features such as cluster merging and lonely gene fishing, Cluefish provides users with a powerful tool for a more thorough interpretation of transcriptomic datasets. By offering clearer insights, it facilitates the generation of biological hypotheses and guiding experimental validation. To further establish Cluefish’s advantages over standard methods and demonstrate its versatility, future work should apply the workflow to additional DR datasets. Testing across more organisms, spanning from poorly to well-annotated genomes, and varying numbers of deregulated transcripts will allow for a deeper evaluation of its robustness and broader applicability. While Cluefish employs DR modeling to identify deregulated transcripts and visually explore the results, its core functionality extends beyond this specific tool or analysis type. Cluefish can process a background list of transcripts along with a sub-list, typically consisting of deregulated transcripts. These deregulated transcripts can be derived from a wide range of experimental designs, from the simplest to the more complex ones, such as time or spatial data series. Therefore, it would be interesting to explore the application of Cluefish to other types of data series. Supplementary Material lqaf103_Supplemental_Files [215]lqaf103_supplemental_files.zip^ (1.6MB, zip) Acknowledgements