Abstract
Background
Cell morphology is a complex and integrative readout, and therefore, an
attractive measurement for assessing the effects of genetic and
chemical perturbations to cells. Microscopic images provide rich
information on cell morphology; therefore, subjective morphological
features are frequently extracted from digital images. However,
measured datasets are fundamentally noisy; thus, estimation of the true
values is an ultimate goal in quantitative morphological phenotyping.
Ideal image analyses require precision, such as proper probability
distribution analyses to detect subtle morphological changes, recall to
minimize artifacts due to experimental error, and reproducibility to
confirm the results.
Results
Here, we present UNIMO (UNImodal MOrphological data), a reliable
pipeline for precise detection of subtle morphological changes by
assigning unimodal probability distributions to morphological features
of the budding yeast cells. By defining the data type, followed by
validation using the model selection method, examination of 33
probability distributions revealed nine best-fitting probability
distributions. The modality of the distribution was then clarified for
each morphological feature using a probabilistic mixture model. Using a
reliable and detailed set of experimental log data of wild-type
morphological replicates, we considered the effects of confounding
factors. As a result, most of the yeast morphological parameters
exhibited unimodal distributions that can be used as basic tools for
powerful downstream parametric analyses. The power of the proposed
pipeline was confirmed by reanalyzing morphological changes in
non-essential yeast mutants and detecting 1284 more mutants with
morphological defects compared with a conventional approach (Box–Cox
transformation). Furthermore, the combined use of canonical correlation
analysis permitted global views on the cellular network as well as new
insights into possible gene functions.
Conclusions
Based on statistical principles, we showed that UNIMO offers better
predictions of the true values of morphological measurements. We also
demonstrated how these concepts can provide biologically important
information. This study draws attention to the necessity of employing a
proper approach to do more with less.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12915-022-01283-6.
Keywords: CalMorph, Morphological profile, Phenome, Saccharomyces
cerevisiae
Background
Morphology is a basic phenotypic characteristic that can be affected by
genetic and environmental perturbations. Consequently, living organisms
have their own morphologies that evolve through natural selection
[[29]1, [30]2]. In addition to the morphology of individuals,
morphology can also be defined at the cellular level. Cell morphology
is frequently related to cellular function. While cell morphology
reflects the behaviors and intercellular communication of cells in a
multicellular organism, it is affected by the genotype and genetic
networks in unicellular organisms [[31]3]. Therefore, morphological
phenotyping of unicellular organisms has been carried out to achieve a
global understanding of the cell system as well as to answer specific
questions in cell biology [[32]4, [33]5].
Quantitative morphological phenotyping (QMP; Fig. [34]1) of cells is
performed using a well-designed procedure [[35]6–[36]8]. Following
precise sample preparation [[37]9], the QMP pipeline begins with image
analysis (Fig. [38]1-01). Morphological information can be obtained by
processing digital images of cells. Morphological features are
extracted using various image analysis tools [[39]10, [40]11]. For
example, CellProfiler is used for analyzing images of the tissues and
cells of humans, fruit flies, worms, and yeast [[41]12], while CalMorph
is used for analyzing images of Saccharomyces cerevisiae cells [[42]9,
[43]13–[44]15]. Increased reproducibility and reduced artifacts are
often achieved using image quality control systems associated with the
image analysis tools [[45]16]. After statistical modeling by
transforming and normalizing data (Fig. [46]1-02), the morphological
data are interpreted as biologically meaningful insights (Fig.
[47]1-03) and shared by the community, finally inspiring collaborative
advancements (Fig. [48]1-04) [[49]17]. Although many data analysis
techniques have been developed, more precise and accurate morphological
analytical methods are still desired for QMP.
Fig. 1.
[50]Fig. 1
[51]Open in a new tab
QMP pipeline. Representative workflow of the main steps for QMP. Each
step represents a wide range of analytical methods and various
approaches that might vary from study to study [[52]16]. The underlined
terms represent the main focus of this paper
Because biological measurements are fundamentally noisy, efforts have
been made to estimate true values in quantitative biology [[53]7,
[54]18, [55]19]. The true value of a measurement can be statistically
estimated with an appropriate probability distribution, which is
defined as the mapping between a measurable space and the unitary
interval. Commonly used probability distributions for biological data
are unimodal: Gaussian, reverse Gumbel, gamma, beta, etc; distributions
for continuous variables, as well as binomial, beta-binomial, Poisson,
hypergeometric, etc.; and distributions for discrete variables
[[56]20]. However, it is uncertain whether real biological data, such
as microscope images, follow a simple probability distribution. If a
probability distribution behaves unimodally in the morphology space, a
precise and accurate estimation of the true value of a morphological
measure can be obtained (Fig. [57]2).
Fig. 2.
[58]Fig. 2
[59]Open in a new tab
Possible morphological features used in QMP. A Morphological
phenotyping includes a wide range of research topics and, in each
field, includes various morphometric, densitometric, and
structural/spatial features. B Examples of different types of
distributions (bell-shaped, skewed, and multimodal) that can be found
in morphological measurements are shown. For accurate biological
inference, these characteristics must be properly addressed. This
figure was designed using resources from [60]www.freepik.com
To precisely estimate the true values of morphological data, we
attempted to assign the best-fit unimodal probability distribution
model for each morphological measure. After appropriate probability
distribution models were logically assigned to each of the 501
subjective morphological features defined in S. cerevisiae [[61]9], we
validated them using the Akaike information criterion (AIC) [[62]21].
We then examined the modality of the distribution of the morphological
features. We used existing approaches and R packages to integrate these
sequential steps into a single logical pipeline (UNImodal Morphological
[UNIMO)] data). Through the successful assignment of unimodal
probability distribution models, we can run a powerful parametric
approach, providing new biological information that can be masked in
commonly used methods. Our research highlights the importance of
employing proper analytical tools in phenome studies.
Results
Probability distribution in morphological features
The features defined in cell morphology include the size, density,
number, length, distance, and angle (Supplementary Fig. S[63]1A). In
addition, biological features associated with cell morphology include
comparison measurements obtained, for example, by calculating ratios
between two related morphological measurements, measuring variations in
cell morphology, and comparing proportions of specific types of cells
in the population (Supplementary Fig. S[64]1B). The shapes and patterns
of natural organisms (Fig. [65]2A) do not always follow the complete
Gaussian distribution (Fig. [66]2B). Therefore, to anticipate precise
and accurate values in morphometrics, we must define the best
probability distribution in each scale (Fig. [67]3).
Fig. 3.
[68]Fig. 3
[69]Open in a new tab
Various data types used for quantitative morphological phenotyping.
Morphological measures were divided into five types. Continuous
non-negative, bounded, real, and discrete finite measures were found in
population-level studies, whereas discrete infinite measures were found
in single-cell studies. Each type has a clear definition and distinct
characteristics. Example possible distributions are sorted
alphabetically. Bold font indicates the best-fitted distributions in
this study. For a detailed list, see Supplementary Table S[70]1
Many types of morphological measurement data [[71]22], such as cell
size, brightness, and length, and gradient-based image descriptors used
in the field of computer vision [[72]23], represent continuous
non-negative measures (0 ≤ y; Fig. [73]3), generally showing skewed
distributions. Of the 501 parameters used in the yeast CalMorph system,
183 parameters are non-negative values with skewed shapes where gamma
and Weibull distributions fit well; 177 and 6 parameters, respectively,
based on AIC model selection (Supplementary Fig. S[74]2A; Supplementary
Table S[75]1A).
Ratio measurements, such as the axis ratio, represent the second data
type, continuous bounded values between zero and one (0 ≤ y ≤ 1; Fig.
[76]3), and are also mainly skewed-like beta distributions of the first
kind (hereafter beta distribution). Among the CalMorph parameters, 37
parameters are expressed as ratios and can follow the beta (36
parameters) and logit-normal (one parameter) distributions
(Supplementary Fig. S[77]2B; Supplementary Table S[78]1B).
Morphological noise can be defined by uncoupling the dependency between
the coefficient of variation (CV) and the mean values [[79]24, [80]25].
Continuous positive and negative real values can be obtained (−∞ < y <
∞) (Fig. [81]3) since the data are transformed around the locally
estimated scatterplot smoothing fitted curve at zero (Supplementary
Fig. S[82]3A). As a result, the noise values can potentially generate a
Gaussian distribution. There are 220 noise parameters in the CalMorph
system, and the Shapiro–Wilk normality test revealed that 209
parameters are Gaussian distributed (P < 0.05, after Bonferroni
correction; Supplementary Fig. S[83]3B; Supplementary Table S[84]1C).
Further statistical analysis with the AIC revealed that among the
remaining 11 parameters, five and six parameters fit logistic and type
I extreme value (i.e., reverse Gumbel) distributions, respectively
(Supplementary Fig. S[85]2C; Supplementary Table S[86]1C).
The proportions of discrete count data, such as the proportion of a
specific type of cells, take finite values between zero and one (0 ≤ y
< 1; Fig. [87]3). This type of data can follow either binomial or
beta-binomial distributions, depending on the constantness of the
stochastic event. There are 61 proportion parameters in the CalMorph
system. To know which distribution fits the morphological dataset
better, we performed analysis with the AIC and determined that 38 and
23 CalMorph parameters follow beta-binomial and binomial distributions,
respectively (Supplementary Fig. S[88]2D; Supplementary Table S[89]1D).
We employed the AIC to show that the reference distributions identified
previously [[90]14], including gamma, beta, Gaussian, and binomial
distributions for 177, 36, 209, and 23 CalMorph parameters,
respectively, are better fitted than all other distributions
(Supplementary Table S[91]1). Thus, by defining the data type, followed
by validation with the model selection method, the morphological
parameters of CalMorph consist of four data types that can be defined
by nine distributions, namely, gamma, Weibull, beta, logit-normal,
Gaussian, logistic, reverse Gumbel, beta-binomial, and binomial
distributions (Supplementary Table S[92]1).
Data modality
The modality of the distribution refers to how many modes exist in the
distribution. It reflects the complexity of the distribution as well as
the mixed populations generating the distribution. Therefore, a simple
statistical approach can only be applied to the unimodal distribution
[[93]26]. To estimate the true values during QMP, unimodal
distributions are preferable.
To clarify the modality of the distribution for each morphological
feature, we used a reliable and detailed set of experimental log data
of wild-type morphological replicates. We also used a probabilistic
mixture model to check the modality of the predefined probability
distributions. We confirmed the most likely number of components
(herein, we used 1 ≤ c ≤ 10) using the Bayesian information criterion
(BIC; Supplementary Fig. S[94]4) and iterations of randomization (see
[95]Supporting Text). Each of the 183 non-negative CalMorph parameters
was subjected to gamma or Weibull mixture-model-based clustering. We
found that 150 parameters showed a unimodal distribution. Likewise, 32,
197, and 52 parameters were unimodal out of 37 ratio, 220 noise, and 61
proportion parameters, respectively. In total, 431 (86%) of the 501
CalMorph parameters were unimodal (Fig. [96]4A; Supplementary Table
S[97]2).
Fig. 4.
[98]Fig. 4
[99]Open in a new tab
Modality analysis of the CalMorph morphological parameters. A
Mixture-model-based clustering was used to check the modality of 501
CalMorph parameters according to the data types of (i) non-negative,
(ii) ratio, (iii) noise, and (iv) proportion measures. Using 114
replicates of wild-type yeast cells revealed 431 unimodal and 70
multimodal parameters. B Confounding factors due to the differences in
the experimental conditions and outliers were further considered as
multimodal behaviors. A GLM was introduced by constructing a linear
model (one-way analysis of variance) of the confounding factors for
each parameter. Outlier data points were removed after defining a
threshold for each parameter (one-percentile rule). As a result, 490
unimodal parameters were detected
There were still 70 (14%) CalMorph parameters whose distributions were
multimodal. Because multimodality of a distribution can be caused by
confounding factors and outliers, we investigated these possibilities
in detail. We considered a group of five confounding factors, including
a combination of different filters for microscopy and the image
acquisition period, based on experimental logs (Fig. S[100]5, inset).
We introduced a generalized linear model (GLM), given the probability
distribution for each parameter (Supplementary Table S[101]1), of the
confounding factors (Supplementary Fig. S[102]5) and examined the
multimodality again. In addition, we removed outlier data points after
defining a threshold (one-percentile rule; Supplementary Fig. S[103]6).
As a result, the cumulative unimodal parameters summed up to 180, 37,
212, and 61 for non-negative, ratio, noise, and proportion parameters,
respectively (Fig. [104]4B), increasing the unimodal frequency to 97.8%
(490 out of 501). Thus, most of the CalMorph parameters exhibited
unimodal distributions that can be used as the basic tools for further
statistical analyses. Compared with the Box–Cox power transformation
used in our previous study [[105]9], UNIMO offers many more
morphological parameters (Supplementary Fig. S[106]7, Supplementary
Table S[107]3).
Application of the proposed pipeline
To assess the power of the proposed UNIMO pipeline, we determined the
number of morphologically distinct mutants by analyzing morphological
changes of the 4708 haploid non-essential yeast mutants [[108]9].
Employing 490 unimodal CalMorph parameters, we found that a total of
3522 mutant strains exhibited differences from wild-type cells in at
least one parameter (Fig. [109]5; false discovery rate [FDR] = 0.01;
Supplementary Table S[110]4). Compared with our previous analysis
[[111]9] in which we found 2390 mutants with abnormal phenotypes, this
study showed ~1.5 times greater power (FDR = 0.01; Supplementary Fig.
S[112]8). We also estimated the rate of false positives, which
confirmed that the number of abnormal mutants was not overestimated in
this analysis. These analyses suggested that our approach is precise
enough to capture subtle morphological changes.
Fig. 5.
[113]Fig. 5
[114]Open in a new tab
Morphological phenotyping of yeast non-essential mutants. A Histograms
showing cell sizes at the M stage (C101_C); 4708 mutants and 109 his3
replicates are shown in black and yellow, respectively. Dashed lines
indicate significant thresholds for the UNIMO (purple) and Box–Cox
transformed (orange) methods at FDR = 0.01. Example microscopy images
of yeast cells are shown: actin, cell wall, and nucleus are shown in
red, green, and blue, respectively. White arrows point to cells
detected in the M stage by CalMorph. The unit for size was the number
of pixels squared; for details, see the CalMorph user manual. B Yeast
non-essential genes important for morphology are identified by applying
UNIMO and Box–Cox transformation
Next, we investigated how important biological information can be
obtained by using the UNIMO pipeline in combination with canonical
correlation analysis (CCA). CCA was used to explore the relationships
between two multivariate datasets (in this case, morphological
phenotypes and gene functions). We compressed all combinations of 490
CalMorph parameters and 3127 Gene Ontology (GO) terms into linear
combinations of phenotypic (32 phenotype canonical variables [pCVs])
and gene function (32 gene canonical variables [gCVs]) features (P <
0.05; Supplementary Fig. S[115]9; see the “[116]Methods” section). We
used the correlations between morphological canonical variables to
construct global functional maps of the non-essential yeast genes.
Based on the relationships, we systematically mapped 2915 non-essential
genes belonging to 130 GO terms (Supplementary Table S[117]5). We
observed 44 core gene groups containing 1900 non-essential genes with
various functions, such as ion homeostasis (GO:0050801; 80 members),
structural constituents of ribosome (GO:0003735; 90 members),
generation of precursor metabolites and energy (GO:0006091; 57
members), and carboxylic acid biosynthetic processes (GO:0046394; 54
members); these served as hubs in the decentralized network and showed
more significant associations than the others (Fig. [118]6A and
Supplementary Fig. S[119]10A). When the dendrogram was constructed on
the basis of the proportion of significant correlations, core gene
groups were not clustered (Fig. [120]6B and Supplementary Fig.
S[121]10), implying that even gene groups with a small number of
associations play important and diverse roles through their
interconnections. Our analyses using UNIMO and CCA provided an overview
of the functional relationships between large numbers of non-essential
genes based on morphological phenotypes.
Fig. 6.
[122]Fig. 6
[123]Open in a new tab
Overview of functional relationships between nonessential yeast genes.
A Graphical representation of 2915 nonessential yeast genes with
morphological defects and their functions is shown using the Spring
layout. This network shows the similarities in phenotypes between
pairs, calculated using 32 pCV scores and expressed as correlation
coefficients (see the “[124]Methods” section). Of the 130 functional
groups (P < 0.05, after Bonferroni correction; Supplementary Table
S[125]5), 45 (orange and red dots) and 83 (turquoise and red dots) were
identified as core and dense groups, respectively. Additionally, 44
groups (red dots) were identified as both core and dense groups. Of
these 44 groups, most related GO terms belonged to 19 groups. Numbers
in parentheses represent the number of group members. B Pairwise
phenotypic correlation coefficients between functional gene groups.
Significant correlations are shown as colored cells. Black cells
indicate no significant correlation. The dendrogram was generated based
on the proportion of significant correlations. The bar plot indicates
the number of pairs with significant similarity. The 44 core and dense
groups are shown in yellow (Supplementary Table S[126]5)
We also explored the cellular pathways to evaluate the effect of the
additional biological information obtained by the UNIMO pipeline. The
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
analysis, using significantly abnormal mutants identified by the UNIMO
method (n = 3522; FDR = 0.05), included more categories compared to the
Box–Cox method (n = 2390; FDR = 0.05), such as metabolic pathways,
autophagy processes, and protein processing (Supplementary Fig.
S[127]11 and Supplementary Table S[128]6). This indicates the
importance of the additional information obtained by UNIMO.
We mapped the newly detected cellular pathway by UNIMO to the global
functional maps and found that the map contained information at
multiple resolutions. For instance, we mapped 32 autophagy genes (Group
73; Supplementary Table S[129]5) on the functional map at a higher
resolution (Supplementary Fig. S[130]12A), which enabled the detection
of common morphological defects (Supplementary Fig. S[131]12B). We
found that atg17∆ cells were the most defective cells, exhibiting
significant differences in 90 morphological parameters (Supplementary
Table S[132]4). Among the atg mutants in this group, the morphology of
atg11∆ cells were similar to that of the ATG17 knockout cells (r =
0.48, two-sided t test, P < 0.01). We further investigated the
morphological similarities between atg11∆ and atg17∆ mutants
(Supplementary Fig. S[133]12C) in which gene deletions were
characterized by altered mother cell shapes (C115_A, C115_A1B, C103_A,
and C103_A1B), larger bud size (C11-2_C and C12-2_C), altered bud shape
(C107_C), and altered nuclear morphology (D134_C, D185_C, and D186_C).
Cell shape (C115_A; whole-cell axis ratio at the G1 stage) was
similarly elongated in 20 mutants (green nodes in Supplementary Fig.
S[134]12B), indicating that these atg mutants had common morphological
characteristics.
We successfully applied UNIMO to an archive morphological dataset of
yeast non-essential genes and detected subtle morphological
abnormalities that were ignored in our previous study [[135]9].
Visualizing the global functional network and pathways enrichment
analysis also provided novel biological insights.
Discussion
The current study presents the reliable UNIMO pipeline for
understanding the true values of morphological measures by assigning
nine unimodal distributions to their probability distributions. After
possible distributions were assigned to the morphological measures, the
distribution models were validated with the AIC. We succeeded in
preparing unimodal probability distributions for most of the yeast
morphological features to ensure parametric analysis is applied to
obtain biologically important information in the downstream analyses.
This was conventionally overlooked previously, which can actually
confine the power to distinguish morphological variations.
Multimodality may instead lead to misinterpretation of the observed
phenomena and affect the reproducibility of the results at certain
points.
Advantages of using unimodal probability distributions
Once the unimodal probability distribution models are defined for the
morphological measures, any appropriate parametric methods can be
employed in the downstream analyses. Application of parametric
approaches allows for sensitive morphological distinctions: we used a
relevant approach to identify more than half of the essential yeast
genes as morphological haploinsufficient genes [[136]14]. Another
advantage is that we can use the GLM, an extension of the normal linear
model [[137]27], for hypothesis testing. Furthermore, we can employ
various machine learning approaches to recognize, predict, understand,
and obtain data/knowledge. In this way, parametric approaches can be
applied in the future to perform correlation analyses to compare
morphologies [[138]28], classification analyses to distinguish
categories based on morphology [[139]29], prediction analyses to
identify similar morphologies [[140]30], factor analyses to explore
potential common factors [[141]31], analysis of morphological diversity
for breeding purposes [[142]32, [143]33], and analysis of sources of
bias in microfluidic cell culture research [[144]34].
Various probability distributions fitted to the morphological data
We showed that our QMP data contained four different data types that
can well be explained by nine distributions. Defining the best
probability distribution model is one of the most important factors in
this study. However, we believe that the same nine probability
distributions may not always be selected during QMP. For example,
non-negative values may form heavily skewed probability distributions
(e.g., inverse Gaussian). At the single-cell level, Poisson or
negative-binomial distributions are commonly used for count analyses.
Eventually, for non-negative, ratio, and proportion measures, various
inflated distributions (zero-, one-, or both, according to the
distribution) should also be considered to accommodate special cases in
which there is a mixture of discrete values and a continuous
distribution. Thus, while this study provides a solid foundation for
phenome studies, researchers may need to identify the best-fitting
model to accommodate specific morphological measures.
Unimodal distribution of the morphological measures
Data modality is another concept that we took into account in this
study. Since simple explanations are preferred until the data justify a
complex model, a unimodal distribution is better than a multimodal
distribution a priori. This is the principle of Occam’s razor that
entities should not be multiplied needlessly. But more practically, a
unimodal probability distribution is better because it can be employed
for parametric analysis, providing a simple assumption to detect
significant changes and improve accurate interpretation of the results.
We found that most of the probability distributions of the yeast
morphological features are unimodal. However, unimodal features may not
be always prepared. The reason why this study was successful is that
many single-cell parameters of CalMorph exhibited unimodality
[[145]35], according to the automatic classification of cells based on
the cell cycle stage [[146]9]. We believe that an important benchmark
is the number of unimodal, single cell parameters when applying UNIMO
to any image processing tool. In the case of multimodality, we can use
distinct cell attributes, such as the cell cycle stage [[147]9], cell
shape [[148]16], and cell type [[149]36, [150]37]. Furthermore, it is
essential to consider the effects of confounding factors. The 11
remaining multimodal parameters (Supplementary Fig. S[151]13) may be
either unknown confounders or multimodal as a property of the
morphological features in the first place.
Finally, to illustrate the impact of this study, we thought it might be
possible to apply UNIMO to archived imaging data. However, affirming
this is challenging because labs do not publicly share morphological
measures of standard data sets with experimental logs alongside the
perturbed cells. The [152]Supporting Text, Supplementary Fig. S[153]14,
and Supplementary Table S[154]8 elaborate on the versatility of this
pipeline using a hypothetical standard data set of morphological
measures obtained by CellProfiler [[155]12] and using data from
[[156]38]; Supplementary Fig. S[157]15 and Supplementary Table S[158]9
show the results.
Conclusion
The estimation of true values is the ultimate goal in quantitative
morphological analyses. Due to its simple mathematical and
computational specifications, biologists tend to first try the Gaussian
distribution, although morphological features are not always normally
distributed. Non-normal data are transformed to obtain approximately
normal distributions, but the first choice after normalization failure
is a non-parametric method. Nevertheless, to detect subtle differences,
higher statistical power is desirable; therefore, the application of
parametric approaches allows for clearer morphological distinctions. In
this study, we demonstrated a better prediction method of the true
values of morphological measurements using UNIMO. We also demonstrated
how these concepts can provide biologically important information. Our
study offers a framework for future phenome studies and enables further
development of a typical QMP pipeline.
Methods
Selection of the probability distribution for each morphological measure
We determined models of the probability distributions for each of the
501 morphological parameters to accommodate the statistical model used
in the GLM [[159]20]. The AIC was employed to find the best-fitted
model. Figure [160]3 and Supplementary Fig. S[161]1 present the logic
behind the procedure. More details are also described in the
[162]Supporting Text.
Population level modality check
We used mixture-model-based clustering [[163]39, [164]40] to check the
modality of 501 CalMorph parameters according to predefined
distributions (Supplementary Table S[165]1). The BIC was used to
compare the primary probability models that differ in the number of
components (c); a mixture of 1 ≤ c ≤ 10 distributions was tested.
Supplementary Fig. S[166]4 illustrates a flowchart of the modality
check. Further explanations are provided in the [167]Supporting Text.
Tracking morphological variations in the non-essential gene mutants using the
proposed pipeline
To test the effectiveness of the proposed pipeline, we reanalyzed
morphological variations in the 4718 non-essential yeast gene mutants
[[168]9]. Of the 4718 open reading frames (ORFs), 4 (YAR037W, YAR040C,
YGL154W, and YAR043C) were deleted from the Saccharomyces Genome
Database (SGD; [169]https://www.yeastgenome.org/), and 6 were merged
with other ORFs (Supplementary Table S[170]4). Ultimately, we compared
4708 mutants to a dataset of haploid wild-type yeast strains (his3; 109
replicates [[171]41]) using UNIMO. Models of the probability
distributions of the 490 UNIMO parameters were used to accommodate the
pre-defined statistical models (Supplementary Table S[172]3). We
calculated the P value as the deviation of each mutant from the
wild-type strains based on the estimated two-tailed probability
distribution using the distribution probability functions in the gamlss
package (e.g., pGA, pBE, pNO, and pBI functions). For each function,
the P value was calculated from the mean and dispersion (if applicable;
for example, binomial distribution does not consider dispersion) of the
model fitted to the wild-type population. The FDR, i.e., the rate of
type I errors for the rejected null hypothesis due to multiple
comparisons, was estimated using the q value function in the q value
package [[173]42]. Ultimately, we estimated the number of mutants with
significant changes, if any, and abnormal morphology in at least one
parameter (FDR = 0.01).
Z value transformation
To estimate Z values, after the maximum likelihood estimation (MLE) had
converged, morphological data were transformed to Z values using the
Wald-test (one-sample two-sided test using the summary.gamlss function
of the gamlss package), where the Z value in the ith parameter of the
jth mutant is [[174]20]:
[MATH: Zij=
βij-βi0SE
ij :MATH]
Here, β[ij] is the MLE of the jth mutant, β[i0] is the MLE of the null
distribution (109 replicates of the haploid wild-type yeast strains),
and SE[ij] is the standard error. For the beta-binomial distributed
parameters (Supplementary Table S[175]3D), if the numerator of a mutant
was equal to estimated value from the wild-type population, the Z value
was set to zero; otherwise, it was set to the maximum value of the
other mutants.
Construction of the morphological-functional network
Gene Ontology annotation
For the GO annotation analysis, we downloaded the basic version of the
GO file from the GO Consortium ([176]http://geneontology.org/) and gene
annotations from the SGD. A Boolean matrix of GO terms of 3522 genes
with abnormal morphologies (UNIMO; FDR = 0.01) was generated as the
functional profile (if a gene was annotated by a GO, the value was 1;
otherwise, it was 0). We selected 3127 GO terms, each of which was
annotated for more than two genes (i.e., removing unique terms) and
less than 200 genes (i.e., removing global terms) in the genome with no
identical sets of annotated genes. We excluded 577 genes that were not
annotated by the 3127 GO terms (i.e., removing genes with no functional
reports).
Canonical correlation analysis
We used CCA to reduce the dimensions from 490 parameters and identify
biologically important morphological features. For this purpose, we
used the Z values of 2945 genes as the morphometric profiles (i.e.,
2945 × 490) and the Boolean matrix of GO terms as the functional
profile (i.e., 2945 × 3127). To reduce the dimensionality, we subjected
the morphometric profiles to principal component analysis (PCA; prcomp
function of the R stats package). The initial 146 PCs (hereinafter
referred to as phenotype principal components; pPCs) accounted for more
than 95% of the data variation (i.e., the cumulative contribution
ratio; CCR). Next, to estimate the functional relationships among the
2945 genes, we used the structure of 3127 GO terms. The dimensionality
of the functional profiles was reduced by PCA according to [[177]43],
with some modifications. Briefly, we applied PCA on the Boolean matrix
of all genes (i.e., 2945 × 3127). This approach reduces the
dimensionality while preserving the structure of the functional
relationships among the genes. The first 1474 PCs (hereinafter referred
to as GO term principal components; gPCs) explained 99% of the data
variation, indicating that approximately 1474 gene functions were
related to the 2945 genes.
After projection of the Z values on pPCs and a zero matrix on gPCs for
109 replicates of the wild-type population, we applied CCA to 146 pPCs
(CCR = 95%) and 1474 gPCs (CCR = 99%). The significance of the
canonical correlation coefficients was tested using Bartlett’s
chi-squared test with the significance level set at P < 0.05 [[178]44].
Ultimately, 32 morphological features (pCVs) and 32 gene function
features (gCVs) were obtained (Supplementary Fig. S[179]9).
Pairwise canonical correlation analysis
We divided the genes into functional groups with no common term. The
2945 nonessential genes with at least one GO annotation by the 3127 GO
terms were clustered into disjunctional functional gene groups using
common GO annotations. The binary distance between each pair of genes
was calculated on the basis of a Boolean matrix of 3127 GO terms and
applied for hierarchical cluster analysis by the complete linkage
method (i.e., minimum ratio of the different genes between clusters;
hclust R function). Using “static branch cutting” at a height < 1, 150
gene groups were identified, each of which contained 1–92 genes. We
selected groups with at least two members (130 groups). To assign the
most appropriate GO terms to each gene group, enrichment of the GO
terms was analyzed using Fisher’s exact test (P < 0.05, after
Bonferroni correction). In 122 of the 130 groups, more than one GO term
was enriched. The remaining groups were therefore identified as
functional gene groups with no GO terms in common. For each of the 122
groups, the GO term detected using the lowest P value was selected as
the representative function (Supplementary Table S[180]5).
Next, we calculated pairwise correlation coefficients between the
functional gene groups. To detect significant relationships between the
gene groups, we performed pairwise CCA between arbitrary pairs of 130
gene groups ([130]C[2] = 8385) using 32 pCV scores. To prevent
detection bias, we used a smaller number of genes than the number of
pCVs by reducing dimensionality of the genes after applying PCA to the
data of nonessential genes. For pairwise CCA, we applied CCA to pCV
scores using genes and/or selected PCs as variables and extracted
canonical variables of the gene deletion mutants (mCVs) as independent
components with correlations between gene groups. We tested the
significance of the canonical correlation coefficient of the first mCV
at P < 0.0005, after Bonferroni correction using Bartlett’s chi-squared
test [[181]44]. Among 8385 pairs in 130 gene groups, 472 had
significant relationships between groups.
Visualization of the phenotypic correlations
The similarity of phenotypes between pairs (among the 2945 nonessential
gene deletion mutants) was calculated using 32 pCV scores and expressed
as correlation coefficients (r). To visualize the network of the 122 GO
term-enriched gene groups (2890 genes) with significant relationships
to other groups, we used the qgraph package and Spring layout
[[182]45]. We populated the matrix of pCV score-based correlation
coefficients after zero-filling cells using the qgraph function, when
no significant correlation was detected in mCV1 between groups or the
absolute value of the correlation coefficient between genes of
different groups was not the maximum.
Core gene groups
We identified core gene groups with significant correlations. We
divided the groups into two clusters by applying Poisson
mixture-model-based clustering and identified groups having significant
correlations with nine or more other groups as core groups
(Supplementary Fig. S[183]16A).
Dense gene groups
We identified dense gene groups, in which the genes were closer to each
other in the correlation network. We divided the groups into two
clusters by applying gamma mixture-model-based clustering to the
average distance between arbitrary pairs of genes in the
two-dimensional network and identified the groups with an average
distance between genes of < 0.2 as dense groups (Supplementary Fig.
S[184]16B).
Pathway enrichment
Functional categories of significant ORFs detected by the UNIMO and
Box–Cox transformation methods (3522 and 2390 ORFs, respectively) were
enriched in the KEGG database using clusterProfiler package [[185]46]
at FDR = 0.05.
Supplementary Information
[186]12915_2022_1283_MOESM1_ESM.zip^ (6.8MB, zip)
Additional file 1: Supplementary text. Supplementary Figure S1.
Possible values of cellular features used in QMP. Supplementary Figure
S2. Probability models for CalMorph measures. Supplementary Figure S3.
Conversion of the CV to noise values. Supplementary Figure S4.
Flowchart of the methodology for checking the modality of the CalMorph
parameters. Supplementary Figure S5. An example of the effect of
confounding factors on modality. Supplementary Figure S6. An example of
the effect of outliers on modality. Supplementary Figure S7. Comparison
of this study with our previous results. Supplementary Figure S8.
Comparison of results between UNIMO and Box-Cox transformed methods.
Supplementary Figure S9. Canonical correlation analysis used for
extraction of 32 pairs of canonical variables. Supplementary Figure
S10. Phenotypic similarity network of non-essential genes.
Supplementary Figure S11. Enrichment of KEGG categories. Supplementary
Figure S12. Morphological defects of autophagy mutants. Supplementary
Figure S13. Multimodal CalMorph parameters. Supplementary Figure S14.
Outlines of generalization of UNIMO. Supplementary Figure S15. An
example of versatility of UNIMO. Supplementary Figure S16. Defining
characteristics of the 130 functional groups. Supplementary Table S1.
Selection of the probability distribution for each morphological
parameter; non-negative (A), ratio (B), noise (C), and proportion (D)
measures. Supplementary Table S2. Population level modality check for
each morphological parameter; non-negative (A), ratio (B), noise (C),
and proportion (D) measures. Supplementary Table S3. The best
probability distribution and final modality for each morphological
parameter; non-negative (A), ratio (B), noise (C), and proportion (D)
measures. Supplementary Table S4. List of non-essential mutants.
Supplementary Table S5. List of the representative GO terms enriched in
each functional group. Supplementary Table S6. KEGG pathway enrichment
(FDR = 0.05). Supplementary Table S7. Data types and probability
distributions employed for a typical quantitative morphological
phenotyping (QMP) experiment. Supplementary Table S8. List of
measurements extracted by CellProfiler. Supplementary Table S9. An
example of generalization and versatility of UNIMO using morphological
data presented in Mattiazzi Usaj et al. (2020).
Acknowledgements