Abstract
Deep learning techniques hold immense promise for advancing medical image analysis, particularly in tasks such as image segmentation, where precise annotation of regions or volumes of interest within medical images is crucial but laborious when performed manually and prone to interobserver and intraobserver variability. Deep learning approaches could therefore provide automated solutions for such applications. However, the potential of these techniques is often undermined by challenges in reproducibility and generalizability, which are key barriers to their clinical adoption. This paper introduces the RIDGE checklist, a comprehensive framework designed to assess the Reproducibility, Integrity, Dependability, Generalizability, and Efficiency of deep learning-based medical image segmentation models. The RIDGE checklist is not only a tool for evaluation but also a guideline for researchers striving to improve the quality and transparency of their work. By adhering to the principles outlined in the RIDGE checklist, researchers can help ensure that their segmentation models are robust, scientifically valid, and applicable in a clinical setting.
Introduction
Medical images are widely used for diagnosis, monitoring, and treatment planning. To analyze these images, a region of interest (ROI) or a volume of interest (VOI) is often manually contoured by a medical expert. In diagnostic imaging, lesion segmentation enables the determination of the area or volume of a lesion of interest, including serial follow-up. It also excludes areas of normal adjacent tissue that do not provide information helpful for patient management. This clear delineation of the ROI/VOI is valuable both in oncology and for multiple non-oncologic applications. Segmentation is also an integral part of radiation therapy planning. Furthermore, lesion segmentation could form the first step of more advanced lesion classification tasks using traditional radiomics or deep learning approaches. However, this process is tedious, time-consuming, and prone to interobserver and intraobserver variability [1,2,3,4]. Because such expertise is scarce and in high demand, semi-automated or automated approaches can accelerate segmentation and free expert time. Therefore, segmentation algorithms with the potential to be deployed in clinical settings are of high value. Automated or semi-automated segmentation approaches that are seamlessly integrated into the clinical workflow can not only accelerate research but also be of value in clinical practice, enabling more routine incorporation of volumetrics, in addition to providing the first step in more complex AI-assisted classification tasks.
Despite the many segmentation models proposed in the literature, relatively few have been deployed for clinical applications. The lack of generalizability and reproducibility of the published methodologies has been one significant obstacle to the widespread adoption of these technologies in clinical settings [5]. Many studies have highlighted that methodological deficiencies are common issues preventing the reproducibility and generalizability of prediction studies [6,7,8].
Several guidelines have been proposed to improve the reproducibility of predictive models in biomedical science. TRIPOD was proposed as a guideline for facilitating transparent reports of studies using multivariable prediction models for diagnosis and prognosis [9]. CONSORT was also proposed as a set of guidelines for reporting randomized controlled trial results [10]. Inspired by CONSORT, the STARD guideline was developed to enhance the reporting of diagnostic accuracy studies [11]. CLAIM was adapted from STARD to extend it to AI applications for medical imaging, including image classification, image reconstruction, and workflow optimization [12]. Although these guidelines contribute to improving the quality of reports on medical image analysis, they are not tailored for segmentation studies. Further, they mainly focus on reproducibility and provide little to no insight into developing generalizable models. As such, they lack specific guidelines to ensure the generalizability of segmentation approaches.
Inspired by the aforementioned guidelines, this paper provides a checklist to increase the reproducibility and generalizability of machine learning-based segmentation models. The checklist aims to help authors develop more generalizable and reproducible segmentation models. It also assists reviewers in better evaluating academic manuscripts on medical image segmentation.
RIDGE Checklist
The Introduction Section
I-1: Background, Purpose, and How the Segmentation Model Will Be Integrated Into Clinical Workflow
The introduction section should include the clinical and scientific background required for understanding the study and its potential applications and impact. The authors should state the clinical question they wish to address and the current standard of care for the ROI/VOI segmentation. If applicable, state-of-the-art approaches should be mentioned, as well as their drawbacks. Furthermore, the intended use of developed models or methodologies and how they contribute to the clinical workflow should be explained. Lastly, although not an absolute requirement, discussing regulatory considerations (e.g., FDA in the USA), integration into the clinical workflow, and automated reporting of key results would enhance the value of a manuscript, when applicable.
I-2: Study Objectives Regarding State-of-the-Art Segmentation Models
Often, the study objective is to address some shortcomings of state-of-the-art segmentation approaches. For example, a study objective could be to develop a segmentation model that significantly outperforms state-of-the-art models so that the resulting model can add value in a clinical setting. Explicitly stating the study objective(s) enables readers to better understand a study’s contribution. This also sets the expectation for reviewers and facilitates manuscript assessment.
The Materials and Methods
M-1: Prospective or Retrospective Study
It should be indicated whether a study has been conducted prospectively or retrospectively. Some segmentation studies have both retrospective and prospective components; for example, a segmentation model can be developed and evaluated retrospectively and then further evaluated prospectively. Such cases should be appropriately documented.
M-2: Objectives for Segmentation Models: Development, Exploration, Feasibility, or Comparison Studies
Whether the focus is on model creation, conducting an exploratory study, assessing feasibility, or a noninferiority trial, clear definitions and distinctions are essential. Setting a transparent goal helps readers and reviewers evaluate the technical and clinical contributions of the study and sets the context for research methodologies, result interpretation, and understanding of potential implications.
M-3: Data Sources, Including Imaging Modality, Treatment Received, and Protocol for Image Acquisition
Comprehensive details about the imaging modality (e.g., MRI and CT) and the corresponding imaging protocol are essential for the reproducibility of a study. It is important to specify if multiple scanners and different acquisition protocols were used. It is also crucial to specify if the experiments utilized pre-treatment images, post-treatment images, or both and to articulate the rationale behind such choices. This information facilitates readers’ and reviewers’ understanding of the utility and clinical relevance of the proposed approach and ensures a fair comparison with state-of-the-art methods.
M-4: Detailed Information Regarding the Sample Size Used in the Study
The sample size should be explicitly mentioned. Additionally, the composition of samples across known subpopulations or subcategories of interest should be detailed. It is essential to report the sample size at the patient or participant level if multiple data points per individual are used. If relevant, the process behind the choice of sample size should be mentioned. Additionally, the sizes or proportions of data used for training, validation, and test sets should be reported.
M-5: Eligibility Criteria
A detailed description of the process for selecting eligible participants should be provided. It is important to state where and when the potentially eligible patients have been identified. It is recommended to present this information through a flow diagram to enhance clarity. This diagram should illustrate the sequential application of each criterion and indicate the exact number of participants remaining after each step. Any potential biases introduced by the selection process and measures taken to address them should also be highlighted. Transparency in reporting key characteristics of the population of interest is not only important for generalizability but also can inform the target population for a given algorithm. A common example is providing information on whether a study included pediatric patients or not, which in turn would inform whether a tool should be used only in adults or whether it can also be used in the pediatric population.
M-6: Detailed Description of Ground Truth Standards to Allow Replication of Image Annotations
It is essential to provide a detailed reference standard that allows medical experts to annotate images without ambiguity. In situations where segmentation boundaries might be subjective and open to varied interpretations by different experts, a comprehensive description should be provided to minimize interobserver variability. When multiple experts are involved in the annotation process, the methodology employed to reconcile discrepancies and arrive at a consensus ground truth should be explained. For certain segmentation tasks, however, some degree of variation is unavoidable; in these cases, the ground truth segmentation should be performed or edited by subspecialty-trained/certified, experienced professionals. Although there is no absolute rule, segmentation from at least three independent experts is typically expected, and some studies in the literature use more than ten expert segmentations for the same patient. It may be acceptable to use one set of segmentations per patient (performed by one or multiple experts) for training the algorithm, but testing or evaluation should ideally compare against multiple expert segmentations as described earlier. If multiple segmentations are used to train an algorithm, then the process for combining them (e.g., use of a majority-vote-generated ROI/VOI) should be clearly reported. For testing or evaluating the performance of a segmentation algorithm, another approach is to report and compare the performance of the algorithm against the range of ROIs/VOIs (and consequently variations) of multiple experts.
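As a concrete illustration of consensus building, the minimal sketch below derives a majority-vote ROI from several expert masks; the random masks are placeholders for real annotations, and more sophisticated approaches such as STAPLE could be substituted.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Three hypothetical expert masks of the same image (placeholders
# standing in for real annotations).
expert_masks = rng.random((3, 128, 128)) > 0.5

# A voxel belongs to the consensus ROI if a majority of experts marked it.
consensus = expert_masks.sum(axis=0) > expert_masks.shape[0] / 2
print(f"Consensus foreground voxels: {consensus.sum()}")
```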
M-7: Justification of Reference Standards for Ground Truth Image Annotations
In case boundaries of VOIs or ROIs are ambiguous, and several choices can be made for specifying them, the rationale for the choice made in the study should be provided. It is imperative to outline any potential impact the chosen reference might have on the segmentation outcomes or the study conclusions.
M-8: Source of Ground Truth Image Annotations; Qualifications and Training Process for Annotators to Generate Accurate Annotations
The authors should describe the qualifications of the annotators, as well as any training or preparation provided to them before contouring the images. When multiple annotators contour an image, the method for handling discrepancies between their annotations should be described. It is also important to state whether the contours were created manually or in a semi-automated manner, where an algorithm creates rough contours that are then manually edited. Lastly, as mentioned earlier, some degree of variation is unavoidable in medical segmentation tasks; in these cases, the use of multiple expert contours is optimal to ensure reliability and generalizability.
M-9: Tools Used for Image Annotation
Information about the tool(s) used for image annotation should be provided, including the name and version of the software and the underlying operating system. If the annotation software provides multiple contouring tools, the specific ones used for image annotation should be listed. When a semi-automated approach is adopted for contouring, details of the method for automatically generating the initial contours, as well as the subsequent refinement process, should be described.
M-10: Measuring and Mitigating Interobserver and Intraobserver Variability; Methods for Resolving Annotation Discrepancies
During the data annotation phase, inconsistencies might arise due to multiple observers interpreting samples differently (interobserver variability) or a single observer providing varying annotations for the same sample on different occasions (intraobserver variability). The authors should describe the methods used to quantify interobserver and intraobserver variabilities, possibly through metrics such as Hausdorff distance, Dice coefficient, and Jaccard Index—also known as Intersection over Union (IoU). It is essential to detail any standardized guidelines, training, or protocols provided to the annotators to minimize this variability. Furthermore, the authors should outline the steps or procedures undertaken to resolve discrepancies and ensure annotation consistency.
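As a minimal sketch of how such agreement metrics can be computed, the following uses NumPy and SciPy to measure the Dice overlap and the symmetric Hausdorff distance between two hypothetical expert masks; the toy masks stand in for real annotations.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the masks' point sets (pixels)."""
    pts_a, pts_b = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])

# Two hypothetical expert annotations of the same slice.
expert_1 = np.zeros((128, 128), dtype=bool); expert_1[40:80, 40:80] = True
expert_2 = np.zeros((128, 128), dtype=bool); expert_2[44:84, 44:84] = True
print(f"Dice: {dice_coefficient(expert_1, expert_2):.3f}")
print(f"Hausdorff: {hausdorff_distance(expert_1, expert_2):.1f} px")
```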
Model Description
M-11: Detailed Description of Model Architecture, Model Inputs, and Model Outputs
The authors should provide comprehensive details about the model architecture to ensure the model can be reconstructed based on the provided information. The expected input(s) for the model should be explicitly outlined, including image type, size, and preprocessing steps. Similarly, the expected outputs and post-processing steps must be clearly described. If feasible, a link should be provided to a public repository where the code is available.
M-12: Strategy for Initializing Model Parameters
The strategy for model parameter initialization should be described. When a transfer learning approach is employed, it is essential to specify the source of the pre-trained weights and biases. If pre-trained parameters are used, the authors need to clarify which layers remain open for retraining or weight adjustment tailored to the intended task. If transfer learning is not used, the method for initializing the model’s parameters should be outlined.
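A hedged example of one common initialization pattern follows: a torchvision segmentation network whose backbone is initialized from ImageNet weights and frozen, leaving only the decoder head open for retraining. The specific architecture and layer choices are illustrative, not a prescription.

```python
import torch
import torchvision

# Hypothetical example: a segmentation network with an ImageNet-pretrained
# backbone and a freshly initialized two-class decoder head.
model = torchvision.models.segmentation.fcn_resnet50(
    weights_backbone=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,
    num_classes=2,  # background vs. ROI
)

# Freeze the pretrained backbone; only the decoder head is retrained.
for param in model.backbone.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```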
Model Training
M-13: Model Hyperparameters and the Methods for Choosing the Model Hyperparameters
The authors should describe the hyperparameters used in model training, including but not limited to learning rate, optimizer, and loss function. If hyperparameters are determined through a trial-and-error process, this procedure should be described, illustrating the range tested and the criteria for final selection. In cases where systematic hyperparameter tuning methods like grid search, random search, or Bayesian optimization are used, details of the search strategy and results should be included.
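For illustration, a minimal random-search sketch is shown below; `train_and_validate` is a placeholder that would wrap the study's actual training and validation loops (here it returns a mock score purely so the sketch runs end to end).

```python
import random

random.seed(0)

def train_and_validate(config: dict) -> float:
    """Placeholder: train a model with `config` and return validation Dice.
    Returns a mock score here so the sketch is runnable."""
    return random.random()

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    "batch_size": [4, 8, 16],
    "loss": ["dice", "cross_entropy", "dice+ce"],
}

best_score, best_config = -1.0, None
for trial in range(20):  # number of random configurations to try
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_validate(config)
    if score > best_score:
        best_score, best_config = score, config

print(f"Best validation Dice {best_score:.3f} with {best_config}")
```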
M-14: Image Preprocessing Steps
Image processing is often important in machine learning and deep learning applications. Preprocessing steps, if any, should be described with enough detail to allow for reproducibility of the results. These could include steps such as applying intensity normalizations, image cropping, or resizing.
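A minimal preprocessing sketch follows, assuming NumPy and scikit-image; the clipping percentiles and the target grid are illustrative choices, not recommendations.

```python
import numpy as np
from skimage.transform import resize  # assumes scikit-image is installed

def preprocess(volume: np.ndarray, target_shape=(128, 128, 64)) -> np.ndarray:
    """Example preprocessing: clip outliers, z-score normalize, resample."""
    # Clip intensities to the 1st-99th percentile to suppress outliers.
    lo, hi = np.percentile(volume, [1, 99])
    volume = np.clip(volume, lo, hi)
    # Z-score intensity normalization.
    volume = (volume - volume.mean()) / (volume.std() + 1e-8)
    # Resample to the fixed grid expected by the network.
    return resize(volume, target_shape, order=1, preserve_range=True)

volume = np.random.rand(200, 200, 100).astype(np.float32)
print(preprocess(volume).shape)  # (128, 128, 64)
```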
M-15: Image Augmentation
Image augmentation is a common practice in developing deep learning-based segmentation models, especially in the absence of large-scale annotated datasets [13,14,15]. In the context of image segmentation, data augmentation refers to methods that computationally transform an image such that the annotation for the newly generated image can be computationally inferred, alleviating the need for further data collection or manual annotation (see Fig. 1). Image augmentation mitigates overfitting by introducing data variability and artificially increasing the sample size. Often, a stochastic augmentation pipeline is composed, where a sequence of augmentations, each with a given probability of being applied, creates an augmented version of an input image and its corresponding mask.
Fig. 1 Example of image augmentations. The original image (left) has been augmented by zooming (middle) and rotation (right)
To reproduce a data augmentation pipeline, it is essential to provide detailed information about the pipeline. Model-based data augmentation relies on a model to generate synthetic images [15]. When using model-based data augmentations, it is essential to provide information on how to acquire the model and instructions on how to use these models for image augmentation.
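As an illustration of such a stochastic pipeline, the sketch below uses Albumentations [13]; the specific transforms and probabilities are placeholders, and geometric transforms are applied identically to the image and its mask.

```python
import albumentations as A
import numpy as np

# Each transform fires with its own probability; geometric transforms are
# applied to both the image and the mask, intensity transforms to the
# image only.
pipeline = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
])

image = np.random.rand(256, 256).astype(np.float32)       # placeholder image
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:150, 100:150] = 1                                # placeholder mask

augmented = pipeline(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```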
M-16: Criteria and Process for Final Model Selection
Model training is an iterative process where a model is updated over multiple epochs. Therefore, it is essential to establish and report criteria for selecting the best model from those developed across different epochs. This selection process can be informed by various performance metrics and stopping criteria. Consequently, reporting the specific performance metrics and the criteria used to halt the training process is crucial. A common approach, for instance, involves monitoring the loss function on the validation and training sets. Decisions can be based on a predetermined threshold for performance improvement on a validation set, a predetermined number of epochs without improvement, or other domain-specific criteria. Understanding these factors offers insights into the model’s reliability and its applicability to real-world scenarios.
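A minimal early-stopping sketch in PyTorch is shown below; the tiny convolutional layer and the mock validation loss are placeholders for a real network and validation loop, and the improvement threshold and patience values are illustrative.

```python
import random
import torch
import torch.nn as nn

random.seed(0)
model = nn.Conv2d(1, 2, kernel_size=3, padding=1)  # stand-in for a real network

def validate(model) -> float:
    """Placeholder validation loop; returns a mock validation loss."""
    return random.random()

best_val_loss, stale_epochs, patience = float("inf"), 0, 10
for epoch in range(200):
    # ... the actual training loop for one epoch would run here ...
    val_loss = validate(model)
    if val_loss < best_val_loss - 1e-4:        # minimum improvement threshold
        best_val_loss, stale_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep best checkpoint
    else:
        stale_epochs += 1
        if stale_epochs >= patience:           # early-stopping criterion
            print(f"Stopped at epoch {epoch}; best val loss {best_val_loss:.3f}")
            break
```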
M-17: Hyperparameters That Led to the Best Model
Unlike model parameters that are learned based on the training set during the training process, the hyperparameters are often either manually assigned or selected based on some heuristic methods. When using heuristics to choose the model hyperparameters, the validation set is used to select hyperparameters that result in better model performance. Examples of hyperparameters are learning rate, optimization algorithm, momentum, and batch size. The set of model hyperparameters that lead to the best result should be stated in the paper.
M-18: Ensemble Techniques: Model Diversity, Prediction Consolidation, and Computational Considerations (if Applicable)
Due to their high capacity for learning complex problems, deep learning models often exhibit high variance, especially when trained on small datasets. Ensemble approaches combine multiple models to reduce this prediction variance and enhance overall performance [4, 5]. The individual models within an ensemble may vary in their architectures or in the datasets on which they have been trained. When employing ensemble techniques, it is important to outline these differences. Additionally, the method by which predictions from these models are consolidated into a final prediction should be clearly described. Detailed information about the added computational burden of deploying these models should also be provided, as the computational requirement might render an approach impractical in some clinical settings.
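A hedged sketch of probability averaging, one common consolidation strategy, is shown below; the `predict_proba` interface and the mock models are assumptions made for illustration.

```python
import numpy as np

class MockModel:
    """Stand-in for a trained network; emits a random probability map."""
    def __init__(self, seed: int):
        self.rng = np.random.default_rng(seed)
    def predict_proba(self, image: np.ndarray) -> np.ndarray:
        return self.rng.random(image.shape)

def ensemble_predict(models, image, threshold=0.5):
    """Average per-voxel foreground probabilities across ensemble members,
    then threshold to obtain the final binary mask."""
    probs = np.stack([m.predict_proba(image) for m in models])
    return probs.mean(axis=0) >= threshold

image = np.zeros((64, 64), dtype=np.float32)
final_mask = ensemble_predict([MockModel(s) for s in range(5)], image)
print(f"Foreground voxels: {final_mask.sum()}")
```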
Model Evaluation
M-19: Metrics for Evaluating Model Performance
Choosing appropriate metrics for model evaluation is essential in developing generalizable models. For example, for problems where the area/volume of interest is a small fraction of the image, pixel/voxel accuracy does not provide a meaningful measure of model performance: a model that predicts that no pixels/voxels belong to the area/volume of interest can still achieve a high accuracy value. Dice score and Intersection over Union (IoU) are the most common measures used for model evaluation. To facilitate model comparison, providing these measures for all experiments is recommended. Also, for problems where other metrics, such as distance-based metrics, are commonly used to assess model performance, those should also be reported [6, 7].
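The following toy example illustrates why pixel accuracy misleads for small ROIs: a model that predicts no lesion at all still scores above 99% accuracy while achieving a Dice of zero.

```python
import numpy as np

# A 128x128 image with a 10x10 lesion (100 of 16384 pixels).
truth = np.zeros((128, 128), dtype=bool)
truth[60:70, 60:70] = True
pred = np.zeros_like(truth)  # model predicts "no lesion" everywhere

accuracy = (pred == truth).mean()
dice = 2 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum())
print(f"accuracy={accuracy:.3f}, dice={dice:.3f}")  # accuracy=0.994, dice=0.000
```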
M-20: Measuring Robustness or Sensitivity Analysis
Robustness refers to the ability of a model to maintain consistent performance despite minor perturbations or changes in input data. Noise and artifacts are common in medical imaging; therefore, segmentation models should be robust in order to be deployed in a clinical setting. Sensitivity analysis for a given model assesses the extent to which variations in input data affect the predictions made by the model. Given the diversity of human anatomy and the variability in medical imaging modalities, understanding which factors most influence model performance can offer key insights into its potential limitations and areas for improvement. Several best practices are recommended to ensure robust performance in medical imaging models. It is important to use established quantitative metrics such as the Dice coefficient and IoU to measure how model performance changes with variation in an input image. Visual comparisons could provide a qualitative sense of model predictions across different levels of image perturbation and noise. It is essential to report both successful and unsuccessful outcomes, as understanding model limitations is especially critical in clinical contexts. Models should also be tested using images from various sources and patient groups to ascertain widespread usability. Lastly, all assumptions about input data made during analysis should be explicitly documented to highlight their potential impact on the model’s clinical performance.
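As a minimal sensitivity-analysis sketch, the code below re-segments increasingly noisy copies of an image and records the Dice against a reference mask; the thresholding "model" is a stand-in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dice(a, b):
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def sensitivity_to_noise(image, reference, segment, sigmas=(0.0, 0.01, 0.05, 0.1)):
    """Re-segment increasingly noisy copies of `image` and record the Dice
    against the reference mask; a steep drop flags a brittle model."""
    return {s: dice(segment(image + rng.normal(0.0, s, image.shape)), reference)
            for s in sigmas}

def segment(img):
    """Mock inference: a fixed threshold stands in for a trained network."""
    return img > 0.5

image = rng.random((128, 128))
reference = segment(image)
print(sensitivity_to_noise(image, reference, segment))
```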
M-21: Internal Validation, External Validation, or Both
In internal validation, a subset of the dataset is used for model training and another subset for model evaluation. In contrast, the model is evaluated using an independently derived dataset in external validation. An external dataset often provides a better estimate of model generalizability and should ideally be the primary method for model evaluation whenever possible. It should be explicitly stated whether the model evaluation is internal, external, or both.
M-22: Level at Which Training, Validation, and Test Sets Are Disjoint (e.g., Patient or Institution)
When developing deep learning models, data are often partitioned into training, validation, and test sets. Partitions should ideally be conducted at the patient or institution level to ensure the same subject does not appear in more than one subset. Data partitioning at the institution level can further enhance model generalizability across different setups and data sources. However, when data from different institutions systematically differ, the model performance measures on these sets might substantially vary. In such cases, these differences should be studied and reported.
M-23: Data Points for Each Subject Are Exclusively Present in Training, Validation, or Test Sets
Due to substantial anatomical similarities between different images from the same patient, a model could associate irrelevant anatomical characteristics with an endpoint of interest instead of learning the condition under study. Consequently, the performance measures would not reflect the true model performance. In the context of segmentation models, this can lead the model to memorize the segmentation map for a patient based on these unrelated characteristics. Therefore, to avoid such issues, data points from the same patient should be confined to just one of the training, validation, or test sets.
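A patient-level split can be implemented with scikit-learn's GroupShuffleSplit, as in the hedged sketch below; the synthetic patient IDs are placeholders for real identifiers.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

image_ids = np.arange(100)                 # one entry per image
patient_ids = np.repeat(np.arange(25), 4)  # 4 images per patient

# First carve out a held-out test set at the patient level.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_val_idx, test_idx = next(gss.split(image_ids, groups=patient_ids))

# Then split the remainder into training and validation, still by patient.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
tr, va = next(gss.split(train_val_idx, groups=patient_ids[train_val_idx]))
train_idx, val_idx = train_val_idx[tr], train_val_idx[va]

# No patient appears in more than one subset.
assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])
assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])
```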
M-24: Oversampling Is Not Applied Before Splitting Data into Training, Validation, and Test Sets
Oversampling can contribute to developing segmentation models for imbalanced datasets, particularly for rare pathologies or conditions. However, if oversampling is performed before dataset partitioning, there is a risk that identical images could be distributed across the training, validation, and test sets. This would allow the model to memorize the ROIs/VOIs, resulting in a misleadingly over-optimistic assessment of the segmentation model. Therefore, it is essential to partition the dataset and then apply oversampling on the minority class(es) in the training set. This approach ensures that the model is trained on a more balanced dataset without compromising the integrity of the evaluation process.
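The following sketch illustrates the correct ordering, oversampling the minority class only within the training indices produced by a prior split; the labels and indices are synthetic.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def oversample_minority(train_idx: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Duplicate minority-class samples until classes balance. Run on the
    TRAINING indices only, after the train/val/test split."""
    classes, counts = np.unique(labels[train_idx], return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = train_idx[labels[train_idx] == minority]
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    return np.concatenate([train_idx, extra])

labels = rng.choice([0, 1], size=100, p=[0.9, 0.1])  # 1 = rare pathology
train_idx = np.arange(60)  # indices from a prior patient-level split
balanced = oversample_minority(train_idx, labels)
print(np.unique(labels[balanced], return_counts=True))
```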
M-25: Image Augmentation Is Not Applied Before Splitting Data into Training, Validation, and Test Sets
Data augmentation should only be applied after splitting the dataset into training, validation, and test sets. Although image augmentation changes some characteristics of an image, the augmented image still shares a substantial amount of information with the original (see Fig. 1). Consequently, a model could achieve high performance measures by memorizing segmentation maps, leading to an overestimation of performance and models that lack generalizability. By performing augmentation only on the training set, the overall performance of the model can be improved without compromising the integrity of the evaluation process.
M-26: Samples in the Test Set Are Not Used to Make a Decision About Preprocessing, Model Training, or Post-processing
Samples in the test set should not be used for selecting preprocessing or post-processing steps or during model training. Failing to adhere to this guideline can prevent the test set from providing an unbiased estimate of the model’s generalization error. This oversight might lead to over-optimistic performance measures that do not accurately represent the performance of the model on unseen data.
M-27: Describing Demographic and Clinical Characteristics of Training, Validation, and Test Sets
Demographic and clinical characteristics of samples in training, validation, and test sets should be described to better evaluate the clinical utility of a proposed model and to enhance the reproducibility of the results. For example, age groups (e.g., pediatric vs. geriatric populations) can have significant differences in anatomy, affecting how ROIs/VOIs appear on a medical image. Also, a trained model might perform differently for patients with different treatment histories or disease subtypes, resulting in substantially different performance measures for different compositions of test sets.
M-28: Strategies to Enhance Segmentation Model Robustness to Common Image Variations
Image variations, which are inherent in medical imaging due to factors like diverse acquisition protocols, hardware differences, and software discrepancies, must be effectively managed by a model to ensure its reliability and deployability in a clinical setting.
To address these variations, techniques such as data augmentation and domain adaptation can be employed. Domain adaptation techniques can help the model generalize across different imaging settings by aligning the feature distributions from different domains, ensuring the model performs consistently well regardless of the source of the images. For example, this could enable models to perform well in the presence of image artifacts, noise, or systematic variations in image acquisitions.
M-29: Software Libraries, Frameworks, and Packages
Libraries, packages, and frameworks used for training and evaluating the model(s) should be described with enough detail to allow for reproducing the results.
M-30: Availability of Trained Model and the Inference Code for Segmenting ROIs/VOIs in an Image Provided in a Standard Format, Except when Restricted by Intellectual Property Considerations
The trained model and inference code should preferably be accessible online, enabling readers and reviewers to assess the performance of the developed model with their own datasets or samples, facilitating comparison with current and future research. This accessibility should include both the preprocessing and post-processing pipelines.
The Results Section
R-1: Estimates of Performance Measures and Their Variability
We recommend providing a comprehensive report on the performance of the proposed medical image segmentation model. This encompasses not just the primary metric for assessing the performance of the segmentation model, such as Dice or IoU, but also the confidence intervals that reflect the uncertainty of these measures. Furthermore, given the diversity of medical imaging conditions and modalities, it is essential to highlight potential fluctuations in model performance across known subpopulations or subcategories. Factors to consider include variations in patient demographics, imaging equipment, imaging protocols, and external interferences or noise.
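One common way to obtain such confidence intervals is bootstrapping per-case scores, as in the hedged sketch below; the per-case Dice values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative per-case Dice scores from a test set.
case_dice = np.array([0.82, 0.91, 0.77, 0.88, 0.93, 0.69, 0.85, 0.90])

# Resample cases with replacement and collect the mean of each resample.
boot_means = [rng.choice(case_dice, size=len(case_dice), replace=True).mean()
              for _ in range(10_000)]
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean Dice {case_dice.mean():.3f} (95% CI {lower:.3f}-{upper:.3f})")
```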
R-2: Failure Analysis of Poorly Segmented Cases
Variations in image quality, human anatomy, and pathology, as well as overlapping structures, often lead to errors in predictions made by segmentation models. Segmentation errors typically manifest as false positive segments, false negative segments, or boundary inaccuracies. For example, when normal tissue is predicted as a tumor, it is a false positive segment; when the model partially or completely misses a tumor, the missing part is a false negative segment.
Often, a single measure is used to describe model performance. However, this approach can prevent a comprehensive understanding of model errors, especially in the presence of systematic errors. For instance, when images feature several ROIs of varying sizes, a model that misses small ROIs but accurately identifies large ones could still achieve a high aggregate score, such as IoU or Dice score. This can be misleading, as the model might be medically unreliable. Analyzing these errors offers insights that can guide model refinement. We recommend visualizing examples where a model fails to perform a medically desirable segmentation. The model errors could be assessed quantitatively or qualitatively.
R-3: A Scatter Plot Representing the Distribution of the Size of Region(s) or Volume(s) of Interest for Training, Validation, and Test Sets
We recommend visually assessing the distribution of sizes of the ROI(s) or VOI(s) across the training, validation, and test datasets. Each ROI or VOI can be represented as a point in this scatter plot. The Y-axis of the plot indicates the size/volume of an ROI/VOI, and different colors can be used to represent samples in training, validation, or test sets. This visualization provides insights into potential biases or imbalances in the size distribution of ROIs or VOIs. A balanced and overlapping distribution across the training, validation, and test sets suggests that the model has been trained and evaluated on a representative sample, minimizing the risk of overfitting to a specific range of ROI/VOI sizes or compromising model generalizability. Moreover, examining the overall size distribution of VOIs or ROIs can highlight the clinical utility of these models. For instance, a model primarily trained to detect large lymph nodes might have limited clinical relevance. A visual exploration of the size of ROIs/VOIs can quickly pinpoint such issues. These scatter plots can also highlight potential biases related to various confounders, such as imaging hardware, software, protocol, patient demographics, or medical conditions. This can be achieved by utilizing different shapes or colors for data points representing samples from each category of potential confounders.
Bland–Altman plots and MA plots can also be used to evaluate discrepancies between model predictions and the ground truth regarding the size of ROIs/VOIs. A Bland–Altman plot, also known as a Tukey mean-difference plot, visualizes the difference between two measurements against their mean; an MA plot is essentially a Bland–Altman plot of log-transformed values.
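A hedged plotting sketch follows, using Matplotlib with synthetic, log-normally distributed ROI volumes and mock predictions; the axis choices (jittered x, log-scaled y) are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Illustrative ROI volumes (mm³) for each subset.
sizes = {"train": rng.lognormal(7, 1, 200),
         "validation": rng.lognormal(7, 1, 40),
         "test": rng.lognormal(7, 1, 40)}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of ROI volumes: one point per ROI, colored by subset.
for subset, values in sizes.items():
    ax1.scatter(rng.uniform(size=len(values)), values, label=subset, alpha=0.5)
ax1.set_yscale("log"); ax1.set_ylabel("ROI volume (mm³)"); ax1.legend()

# Bland-Altman: difference vs. mean of predicted and ground-truth volumes.
truth = sizes["test"]
pred = truth * rng.normal(1.0, 0.1, len(truth))  # mock predictions
mean, diff = (pred + truth) / 2, pred - truth
ax2.scatter(mean, diff, alpha=0.5)
ax2.axhline(diff.mean(), linestyle="--")          # mean difference
for k in (-1.96, 1.96):                           # limits of agreement
    ax2.axhline(diff.mean() + k * diff.std(), linestyle=":")
ax2.set_xlabel("Mean volume (mm³)"); ax2.set_ylabel("Predicted - truth")

plt.tight_layout(); plt.show()
```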
R-4: Analyze Bias Across Patient Categories such as Relevant Sociodemographics and Imaging Protocols and Hardware
A model might not achieve the same performance level across all patient population subgroups. These subgroups could be defined based on sociodemographic characteristics such as age or sex, or based on factors such as imaging protocol, imaging hardware, or disease type, to name a few. It is essential to assess the performance of a model across medically relevant groups to avoid bias. By offering a detailed performance breakdown by these categories, we can enhance comprehension of the strengths and weaknesses of a model. Additionally, examining images from diverse categories provides a more comprehensive view of the model’s performance.
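A minimal sketch of such a breakdown using pandas follows; the result table and subgroup columns are synthetic placeholders for a real test-set evaluation.

```python
import pandas as pd

# Hypothetical per-case results assembled from the test-set evaluation.
results = pd.DataFrame({
    "dice":    [0.88, 0.91, 0.72, 0.85, 0.90, 0.70],
    "scanner": ["A", "A", "B", "A", "B", "B"],
    "sex":     ["F", "M", "F", "M", "F", "M"],
})

# Report mean, spread, and count of the Dice score per subgroup.
for column in ("scanner", "sex"):
    print(results.groupby(column)["dice"].agg(["mean", "std", "count"]))
```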
R-5: Failure Analysis by Visualizing the Worst-Performing Cases of the Model in the Internal Test Set and, if Applicable, in the External Test Set
A visual review of the most inaccurate predictions from the internal test set and, if applicable, the external test set is highly recommended to rapidly pinpoint areas that need refinement and assist with model improvement. It also uncovers and assists in mitigating inherent biases that the model might have. This rigorous analysis assists readers and reviewers in understanding where a model might falter and informs us about its trustworthiness. Further, by examining potential failures on external datasets, we can verify the ability of a model to generalize across diverse scenarios.
R-6: Performance on External Dataset(s), if Possible, and Explaining Any Statistically Significant Difference Between Performance Measures for Samples in the Internal and External Test Sets
Evaluating the performance of medical image segmentation algorithms on external datasets is crucial to ensure their generalizability across different data sources and conditions. Solely relying on internal datasets may lead to scenarios where a model performs exceptionally well on one specific dataset but fails in real-world scenarios. Statistically assessing any significant discrepancies between the performance of a model on the internal and external test set(s) provides insights into potential biases, limitations, or the robustness of the model, ensuring safer and more reliable clinical application.
The Discussion Section
D-1: Study Limitations, Potential Biases, and Generalization Concerns
Limitations of the study should be clearly detailed in the Discussion section. Potential biases, such as over-representation or under-representation of specific conditions or demographics that might have emerged during data collection, must be emphasized. If there is a lack of external evaluation, or if the external test set may not comprehensively represent the entire population, these issues should be explained. Additionally, any limitations related to study design, data quality, limitations related to the ground truth, or model implementation must be succinctly articulated.
D-2: Practical Utility and Clinical Integration of Segmentation Models
The authors should discuss the practical utility of their model in a medical context. This will help readers and reviewers grasp the significance of the work and understand how it can potentially be integrated into medical practice. Discussion of readiness for, or potential challenges to, clinical deployment from a regulatory perspective, and of how the algorithm might be integrated into the clinical workflow, would be desirable for a high-impact, comprehensive study.
D-3: Highlighting Data Imbalance due to Differences in the Size of ROIs/VOIs and Its Potential Effect on Performance Measures
If there is any data imbalance resulting from varied ROI/VOI sizes, the authors should clearly highlight this. They should also articulate the impact of such imbalance on the performance of the proposed model, including potential improvements or deteriorations if the dataset were balanced. Furthermore, the authors need to discuss the measures they took to address this imbalance, thereby demonstrating the value of their work.
The Conclusion Section
C-1: Concise Presentation
The authors should succinctly describe the contributions of their work in this section. This may include the novelty of the proposed approach, primary contributions, a brief overview of the methodology, the most notable findings of the study, and potential implications or future directions for research.
C-2: Proper Positioning of the Work in the Context of State-of-the-Art Practice, if Applicable
Proper positioning of research within the context of state-of-the-art practices provides readers and reviewers with a clear understanding of how the presented work compares with or advances beyond the current best practices in the field. In instances where particular research does not necessarily surpass the state-of-the-art, it is still important to understand its position in the broader landscape, as this can illuminate complementary aspects, potential synergies, or alternative perspectives.
C-3: Recommendations for Future Work, if Applicable
To guide subsequent studies, it is recommended that the authors detail challenges faced, highlight unaddressed gaps, and suggest potential methods or refinements for the future. This section should also hint at broader areas warranting exploration based on current findings, propose practical real-world applications, and, if applicable, provide an overview of planned subsequent research. By doing so, the paper paves the way for future scholarly endeavors.
C-4: The Conclusion Is Adequately Supported by the Results of the Study
It is essential to ensure that the conclusions of a study are directly derived from, and supported by, the empirical findings presented in the paper. It is imperative that there be no overgeneralization of results. By ensuring that the conclusions are firmly grounded in the actual results, the study preserves its integrity, credibility, and relevance to its audience.
Source Code
S-1: Code Is Made Available, or if Not, Is Justified Within the Manuscript as to Why It Is Omitted
Source code greatly assists readers in comprehending the methodology detailed in the articles. By executing the code in their own environments, readers can conduct experiments with new data using the existing methodology or potentially leverage it to develop new methodologies. This availability of code also facilitates comparisons with future research and the state-of-the-art in the field. Thus, it is recommended that the source code be made accessible and referenced in the article. If, for any reason, the code cannot be made available, authors should articulate the reasons for its omission.
Discussion
In this paper, we proposed the RIDGE checklist, a comprehensive framework designed to enhance the reproducibility and generalizability of biomedical image segmentation models. The RIDGE checklist aims to fill the gaps left by existing guidelines, such as CLAIM, by addressing the unique challenges associated with segmentation tasks in medical imaging. By incorporating detailed criteria and best practices, RIDGE provides researchers and reviewers with a structured approach to ensure that segmentation models are scientifically robust and clinically applicable. RIDGE covers various aspects of model development, from data handling and augmentation techniques to evaluation metrics and bias analysis, thereby facilitating the development and evaluation of reliable and generalizable segmentation models that can be effectively integrated into clinical workflows. Table 1 provides the checklist where guidelines are referred to by a character representing their associated section—Introduction (I), Methods (M), Results (R), Discussion (D), Conclusion (C), and Source Code (S)—followed by a number.
Various checklists, such as CLAIM, STARD, and TRIPOD, have been proposed to improve the reproducibility of research in the medical domain. Because these checklists collect and highlight best practices applicable across many scenarios, they have substantial overlap (see Table S1 in the Supplementary Materials): the CLAIM checklist is based on STARD, which in turn was developed based on the CONSORT checklist. However, each checklist is designed to cater to specific needs without rendering previous checklists obsolete; instead, each includes specialized guidelines to answer a particular need. RIDGE is no exception and aims to collect a comprehensive but not cumbersome set of best practices. RIDGE is specifically designed to include and clarify items that enhance the reproducibility of biomedical image segmentation models. While CLAIM provides a robust framework for AI applications in medical imaging, RIDGE introduces additional criteria specifically tailored for segmentation models to address their unique challenges. For instance, criteria such as M-24 and M-25 highlight the importance of applying oversampling and augmentation only after data splitting to prevent data leakage and ensure valid performance metrics. M-26 underscores the necessity of keeping test set samples separate from decisions about preprocessing, model training, or post-processing to maintain an unbiased evaluation of the model's generalization capability. M-28 focuses on strategies to enhance model robustness to common image variations. RIDGE also includes criteria such as R-3, which advocates visualizing the distribution of ROI/VOI sizes to identify potential biases, and R-4, which involves analyzing biases across different patient categories and imaging conditions. Finally, D-3 highlights the importance of addressing data imbalances related to ROI/VOI sizes to ensure robust and reliable segmentation models. These additional criteria are critical for developing clinically applicable and generalizable segmentation models.
We assessed RIDGE’s efficacy, usability, and overall value in providing a structured and comprehensive framework for evaluating medical image segmentation articles by applying it to a corpus of previously published papers [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Evaluating the generalizability of a model demands substantial experimentation; our evaluation was not aimed at reproducing these works or assessing their generalizability. Furthermore, the absence of reported evidence that a guideline was followed cannot be taken as evidence that it was not adhered to.
Our review of these published works showed that most papers miss one or more critical criteria regarding study reproducibility and model generalizability in their Methodology section. Often omitted or inadequately addressed criteria include the rationale for choosing the reference standard (M-7), the measurement of and mitigation strategy for interobserver and intraobserver variability (M-10), measuring robustness or sensitivity analysis (M-20), and ensuring that oversampling is not applied before splitting data into training, validation, and test sets (M-24). Additionally, the fact that samples in the test set should not be used to make decisions about preprocessing, model training, or post-processing is often not stated directly and cannot be inferred indirectly (M-26). The criteria related to the demographic and clinical characteristics of samples in training, validation, and test sets (M-27) and the strategies to enhance segmentation model robustness to common image variations (M-28) are also often omitted or inadequately addressed.
Our review also identified several areas frequently overlooked or insufficiently addressed in the Results section of these studies. Notably, the distribution of ROI/VOI sizes across training, validation, and test sets (R-3), which could facilitate understanding model performance and potential biases, is often not qualitatively or quantitatively highlighted. Additionally, there is a conspicuous gap in the analysis of bias, particularly regarding patient demographics, imaging protocols, and hardware variations (R-4). These gaps can lead to models that lack generalizability and are potentially biased. Moreover, there appears to be a lack of detailed failure analysis, especially in terms of visualizing the worst-performing cases within internal and, where applicable, external test sets (R-5). This type of analysis is crucial for identifying and addressing model weaknesses. Furthermore, the performance of models on external datasets, along with an explanation for any significant differences observed between internal and external performance measures (R-6), is often not reported. Such information, when provided, offers a strong signal regarding the generalizability of a model and its utility in clinical settings.
These findings highlight the need for more rigorous and comprehensive reporting to enhance the reliability and applicability of medical image segmentation models in clinical settings.
RIDGE has been primarily designed and evaluated for radiological image segmentation models. While we expect that many of the concepts are generalizable to other biomedical image segmentation tasks, such as histopathological images, we suggest that the evaluation of RIDGE for these applications be considered as future research.
Conclusion
In this manuscript, we proposed the RIDGE checklist to provide a comprehensive set of criteria for evaluating medical image segmentation studies. The RIDGE checklist has the potential to significantly enhance the quality and consistency of research on AI-based segmentation approaches. By emphasizing a thorough review of methodologies, results, and discussions, RIDGE encourages a more rigorous and transparent approach to study design and reporting. This is particularly crucial in medical image segmentation, which directly impacts clinical outcomes, such as in radiotherapy planning or diagnostic accuracy. The RIDGE checklist can thus guide researchers toward higher standards in AI-based medical image segmentation studies.
References
Allard, F.D., Goldsmith, J.D., Ayata, G., Challies, T.L., Najarian, R.M., Nasser, I.A., Wang, H., Yee, E.U.: Intraobserver and interobserver variability in the assessment of dysplasia in ampullary mucosal biopsies. The American Journal of Surgical Pathology 42(8), 1095–1100 (2018)
Kulberg, N.S., Reshetnikov, R.V., Novik, V.P., Elizarov, A.B., Gusev, M.A., Gombolevskiy, V.A., Vladzymyrskyy, A.V., Morozov, S.P.: Inter-observer variability between readers of CT images: all for one and one for all. Digital Diagnostics 2(2), 105–118 (2021)
Covert, E.C., Fitzpatrick, K., Mikell, J., Kaza, R.K., Millet, J.D., Barkmeier, D., Gemmete, J., Christensen, J., Schipper, M.J., Dewaraja, Y.K.: Intra- and inter-operator variability in MRI-based manual segmentation of HCC lesions and its impact on dosimetry. EJNMMI Physics 9(1), 90 (2022)
Schmidt, A., Morales-Alvarez, P., Molina, R.: Probabilistic modeling of inter- and intra-observer variability in medical image segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21097–21106 (2023)
Kelly, C.J., Karthikesalingam, A., Suleyman, M., Corrado, G., King, D.: Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine 17, 1–9 (2019)
Maleki, F., Ovens, K., Gupta, R., Reinhold, C., Spatz, A., Forghani, R.: Generalizability of machine learning models: Quantitative evaluation of three methodological pitfalls. Radiology: Artificial Intelligence 5(1), 220028 (2022)
Yu, A.C., Mohajer, B., Eng, J.: External validation of deep learning algorithms for radiologic diagnosis: a systematic review. Radiology: Artificial Intelligence 4(3), 210064 (2022)
Hadjiiski, L., Cha, K., Chan, H., Drukker, K., Morra, L., Nappi, J.J., Sahiner, B., Yoshida, H., Chen, Q., Deserno, T.M., et al.: AAPM task group report 273: recommendations on best practices for AI and machine learning for computer aided diagnosis in medical imaging. Medical Physics 50(2), 1–24 (2023)
Collins, G.S., Reitsma, J.B., Altman, D.G., Moons, K.G.: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) the TRIPOD statement. Circulation 131(2), 211–219 (2015)
Schulz, K.F., Altman, D.G., Moher, D.: CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. Journal of Pharmacology and Pharmacotherapeutics 1(2), 100–107 (2010)
Bossuyt, P.M., Reitsma, J.B., Bruns, D.E., Gatsonis, C.A., Glasziou, P.P., Irwig, L.M., Lijmer, J.G., Moher, D., Rennie, D., Vet, H.C.: Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Family Practice 21(1), 4–10 (2004)
Mongan, J., Moy, L., Kahn Jr, C.E.: Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiological Society of North America (2020)
Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information 11(2), 125 (2020)
Bloice, M.D., Roth, P.M., Holzinger, A.: Biomedical image augmentation using augmentor. Bioinformatics 35(21), 4522–4524 (2019)
Chen, Y., Yang, X., Wei, Z., Heidari, A.A., Zheng, N., Li, Z., Chen, H., Hu, H., Zhou, Q., Guan, Q.: Generative adversarial networks in medical image augmentation: A review. Computers in Biology and Medicine 144, 105382 (2022)
Kumar, V., Webb, J., Gregory, A., Meixner, D.D., Knudsen, J.M., Callstrom, M., Fatemi, M., Alizad, A.: Automated segmentation of thyroid nodule, gland, and cystic components from ultrasound images using deep learning. IEEE Access 8, 63482–63496 (2020)
Almotairi, S., Kareem, G., Aouf, M., Almutairi, B., Salem, M.A.-M.: Liver tumor segmentation in CT scans using modified SegNet. Sensors 20(5), 1516 (2020)
Sander, J., Vos, B.D., Išgum, I.: Automatic segmentation with detection of local segmentation failures in cardiac MRI. Scientific Reports 10(1), 21769 (2020)
Zhang, Y., Chan, S., Chen, J., Chang, K., Lin, C.-Y., Pan, H., Lin, W., Kwong, T., Parajuli, R., Mehta, R.S., et al.: Development of U-Net breast density segmentation method for fat-sat MR images using transfer learning based on non-fat-sat model. Journal of Digital Imaging 34, 877–887 (2021)
Salama, W.M., Aly, M.H.: Deep learning in mammography images segmentation and classification: Automated CNN approach. Alexandria Engineering Journal 60(5), 4701–4709 (2021)
Sappa, L.B., Okuwobi, I.P., Li, M., Zhang, Y., Xie, S., Yuan, S., Chen, Q.: RetFluidNet: Retinal fluid segmentation for SD-OCT images using convolutional neural network. Journal of Digital Imaging 34(3), 691–704 (2021)
Cho, Y., Kim, M.J., Park, B.J., Sim, K.C., Keu, Y.S., Han, Y.E., Sung, D.J., Han, N.Y.: Active learning for efficient segmentation of liver with convolutional neural network–corrected labeling in magnetic resonance imaging–derived proton density fat fraction. Journal of Digital Imaging 34, 1225–1236 (2021)
Zhang, D., Huang, G., Zhang, Q., Han, J., Han, J., Yu, Y.: Cross-modality deep feature learning for brain tumor segmentation. Pattern Recognition 110, 107562 (2021)
Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: TransBTS: Multimodal brain tumor segmentation using transformer. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pp. 109–119 (2021). Springer
Jalali, Y., Fateh, M., Rezvani, M., Abolghasemi, V., Anisi, M.H.: ResBCDU-Net: a deep learning framework for lung CT image segmentation. Sensors 21(1), 268 (2021)
Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision, pp. 205–218 (2022). Springer
Zhao, L., Asis-Cruz, J., Feng, X., Wu, Y., Kapse, K., Largent, A., Quistorff, J., Lopez, C., Wu, D., Qing, K., et al.: Automated 3D fetal brain segmentation using an optimized deep learning approach. American Journal of Neuroradiology 43(3), 448–454 (2022)
Goel, A., Shih, G., Riyahi, S., Jeph, S., Dev, H., Hu, R., Romano, D., Teichman, K., Blumenfeld, J.D., Barash, I., et al.: Deployed deep learning kidney segmentation for polycystic kidney disease MRI. Radiology: Artificial Intelligence 4(2), 210205 (2022)
Krishnan, A.P., Song, Z., Clayton, D., Gaetano, L., Jia, X., Crespigny, A., Bengtsson, T., Carano, R.A.: Joint MRI T1 unenhancing and contrast-enhancing multiple sclerosis lesion segmentation with deep learning in OPERA trials. Radiology 302(3), 662–673 (2022)
Primakov, S.P., Ibrahim, A., Timmeren, J.E., Wu, G., Keek, S.A., Beuque, M., Granzier, R.W., Lavrova, E., Scrivener, M., Sanduleanu, S., et al.: Automated detection and segmentation of non-small cell lung cancer computed tomography images. Nature Communications 13(1), 3423 (2022)
Lin, Y., Lin, Y., Huang, Y., Ho, C., Chiang, H., Lu, H., Wang, C., Wang, J., Ng, S., Lai, C., et al.: Generalizable transfer learning of automated tumor segmentation from cervical cancers toward a universal model for uterine malignancies in diffusion-weighted MRI. Insights into Imaging 14(1), 14 (2023)
Yeung, M., Rundo, L., Nan, Y., Sala, E., Schönlieb, C., Yang, G.: Calibrating the dice loss to handle neural network overconfidence for biomedical image segmentation. Journal of Digital Imaging 36(2), 739–752 (2023)
Wang, Y., Zhang, H., Wang, T., Yao, L., Zhang, G., Liu, X., Yang, G., Yuan, L.: Deep learning for the ovarian lesion localization and discrimination between borderline and malignant ovarian tumors based on routine MR imaging. Scientific Reports 13(1), 2770 (2023)
Ma, X., Hadjiiski, L.M., Wei, J., Chan, H., Cha, K.H., Cohan, R.H., Caoili, E.M., Samala, R., Zhou, C., Lu, Y.: U-Net based deep learning bladder segmentation in CT urography. Medical Physics 46(4), 1752–1765 (2019)
Author information
Contributions
All authors contributed to the study conception and design and utilized the checklist to review papers. The first draft of the manuscript was written by Farhad Maleki, and all authors commented on all subsequent versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics Approval
This research does not involve human participants, their data or biological material.
Consent to Participate
Not applicable as this research does not involve human subjects.
Consent for Publication
The manuscript does not contain any individual person’s data in any form (including individual details, images, or videos). All visualization content is anonymized and derived from public resources, ensuring that all data used is publicly available.
Competing Interests
FM, LM, PR, BK, AG, DW, RD, MM, AHA, SS, NT, and TK are members of the Machine Learning Education Subcommittee of Society for Imaging Informatics in Medicine (SIIM). LM is the Editor of Radiology with salary support from RSNA and serves on the Editorial Board of JMRI. LM has received grant support from the Siemens Research Grant, the Gordon and Betty Moore Foundation, the Mary Kay Foundation, Google, and NCI/NIH. LM has received personal fees from Lunit Insight, ICAD, Guerbet, and Medscape and is on the Advisory Board for ICAD, Lunit, and Guerbet. LM holds stock options in Lunit and has been reimbursed for meeting and travel expenses by the British Society of Breast Radiology, the European Society of Breast Imaging, and the Korean Society of Radiology. LM is also a member of the ISMRM Board of Trustees and serves on the ACR Data Safety Monitoring Board. RF has had a research collaboration/grant and has acted as a consultant and/or speaker for Nuance Communications Inc., Canon Medical Systems Inc., and GE Healthcare. RF is also a co-investigator on a National Institutes of Health STTR grant subaward and a co-principal investigator on a National Science Foundation grant. FK is the Vice-chair of the SIIM Machine Learning Committee, a member of the RIC at RSNA, a member of the AI committee at RSNA, an Early Career Consultant to the Editor of Radiology, and an Associate Editor for Radiology: Artificial Intelligence. FK is also a consultant for MD.ai, a consultant for GE Healthcare, and a speaker for Sharing Progress in Cancer Care. The rest of the authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Maleki, F., Moy, L., Forghani, R. et al.: RIDGE: Reproducibility, Integrity, Dependability, Generalizability, and Efficiency Assessment of Medical Image Segmentation Models. J Imaging Inform Med 38, 2524–2536 (2025). https://doi.org/10.1007/s10278-024-01282-9