A Cascaded DERNet and YOLO11 Framework for Spinal Lesion Triage and Localization with Explainable AI

Anonymized Affiliations

[Figure panels: (a) Vertebral Collapse, (b) Osteophytes, (c) Spondylolisthesis, (d) Surgical Implant, (e) Disc Space Narrowing, (f) Foraminal Stenosis, (g) Other Lesion, (h) Normal]

Automated Lesion Detection and Localization: Visual demonstration of the cascaded DERNet-YOLO11 framework on VinDr-SpineXR benchmark images. The figure presents eight representative cases including seven pathological conditions with precise bounding-box localization: (a) Vertebral Collapse, (b) Osteophytes, (c) Spondylolisthesis, (d) Surgical Implants, (e) Disc Space Narrowing, (f) Foraminal Stenosis, (g) Other Lesions, alongside (h) a Normal spine radiograph for comparison. Each pathological case demonstrates the model's capability to accurately detect and spatially localize subtle lesions despite significant class imbalance (46.9:1 ratio), small object scales (often <1% of image area), and anatomical structure overlap. The localization boxes validate the clinical applicability of the integrated triage–localization pipeline.

Methodology Diagram

Overview of the cascaded framework for spinal lesion triage and localization.

Abstract

Automated analysis of spinal radiographs is essential for early diagnostic triage but remains challenging due to the visual subtlety and small scale of many spinal lesions. We propose a unified cascaded deep learning framework designed to improve screening sensitivity while maintaining precise lesion localization. The diagnostic workflow is structurally decoupled into two stages. First, DERNet, a heterogeneous ensemble of EfficientNetV2-S, DenseNet121, and ResNet50, performs high-sensitivity binary triage to filter normal radiographs. Abnormal cases are subsequently routed to a customized YOLO11-L detector for fine-grained lesion localization. Experimental evaluation on the VinDr-SpineXR benchmark yields an AUROC of 91.03% for image-level classification and a mAP@0.5 of 40.10% for lesion-level detection. To enhance clinical interpretability, we integrate explainability techniques including LIME, Grad-CAM, and qualitative visualization to check that predictions are aligned with anatomically relevant structures. The proposed framework provides an interpretable and efficient solution for automated spinal radiograph analysis. Additional resources, including extended experimental analyses, error analysis, qualitative visualizations, real-time demos, and reproducibility resources, are available at the DERNet project website.

Cascaded DERNet-YOLO11 Framework: Dual-Stage Triage-Localization Architecture

🔴 Clinical Problem

  • Extreme class imbalance: 46.9:1 ratio
  • Small lesions: <1% field-of-view (~8,800 px²)
  • Dual requirements: Triage + Localization
  • Interpretability gap: Black-box AI limitations

🟡 Clinical Workflow

  • Stage 1: Binary triage for abnormality screening
  • Stage 2: 7-class localization with bounding boxes
  • Explainable AI: Visual validation support
  • Performance: 84.91% sensitivity, 81.68% specificity

🟢 Architecture

  • Stage 1: DenseNet-121 + EfficientNetV2-S + ResNet-50 ensemble (91.03% AUROC)
  • Stage 2: YOLO11-L with CSPDarknet + PANet (40.10% mAP@0.5)
  • Dataset: VinDr-SpineXR (10,468 images)
  • Real-time: 11 FPS on RTX 3050 GPU
DERNet Dual-Stage Architecture

Figure: Cascaded DERNet-YOLO11 dual-stage architecture combining ensemble classification (Stage 1) and object detection (Stage 2) for automated spine lesion triage and localization.
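
The routing between the two stages can be sketched as follows. This is a minimal illustration, not the authors' code: `triage_prob` and `detect_lesions` are hypothetical stand-ins for the DERNet ensemble and the YOLO11-L detector, and treating a score at or above the reported threshold of 0.478 as "abnormal" is an assumption.

```python
from typing import Callable, Dict

TAU = 0.478  # triage threshold reported on this page


def cascade(image, triage_prob: Callable, detect_lesions: Callable,
            tau: float = TAU) -> Dict:
    """Run Stage 1 triage; run Stage 2 localization only for abnormal cases."""
    p_abnormal = triage_prob(image)
    if p_abnormal < tau:
        # Normal radiographs are filtered out and never reach the detector.
        return {"label": "normal", "score": p_abnormal, "boxes": []}
    return {"label": "abnormal", "score": p_abnormal,
            "boxes": detect_lesions(image)}


# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
def fake_triage(image):
    return image["p"]


def fake_detector(image):
    return [("osteophytes", (10, 20, 50, 60), 0.71)]


normal_case = cascade({"p": 0.12}, fake_triage, fake_detector)
abnormal_case = cascade({"p": 0.83}, fake_triage, fake_detector)
```

In this sketch the detector is only called for the second image, which is the point of the cascade: the more expensive localization stage runs only on radiographs flagged by triage.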

Implementation Details & Experimental Configuration

Specific parameter settings for the DERNet triage threshold, YOLO11 detector, and spinal radiograph enhancement.

Threshold Optimization

Optimal Threshold $\tau^* = 0.478$
Selected via grid search to maximize F1-score on validation set.

  • Range: $[0.35, 0.60]$ (default 0.5)
  • Step size: $\Delta\tau = 0.0002$
  • Metric: 5-fold CV average

Method Comparison:

| Method | $\tau^*$ | F1 (%) |
|---|---|---|
| F1-Max | 0.478 | 83.09 |
| Youden's J | 0.462 | 82.84 |
| Default | 0.500 | 82.63 |
| Balanced Acc | 0.485 | 83.01 |

Why F1-Max? More robust to class imbalance (46.9:1) than Youden or Accuracy.
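
The grid search can be illustrated with a short sketch. It assumes validation probabilities and binary labels are available as plain lists; the toy data and the function names `f1_at` and `best_threshold` are hypothetical, and the sketch scores a single validation set rather than averaging over 5 folds.

```python
def f1_at(threshold, probs, labels):
    """F1-score when predicting 'abnormal' for probabilities >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, lo=0.35, hi=0.60, step=0.0002):
    """Scan [lo, hi] in fixed steps and keep the first threshold with max F1."""
    n_steps = int(round((hi - lo) / step))
    best_tau, best_f1 = lo, -1.0
    for k in range(n_steps + 1):
        tau = lo + k * step
        score = f1_at(tau, probs, labels)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau, best_f1


# Toy validation data (hypothetical), not taken from VinDr-SpineXR.
probs = [0.10, 0.30, 0.40, 0.45, 0.52, 0.58, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
tau_star, f1_star = best_threshold(probs, labels)
```

With cross-validation, one plausible variant is to average the F1 curve across folds before taking the maximum; the page does not state which variant was used.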

YOLO11 Configuration

Optimized on COCO dataset
Hyperparameters determined via combination search:

| Param | Value | Rationale |
|---|---|---|
| $\gamma$ (Focal) | 2.0 | Balances easy/hard examples |
| $\alpha$ (Focal) | 0.25 | Optimal for minority foreground |
| $\lambda_{box}$ | 7.5 | Box regression weight |
| $\lambda_{cls}$ | 0.5 | Class loss weight |

Augmentation: Copy-Paste ($\alpha=0.2$) utilized to address severe class imbalance (46.9:1).
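
To make the roles of $\gamma$ and $\alpha$ concrete, here is a minimal sketch of the standard binary focal loss, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$, evaluated with the values from the table. How the loss is wired into the YOLO11 training loop is not specified on this page, so the function below is only an illustration of the formula.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, y: label (0 or 1).
    gamma and alpha default to the values listed in the table above.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident, correct prediction
hard = focal_loss(0.10, 1)  # confident, wrong prediction
```

The modulating factor $(1 - p_t)^{\gamma}$ shrinks the loss of the easy example by a factor of 0.0025 while leaving the hard example at 0.81 of its weighted cross-entropy, which is how the loss shifts training effort toward hard cases.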

Radiograph Preprocessing

Strategy: CLAHE enhancement to mitigate exposure variance.

$$I'(i, j) = \beta \cdot \frac{\mathrm{CDF}_{\Omega_k}(I(i, j)) - \mathrm{CDF}_{\min}}{|\Omega_k| - \mathrm{CDF}_{\min}}$$

| Param | Value | Purpose |
|---|---|---|
| Clip Limit | 2.0 | Limits contrast noise |
| Grid Size | $8 \times 8$ | Local equalization |
| Scale $\beta$ | 255 | 8-bit mapping |

Preprocessing ensures consistent feature extraction across 8,389 training images.
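
The per-tile mapping in the equation above can be sketched in plain Python. This is a simplified illustration for a single tile $\Omega_k$: it omits the histogram clipping (clip limit 2.0) and the interpolation between neighbouring tiles that a full CLAHE implementation performs, and the helper name `equalize_tile` is hypothetical. In practice a library routine such as OpenCV's `cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))` would typically be used; whether the authors used it is not stated.

```python
def equalize_tile(tile, beta=255):
    """Apply the CDF mapping to one tile given as a 2-D list of 8-bit values."""
    flat = [v for row in tile for v in row]
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    # Cumulative histogram: cdf[level] = number of pixels <= level.
    cdf, running = [0] * 256, 0
    for level in range(256):
        running += hist[level]
        cdf[level] = running
    cdf_min = next(c for c in cdf if c > 0)
    denom = len(flat) - cdf_min
    if denom == 0:  # constant tile: nothing to equalize
        return [row[:] for row in tile]
    return [[round(beta * (cdf[v] - cdf_min) / denom) for v in row]
            for row in tile]


# Toy low-contrast tile (hypothetical values in the range 50..90).
tile = [[50, 50, 60], [60, 70, 70], [80, 80, 90]]
out = equalize_tile(tile)
```

The narrow input range 50 to 90 is stretched over the full 0 to 255 range, which is the effect the preprocessing relies on to reduce exposure differences between radiographs.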

Performance Evaluation

Comprehensive analysis of the proposed DERNet model demonstrating superior accuracy and clinical relevance.

(a) LIME Explainability

LIME highlights the image regions that most influence the spinal abnormality classification.

Why it matters: Confirms the model focuses on clinically meaningful structures.

(b) Grad-CAM Analysis

Grad-CAM visualizations indicate the regions most critical for the spinal abnormality prediction.

Why it's better: Provides transparent, clinically interpretable validation.

(c) Qualitative Visualization

Comparison of ground-truth and predicted lesion annotations, along with error analysis, for the proposed framework.

Why it's better: DERNet approach preserves local anatomical details often missed by Transformers.

Classification Ensemble Performance

Performance comparison of DERNet ensemble (DenseNet-121 + EfficientNetV2-S + ResNet-50) against individual models and baseline using 5-fold cross-validation.

Table 1: Classification Performance (5-Fold Cross-Validation)

| Model | Parameters | AUROC (%) | Sensitivity (%) | Specificity (%) | F1-Score (%) | Weight |
|---|---|---|---|---|---|---|
| DenseNet-121 | 8.0M | 86.93 | 80.39 | 79.32 | 79.55 | 0.42 |
| EfficientNetV2-S | 21.5M | 89.44 | 70.80 | 91.12 | 79.34 | 0.32 |
| ResNet-50 | 25.6M | 88.88 | 82.72 | 78.13 | 80.15 | 0.26 |
| VinDr Ensemble [2] | - | 88.61 | 83.07 | 79.32 | 81.06 | - |
| HealNNet [15] | - | 88.84 | - | - | 81.20 | - |
| DERNet Ensemble | 18.3M avg | 91.03 | 84.91 | 81.68 | 83.09 | - |
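
The Weight column suggests how the three backbones are combined. The sketch below assumes the fusion is a weighted average of per-model abnormality probabilities; the page says only "weighted fusion", so the exact rule (probabilities versus logits, for example) is an assumption, and the input probabilities are made up for illustration.

```python
# Fusion weights from the Weight column of Table 1.
WEIGHTS = {"densenet121": 0.42, "efficientnetv2_s": 0.32, "resnet50": 0.26}


def fuse(probs, weights=WEIGHTS):
    """Weighted average of per-model abnormality probabilities."""
    total = sum(weights.values())
    return sum(weights[name] * probs[name] for name in weights) / total


# Hypothetical per-model outputs for one radiograph.
p_ensemble = fuse({"densenet121": 0.60,
                   "efficientnetv2_s": 0.40,
                   "resnet50": 0.55})
is_abnormal = p_ensemble >= 0.478  # threshold reported on this page
```

Here the fused score is 0.42 × 0.60 + 0.32 × 0.40 + 0.26 × 0.55 = 0.523, so the radiograph would be routed to the detector even though one backbone scored it below the threshold.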

Detection Performance & Analysis

Evaluation of YOLO11-L for spinal lesion localization across 7 pathology types with comparison to baseline methods.

Table 2: Comparison results of different methods (mAP@0.5)

| Method | LT2 | LT4 | LT6 | LT8 | LT10 | LT11 | LT13 | mAP@0.5 |
|---|---|---|---|---|---|---|---|---|
| Dino [19] | 16.58 | 22.87 | 28.53 | 32.71 | 59.78 | 41.28 | 3.24 | 29.28 |
| RetinaNet [20] | 14.53 | 25.35 | 41.67 | 32.14 | 65.49 | 51.85 | 5.30 | 28.09 |
| Faster R-CNN [9] | 22.66 | 35.99 | 49.24 | 31.68 | 65.22 | 51.68 | 2.16 | 31.83 |
| Sparse R-CNN [10] | 20.09 | 32.67 | 48.16 | 45.32 | 72.20 | 49.30 | 5.41 | 33.15 |
| VinDr-SpineXR [2] | 21.43 | 27.36 | 34.78 | 41.29 | 62.53 | 43.39 | 4.16 | 33.56 |
| EGCA-Net [18] | 22.36 | 29.75 | 36.73 | 44.69 | 66.58 | 50.41 | 2.09 | 36.09 |
| Ours (YOLO11-L) | 26.70 | 41.40 | 40.60 | 54.80 | 74.10 | 51.20 | 2.99 | 40.10 |

(*) LT2, LT4, LT6, LT8, LT10, LT11, and LT13 denote disc space narrowing, foraminal stenosis, osteophytes, spondylolisthesis, surgical implant, vertebral collapse, and other lesions, respectively.
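
For readers unfamiliar with the metric: under mAP@0.5, a predicted box counts as a true positive when its intersection-over-union (IoU) with a same-class ground-truth box is at least 0.5. The sketch below computes IoU for axis-aligned boxes given as `(x1, y1, x2, y2)`; the example coordinates are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (100, 100, 200, 200)
prediction = (120, 110, 210, 205)
overlap = iou(ground_truth, prediction)  # 7200 / 11350
is_match = overlap >= 0.5
```

For small lesions a shift of a few pixels changes the IoU sharply, which is one reason small-object classes tend to score lower under this criterion.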


Table 3: Ablation Study - Component Impact Analysis

| Component Removed | mAP@0.5 | ΔmAP | Impact Level |
|---|---|---|---|
| Full Model (Baseline) | 40.10% | - | - |
| - Copy-Paste Augmentation | 36.2% | -3.84% | High |
| - Mosaic Augmentation | 37.1% | -2.94% | High |
| - C2PSA Module (Attention) | 38.5% | -1.54% | Medium |
| - Focal Loss | 38.9% | -1.14% | Medium |
| - Task-Aligned Assignment | 39.2% | -0.84% | Low |

Table 4: Computational Performance Analysis

| Model | Parameters | FLOPs | Inference Time | Training Time | GPU Memory |
|---|---|---|---|---|---|
| DenseNet-121 | 7.98M | 5.72G | 18ms (~56 FPS) | ~12h (60 epochs) | 3.2GB |
| EfficientNetV2-S | 21.46M | 8.40G | 24ms (~42 FPS) | ~15h (60 epochs) | 4.8GB |
| ResNet-50 | 25.56M | 11.6G | 28ms (~36 FPS) | ~14h (60 epochs) | 5.1GB |
| Ensemble (Average) | 18.33M | 8.57G | 70ms (~14 FPS) | ~41h total | 4.37GB avg |
| YOLO11-L | 25.27M | 164.9G | 22ms (~45 FPS) | ~18.5h (50 epochs) | 6.2GB |
| Combined Pipeline | 43.6M | 173.47G | ~92ms (~11 FPS) | ~59.5h total | 10.57GB peak |
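
The pipeline latency in Table 4 follows from summing the per-model inference times. The short calculation below reproduces the reported figures; it assumes the three backbones run sequentially and that every image also passes through the detector, which is the worst case, since radiographs triaged as normal would skip Stage 2.

```python
# Per-model inference times in milliseconds, taken from Table 4.
STAGE1_MS = {"DenseNet-121": 18, "EfficientNetV2-S": 24, "ResNet-50": 28}
STAGE2_MS = 22  # YOLO11-L

ensemble_ms = sum(STAGE1_MS.values())   # sequential ensemble pass
pipeline_ms = ensemble_ms + STAGE2_MS   # triage followed by detection

ensemble_fps = 1000 / ensemble_ms
pipeline_fps = 1000 / pipeline_ms

print(f"ensemble: {ensemble_ms} ms (~{ensemble_fps:.0f} FPS)")
print(f"pipeline: {pipeline_ms} ms (~{pipeline_fps:.0f} FPS)")
```

Under these assumptions the average throughput on a real screening workload would sit between the two figures, depending on the fraction of images routed to Stage 2.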

Error Analysis

Detailed examination of model performance through confusion matrices and analysis of misclassified instances.

(a) Confusion Matrix

The class-wise confusion matrix provides a granular breakdown of the model's diagnostic accuracy across all seven spinal lesion categories. By visualizing true positives versus misclassifications, we quantitatively assess the impact of severe class imbalance (e.g., distinguishing rare "Vertebral collapse" from frequent "Osteophytes"). The matrix reveals high diagonal density, confirming robust sensitivity even for minority classes, while highlighting specific inter-class ambiguities—such as the subtle overlap between "Disc space narrowing" and "Spondylolisthesis"—that guided our refined architectural choices.

(b) False Positive Analysis

A systematic investigation of False Positives (FPs) elucidates the model's decision boundaries in complex radiological environments. We examine instances where normal anatomical variations, high-density bone structures, or external artifacts (e.g., surgical clips) were incorrectly flagged as pathological. This analysis is critical for clinical deployment, as reducing FPs—particularly in triage settings—minimizes unnecessary follow-ups. Our findings show that the majority of FPs occur in low-contrast regions, necessitating the integration of context-aware attention mechanisms to suppress background noise.

(c) Missed Lesions Analysis

A comprehensive review of False Negatives (FNs) uncovers the morphological characteristics of spinal lesions that evade detection. We analyze 'missed' cases to identify patterns such as extreme subtlety (lesions occupying <1% of the FOV), occlusion by overlapping structures, or atypical visual presentations. Understanding these failure modes—specifically in 'Other lesions' and early-stage 'Foraminal stenosis'—drives future improvements in multi-scale feature extraction. This rigorous audit ensures transparency and targeted refinement, directly addressing the safety-critical requirement of minimizing missed diagnoses in clinical workflows.

Key Findings

Summary of major breakthroughs in automated spine pathology triage and localization.

🎯

Robust Classification

DERNet ensemble: AUROC 91.03%, Sensitivity 84.91%, Specificity 81.68%, F1-Score 83.09% via weighted fusion of DenseNet-121, EfficientNetV2-S, ResNet-50.

⚖️

Class Imbalance Mitigation

Handled 46.9:1 class imbalance using Copy-Paste augmentation and Focal Loss, achieving mAP@0.5 40.10±0.3% for 7-class detection.

🧠

Explainable AI (XAI) Integration

LIME, Grad-CAM, and qualitative visualization provide saliency maps and visual validation, helping confirm that the model focuses on clinically relevant spinal regions for transparent diagnosis.

🏅

Real-Time Clinical Applicability

~11 FPS (92ms/image) on an RTX 3050 GPU for real-time spine triage, with localization of 7 pathology types and bounding-box visual explanations.

Live Clinical Deployment Interface

Real-world implementation of the cascaded DERNet and YOLO11 framework for automated spinal lesion triage, precise localization, and explainable AI-driven diagnosis visualization in clinical workflows.

Future Directions

Prospective research avenues to enhance DERNet's clinical impact and deployment.

  • ➤ Multi-center Validation: Future studies will focus on multicenter validation to facilitate generalization of the findings to a broader population, addressing the limitation of the single-center dataset.
  • ➤ Longitudinal Radiographic Surveillance: Integrating longitudinal radiographic analysis to proactively monitor disease progression over time.

BibTeX

@article{DERNet2026,
  author    = {Anonymized Authors},
  title     = {A Cascaded DERNet and YOLO11 Framework for Spinal Lesion Triage and Localization with Explainable AI},
  journal   = {Under Review},
  year      = {2026},
}