A Cascaded DERNet and YOLO11 Framework for Spinal Lesion Triage and Localization with Explainable AI

Anonymized Authors

Anonymized Affiliations



Automated Lesion Detection and Localization: Visual demonstration of the cascaded DERNet-YOLO11 framework on VinDr-SpineXR benchmark images. The figure presents eight representative cases: seven pathological conditions with bounding-box localization, namely (a) Vertebral Collapse, (b) Osteophytes, (c) Spondylolisthesis, (d) Surgical Implant, (e) Disc Space Narrowing, (f) Foraminal Stenosis, and (g) Other Lesion, alongside (h) a Normal spine radiograph for comparison. Each pathological case shows the model detecting and spatially localizing subtle lesions despite significant class imbalance (46.9:1 ratio), small object scales (often <1% of image area), and overlapping anatomical structures. The localization boxes illustrate the output that the integrated triage–localization pipeline presents to the clinician.

Overview of the cascaded framework for spinal lesion triage and localization.

Abstract

Automated analysis of spinal radiographs is essential for early diagnostic triage but remains challenging due to the visual subtlety and small scale of many spinal lesions. We propose a unified cascaded deep learning framework designed to improve screening sensitivity while maintaining precise lesion localization. The diagnostic workflow is structurally decoupled into two stages. First, DERNet, a heterogeneous ensemble of EfficientNetV2-S, DenseNet121, and ResNet50, performs high-sensitivity binary triage to filter normal radiographs. Abnormal cases are subsequently routed to a customized YOLO11-L detector for fine-grained lesion localization. Experimental evaluation on the VinDr-SpineXR benchmark demonstrates strong performance, achieving an AUROC of 91.03% for image-level classification and a mAP@0.5 of 40.10% for lesion-level detection. To enhance clinical interpretability, we integrate explainability techniques including LIME, Grad-CAM, and qualitative visualization, so that predictions can be checked against anatomically relevant structures. The proposed framework provides an interpretable and efficient solution for automated spinal radiograph analysis. Additional resources, including extended experimental analyses, error analysis, qualitative visualizations, real-time demos, and reproducibility resources, are available at the DERNet project website.

Cascaded DERNet-YOLO11 Framework: Dual-Stage Triage-Localization Architecture

🔴 Clinical Problem

Extreme class imbalance: 46.9:1 ratio
Small lesions: <1% field-of-view (~8,800 px²)
Dual requirements: Triage + Localization
Interpretability gap: Black-box AI limitations

🟡 Clinical Workflow

Stage 1: Binary triage for abnormality screening
Stage 2: 7-class localization with bounding boxes
Explainable AI: Visual validation support
Performance: 84.91% sensitivity, 81.68% specificity

🟢 Architecture

Stage 1: DenseNet-121 + EfficientNetV2-S + ResNet-50 ensemble (91.03% AUROC)
Stage 2: YOLO11-L with CSPDarknet + PANet (40.10% mAP@0.5)
Dataset: VinDr-SpineXR (10,468 images)
Real-time: 11 FPS on RTX 3050 GPU

Figure: Cascaded DERNet-YOLO11 dual-stage architecture combining ensemble classification (Stage 1) and object detection (Stage 2) for automated spine lesion triage and localization.
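
The routing between the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_cascade`, `TriageResult`, the box tuple layout, and the `classify` / `detect` callables are hypothetical names. Treating a probability at or above the threshold as abnormal is also an assumption. The default threshold is the value reported in the Threshold Optimization section.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max, class_name, confidence).
Box = Tuple[float, float, float, float, str, float]


@dataclass
class TriageResult:
    abnormal_probability: float
    is_abnormal: bool
    boxes: List[Box] = field(default_factory=list)


def run_cascade(
    image: object,
    classify: Callable[[object], float],      # Stage 1: returns P(abnormal)
    detect: Callable[[object], List[Box]],    # Stage 2: returns lesion boxes
    threshold: float = 0.478,
) -> TriageResult:
    """Screen the image first; run the detector only if it is flagged."""
    probability = classify(image)
    if probability < threshold:
        # Normal radiographs are filtered out and never reach the detector.
        return TriageResult(probability, False)
    return TriageResult(probability, True, detect(image))
```

Passing the two stages in as callables keeps the sketch independent of any particular model framework, and it makes explicit that detector cost is only paid for images flagged as abnormal.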

Implementation Details & Experimental Configuration

Specific parameter settings for the DERNet triage threshold, YOLO11 detector, and spinal radiograph enhancement.

Threshold Optimization

Optimal Threshold $\tau^* = 0.478$
Selected via grid search to maximize F1-score on validation set.

Range: $[0.35, 0.60]$ (default 0.5)
Step size: $\Delta\tau = 0.0002$
Metric: 5-fold CV average

Method Comparison:

Method	$\tau^*$	F1 (%)
F1-Max	0.478	83.09
Youden's J	0.462	82.84
Default	0.500	82.63
Balanced Acc	0.485	83.01

Why F1-Max? More robust to class imbalance (46.9:1) than Youden or Accuracy.
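
The grid search described above can be sketched in a few lines. This is an illustrative version under stated assumptions: `f1_at_threshold` and `search_threshold` are hypothetical helper names, a sample is called abnormal when its probability is at or above the threshold, and the search runs on a single validation set rather than averaging over the 5 folds mentioned above.

```python
def f1_at_threshold(probs, labels, tau):
    """F1-score of the rule 'predict abnormal when prob >= tau'."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= tau and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < tau and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def search_threshold(probs, labels, lo=0.35, hi=0.60, step=0.0002):
    """Scan [lo, hi] in increments of `step`; return (best_tau, best_f1)."""
    n_steps = int(round((hi - lo) / step))
    best_tau, best_f1 = lo, -1.0
    for i in range(n_steps + 1):
        tau = lo + i * step
        f1 = f1_at_threshold(probs, labels, tau)
        if f1 > best_f1:  # ties keep the lowest threshold
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```

With the stated range and step size this evaluates 1,251 candidate thresholds. To reproduce the 5-fold protocol, one option would be to average the per-fold F1 at each candidate before taking the maximum.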

YOLO11 Configuration

Optimized on the COCO dataset.
Hyperparameters determined via combination search:

Param	Value	Rationale
$\gamma$ (Focal)	2.0	Balances easy/hard examples
$\alpha$ (Focal)	0.25	Optimal for minority foreground
$\lambda_{box}$	7.5	Box regression weight
$\lambda_{cls}$	0.5	Class loss weight

Augmentation: Copy-Paste ($\alpha=0.2$) utilized to address severe class imbalance (46.9:1).
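
For reference, the focal-loss parameters $\gamma$ and $\alpha$ in the table above enter the standard binary focal loss as shown below. This is a sketch of the textbook formulation for a single prediction; the source does not say how the loss is wired into the detector, and `focal_loss` is a hypothetical helper name.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one prediction.

    p: predicted foreground probability, y: label (1 = foreground, 0 = background).
    FL = -alpha_t * (1 - p_t)^gamma * log(p_t)
    """
    p = min(max(p, eps), 1.0 - eps)            # guard against log(0)
    p_t = p if y == 1 else 1.0 - p             # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With $\gamma = 2.0$, a confident correct prediction (`p_t = 0.9`) is scaled by $(1 - 0.9)^2 = 0.01$, while a badly wrong one (`p_t = 0.1`) is scaled by $0.81$, so training gradients are dominated by hard examples.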

Radiograph Preprocessing

Strategy: CLAHE enhancement to mitigate exposure variance.

$$I'(i, j) = \beta \cdot \frac{CDF_{\Omega_k}(I(i, j)) - CDF_{min}}{|\Omega_k| - CDF_{min}}$$

Param	Value	Purpose
Clip Limit	2.0	Limits contrast noise
Grid Size	$8 \times 8$	Local equalization
Scale $\beta$	255	8-bit mapping

Preprocessing ensures consistent feature extraction across 8,389 training images.
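
The equalization formula above can be illustrated on a single tile $\Omega_k$. The sketch below is deliberately simplified: it applies only the CDF mapping and omits the histogram clipping (clip limit 2.0) and the interpolation between neighboring tiles that full CLAHE performs. `equalize_tile` is a hypothetical helper and assumes an 8-bit (`uint8`) tile. In practice a library routine such as OpenCV's `cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))` would typically be used; the source does not state which implementation the authors chose.

```python
import numpy as np


def equalize_tile(tile: np.ndarray, beta: int = 255) -> np.ndarray:
    """Apply I' = beta * (CDF(I) - CDF_min) / (|Omega| - CDF_min) to one uint8 tile."""
    hist = np.bincount(tile.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0][0]            # CDF at the darkest intensity present
    denom = tile.size - cdf_min          # |Omega| - CDF_min
    if denom == 0:                       # constant tile: nothing to stretch
        return tile.copy()
    mapped = beta * (cdf - cdf_min) / denom
    lut = np.clip(np.round(mapped), 0, beta).astype(np.uint8)
    return lut[tile]                     # look up every pixel's new intensity
```

The darkest intensity present in the tile maps to 0 and the brightest to $\beta$, which is what stretches local contrast in under- or over-exposed regions.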

Performance Evaluation

Analysis of the proposed DERNet model, covering explainability, classification accuracy, detection accuracy, and computational cost.

(a) LIME Explainability

LIME identifies the image regions that contribute most to the spinal abnormality classification.

Why it matters: Confirms the model focuses on clinically meaningful structures.

(b) Grad-CAM Analysis

Grad-CAM visualizations indicating the regions most critical to the spinal abnormality prediction.

Why it's better: Provides transparent, clinically interpretable validation.
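
The core Grad-CAM computation is small enough to show directly. The sketch below assumes the activations of one convolutional layer and the gradients of the abnormality score with respect to those activations have already been extracted; how they are obtained depends on the framework and is not covered here. `grad_cam` is a hypothetical helper name.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from arrays of shape (channels, height, width)."""
    # Channel importance: gradient averaged over the spatial dimensions.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, giving a (height, width) map.
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions that increase the target score.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam  # scale to [0, 1] for display
```

The resulting map has the resolution of the chosen layer and is usually upsampled to the radiograph size before being overlaid as a heatmap.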

(c) Qualitative Visualization

Comparison of the ground-truth and predicted lesion regions, along with error analysis, for the DERNet model.

Why it's better: The DERNet approach preserves local anatomical details that Transformer-based models often miss.

Classification Ensemble Performance

Performance comparison of DERNet ensemble (DenseNet-121 + EfficientNetV2-S + ResNet-50) against individual models and baseline using 5-fold cross-validation.

Table 1: Classification Performance (5-Fold Cross-Validation)

Model	Parameters	AUROC (%)	Sensitivity (%)	Specificity (%)	F1-Score (%)	Weight
DenseNet-121	8.0M	86.93	80.39	79.32	79.55	0.42
EfficientNetV2-S	21.5M	89.44	70.80	91.12	79.34	0.32
ResNet-50	25.6M	88.88	82.72	78.13	80.15	0.26
VinDr Ensemble [2]	-	88.61	83.07	79.32	81.06	-
HealNNet [15]	-	88.84	-	-	81.20	-
DERNet Ensemble	18.3M avg	91.03	84.91	81.68	83.09	-
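
One plausible reading of the Weight column in Table 1 is a weighted average of the three models' abnormality probabilities (soft voting). The sketch below shows that reading; the exact fusion rule is an assumption, and `WEIGHTS` and `fuse` are hypothetical names.

```python
# Ensemble weights as listed in Table 1.
WEIGHTS = {"densenet121": 0.42, "efficientnetv2_s": 0.32, "resnet50": 0.26}


def fuse(probabilities: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-model abnormality probabilities."""
    total = sum(weights.values())  # normalize in case weights do not sum to 1
    return sum(weights[name] * probabilities[name] for name in weights) / total
```

For example, per-model probabilities of 0.60, 0.40, and 0.50 fuse to $0.42 \cdot 0.60 + 0.32 \cdot 0.40 + 0.26 \cdot 0.50 = 0.51$, which would be classified as abnormal at the reported threshold of 0.478.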

Detection Performance & Analysis

Evaluation of YOLO11-L for spinal lesion localization across 7 pathology types, with comparison to baseline methods.

Table 2: Comparison results of different methods (mAP@0.5)

Method	LT2	LT4	LT6	LT8	LT10	LT11	LT13	mAP@0.5
Dino [19]	16.58	22.87	28.53	32.71	59.78	41.28	3.24	29.28
RetinaNet [20]	14.53	25.35	41.67	32.14	65.49	51.85	5.30	28.09
Faster R-CNN [9]	22.66	35.99	49.24	31.68	65.22	51.68	2.16	31.83
Sparse R-CNN [10]	20.09	32.67	48.16	45.32	72.20	49.30	5.41	33.15
VinDr-SpineXR [2]	21.43	27.36	34.78	41.29	62.53	43.39	4.16	33.56
EGCA-Net [18]	22.36	29.75	36.73	44.69	66.58	50.41	2.09	36.09
Ours (YOLO11-L)	26.70	41.40	40.60	54.80	74.10	51.20	2.99	40.10

(*) LT2, LT4, LT6, LT8, LT10, LT11, and LT13 denote disc space narrowing, foraminal stenosis, osteophytes, spondylolisthesis, surgical implant, vertebral collapse, and other lesions, respectively.
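
The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as matching a ground-truth box. A minimal sketch of that matching criterion follows; the `(x_min, y_min, x_max, y_max)` box layout and the helper names `iou` and `is_match` are assumptions for illustration, and the full mAP computation (ranking by confidence, precision-recall integration per class) is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred_box, gt_box, same_class, threshold=0.5):
    """A prediction matches when the class agrees and IoU reaches the threshold."""
    return same_class and iou(pred_box, gt_box) >= threshold
```

For small lesions a shift of only a few pixels can push IoU below 0.5, which is one reason small-object classes tend to score lower under this metric.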

Table 3: Ablation Study - Component Impact Analysis

Component Removed	mAP@0.5 (%)	ΔmAP (pp)	Impact Level
Full Model (Baseline)	40.10	-	-
- Copy-Paste Augmentation	36.2	-3.84	High
- Mosaic Augmentation	37.1	-2.94	High
- C2PSA Module (Attention)	38.5	-1.54	Medium
- Focal Loss	38.9	-1.14	Medium
- Task-Aligned Assignment	39.2	-0.84	Low

Table 4: Computational Performance Analysis

Model	Parameters	FLOPs	Inference Time	Training Time	GPU Memory
DenseNet-121	7.98M	5.72G	18ms (~56 FPS)	~12h (60 epochs)	3.2GB
EfficientNetV2-S	21.46M	8.40G	24ms (~42 FPS)	~15h (60 epochs)	4.8GB
ResNet-50	25.56M	11.6G	28ms (~36 FPS)	~14h (60 epochs)	5.1GB
Ensemble (Average)	18.33M	8.57G	70ms (~14 FPS)	~41h total	4.37GB avg
YOLO11-L	25.27M	164.9G	22ms (~45 FPS)	~18.5h (50 epochs)	6.2GB
Combined Pipeline	43.6M	173.47G	~92ms (~11 FPS)	~59.5h total	10.57GB peak
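
The two aggregate rows of Table 4 appear to follow from the per-model rows by simple arithmetic, reproduced below. The aggregation rules (averaging parameters, FLOPs, and memory across the three backbones; summing their latencies, which implies sequential execution) are inferred from the numbers rather than stated in the source, and the variable names are illustrative.

```python
# Per-model figures copied from Table 4.
backbones = {
    "DenseNet-121": {"params_m": 7.98, "gflops": 5.72, "latency_ms": 18, "memory_gb": 3.2},
    "EfficientNetV2-S": {"params_m": 21.46, "gflops": 8.40, "latency_ms": 24, "memory_gb": 4.8},
    "ResNet-50": {"params_m": 25.56, "gflops": 11.6, "latency_ms": 28, "memory_gb": 5.1},
}
detector = {"params_m": 25.27, "gflops": 164.9, "latency_ms": 22, "memory_gb": 6.2}


def column(key):
    return [row[key] for row in backbones.values()]


def mean(values):
    return sum(values) / len(values)


# "Ensemble (Average)" row: means, except latency, which is summed.
ensemble = {
    "params_m": mean(column("params_m")),       # ~18.33
    "gflops": mean(column("gflops")),           # ~8.57
    "latency_ms": sum(column("latency_ms")),    # 70
    "memory_gb": mean(column("memory_gb")),     # ~4.37
}

# "Combined Pipeline" row: ensemble row plus the detector row.
pipeline = {key: ensemble[key] + detector[key] for key in ensemble}
pipeline_fps = 1000.0 / pipeline["latency_ms"]  # ~10.9, reported as ~11 FPS

# Parameters held in memory if all three backbones and the detector are loaded.
total_params_m = sum(column("params_m")) + detector["params_m"]
```

Under this reading the 43.6M figure for the combined pipeline adds the average backbone size to the detector. If all three backbones are loaded together, the total parameter count is 55.00M + 25.27M = 80.27M.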

Error Analysis

Detailed examination of model performance through confusion matrices and analysis of misclassified instances.

(a) Confusion Matrix

The class-wise confusion matrix provides a granular breakdown of the model's diagnostic accuracy across all seven spinal lesion categories. By visualizing true positives versus misclassifications, we quantitatively assess the impact of severe class imbalance (e.g., distinguishing rare "Vertebral collapse" from frequent "Osteophytes"). The matrix reveals high diagonal density, confirming robust sensitivity even for minority classes, while highlighting specific inter-class ambiguities—such as the subtle overlap between "Disc space narrowing" and "Spondylolisthesis"—that guided our refined architectural choices.

(b) False Positive Analysis

A systematic investigation of False Positives (FPs) elucidates the model's decision boundaries in complex radiological environments. We examine instances where normal anatomical variations, high-density bone structures, or external artifacts (e.g., surgical clips) were incorrectly flagged as pathological. This analysis is critical for clinical deployment, as reducing FPs—particularly in triage settings—minimizes unnecessary follow-ups. Our findings show that the majority of FPs occur in low-contrast regions, necessitating the integration of context-aware attention mechanisms to suppress background noise.

(c) Missed Lesions Analysis

A comprehensive review of False Negatives (FNs) uncovers the morphological characteristics of spinal lesions that evade detection. We analyze 'missed' cases to identify patterns such as extreme subtlety (lesions occupying <1% of the FOV), occlusion by overlapping structures, or atypical visual presentations. Understanding these failure modes—specifically in 'Other lesions' and early-stage 'Foraminal stenosis'—drives future improvements in multi-scale feature extraction. This rigorous audit ensures transparency and targeted refinement, directly addressing the safety-critical requirement of minimizing missed diagnoses in clinical workflows.

Key Findings

Summary of the main results in automated spine pathology triage and localization.

🎯

Robust Classification

DERNet ensemble: AUROC 91.03%, Sensitivity 84.91%, Specificity 81.68%, F1-Score 83.09% via weighted fusion of DenseNet-121, EfficientNetV2-S, ResNet-50.

⚖️

Class Imbalance Mitigation

Handled 46.9:1 class imbalance using Copy-Paste augmentation and Focal Loss, achieving mAP@0.5 40.10±0.3% for 7-class detection.

🧠

Explainable AI (XAI) Integration

LIME, Grad-CAM, and qualitative visualization provide saliency maps and visual validation, helping confirm that the model focuses on clinically relevant spinal regions and supporting transparent diagnosis.

🏅

Real-Time Clinical Applicability

~11 FPS (92 ms/image) on an RTX 3050 GPU for real-time spine triage, with localization of 7 pathology types and bounding-box visual explanations.

Live Clinical Deployment Interface

Real-world implementation of the cascaded DERNet and YOLO11 framework for automated spinal lesion triage, precise localization, and explainable AI-driven diagnosis visualization in clinical workflows.

Future Directions

Prospective research avenues to enhance DERNet's clinical impact and deployment.

➤ Multi-center Validation: Future studies will focus on multicenter validation to facilitate generalization of the findings to a broader population, addressing the limitation of the single-center dataset.
➤ Longitudinal Radiographic Surveillance: Integrating longitudinal radiographic analysis to proactively monitor disease progression over time.

BibTeX

@article{DERNet2026,
  author    = {Anonymized Authors},
  title     = {A Cascaded DERNet and YOLO11 Framework for Spinal Lesion Triage and Localization with Explainable AI},
  journal   = {Under Review},
  year      = {2026},
}