Model Training Comparison

YOLOv8s × 4 runs · YOLOv8L × 1 run · RT-DETR-L × 2 runs — complete results

Best mAP@0.5: 0.9733 (YOLOv8L 1000ep, peak)
Best mAP@0.5:0.95: 0.8314 (YOLOv8L 1000ep)
Lowest BG→Varroa FP: 98 (YOLOv8L 1000ep)
Total FP reduction vs baseline: −79%
RT-DETR 1000ep BG→Varroa FP: 171
Runs completed: 7

YOLOv8L Breaks Every Record

At 1000 epochs, YOLOv8L surpasses all previous runs on every metric. Most strikingly, background→varroa false positives collapsed from 237 (YOLOv8s 2000ep) to just 98: a 59% single-run improvement, and a 79% reduction from the 458 of the 200-epoch baseline. The larger model's extra capacity is clearly doing real work on the hardest class.
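For reference, a run of this shape can be launched through the Ultralytics Python API. This is a minimal sketch, not the exact training configuration used here: the dataset YAML name and image size are placeholders.

```python
from ultralytics import YOLO

# Fine-tune from pretrained YOLOv8-Large weights for 1000 epochs.
# "varroa.yaml" and imgsz=640 are hypothetical placeholders; substitute
# the dataset config and resolution actually used for these runs.
model = YOLO("yolov8l.pt")
model.train(data="varroa.yaml", epochs=1000, imgsz=640)
```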

Metric Charts

[Per-run metric charts: YOLOv8s 200ep · YOLOv8s 400ep · YOLOv8s 1000ep · YOLOv8s 2000ep · YOLOv8L 1000ep · RT-DETR-L 200ep · RT-DETR-L 1000ep]

Loss Curves

Val DFL Loss — a recurring pattern

Both YOLOv8s 2000ep and YOLOv8L 1000ep show val DFL loss rising after an early minimum (around epoch 300 for YOLOv8L), while train DFL continues declining. This appears to be a fundamental characteristic of this dataset with the YOLO architecture: the model learns bounding box distributions on training images that don't perfectly generalise. YOLOv8L shows the same signature but with much better overall metrics, suggesting the extra capacity compensates via better feature learning elsewhere.
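This divergence is easy to check from the results.csv that Ultralytics writes into each run directory. A sketch with the run path as a placeholder; some Ultralytics versions pad the CSV column names with spaces, hence the strip:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load per-epoch metrics logged by Ultralytics (path is a placeholder).
df = pd.read_csv("runs/detect/yolov8l_1000ep/results.csv")
df.columns = [c.strip() for c in df.columns]  # some versions pad names with spaces

# Plot train vs val DFL loss and mark where val DFL bottoms out.
plt.plot(df["epoch"], df["train/dfl_loss"], label="train DFL")
plt.plot(df["epoch"], df["val/dfl_loss"], label="val DFL")
best_epoch = df.loc[df["val/dfl_loss"].idxmin(), "epoch"]
plt.axvline(best_epoch, linestyle="--", color="gray", label=f"val min @ {best_epoch}")
plt.xlabel("epoch")
plt.ylabel("DFL loss")
plt.legend()
plt.show()
```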

Confusion Matrices (Final Epoch)

| Run | Varroa Recall | BG→Varroa FP | Notes |
|---|---|---|---|
| YOLOv8s 200ep | 0.87 | 458 | baseline |
| YOLOv8s 1000ep | 0.92 | 257 | ↓44% vs baseline |
| YOLOv8s 2000ep | 0.92 | 237 | ↓48% vs baseline |
| YOLOv8L 1000ep | 0.91 | 98 | ↓79% vs baseline |
| RT-DETR-L 1000ep | 0.91 | 171 | bee→bg misses: 762 |
| RT-DETR-L 200ep | 0.94 | 555 | undertrained |
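These per-run figures come straight off the confusion matrix. A minimal sketch of the arithmetic, assuming the Ultralytics layout (rows = predicted, columns = true, with background as the last class) and an illustrative bee/varroa class order; the cell values are made up, not the real run counts:

```python
import numpy as np

# Illustrative confusion matrix in the assumed Ultralytics layout:
# rows = predicted (bee, varroa, background), columns = true.
# Values are made up for demonstration; they are not the run results.
cm = np.array([
    [7900,   4,  20],  # predicted bee
    [   3, 330, 100],  # predicted varroa
    [  25,  30,   0],  # predicted background
])

BEE, VARROA, BG = 0, 1, 2  # assumed class indices

varroa_recall = cm[VARROA, VARROA] / cm[:, VARROA].sum()  # correct varroa / all true varroa
bg_to_varroa_fp = cm[VARROA, BG]   # true background predicted as varroa
bee_to_bg_miss = cm[BG, BEE]       # true bees predicted as background

print(f"varroa recall {varroa_recall:.2f}, "
      f"bg→varroa FP {bg_to_varroa_fp}, bee→bg misses {bee_to_bg_miss}")
```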

Analysis & Observations

YOLOv8L Wins — RT-DETR Doesn't Beat It
On every headline metric, YOLOv8L 1000ep beats RT-DETR 1000ep: mAP50 0.971 vs 0.964, mAP50-95 0.831 vs 0.789, precision 0.979 vs 0.953, recall 0.950 vs 0.929. The convolutional architecture with larger capacity outperforms the transformer on this dataset.
RT-DETR Still Trails YOLOv8L on BG→Varroa FP
RT-DETR 1000ep posts 171 BG→varroa false positives vs YOLOv8L's 98, so it doesn't take the FP benchmark either. However, this is a massive improvement over the 555 at 200ep, confirming that run was simply undertrained. The transformer attention is doing real work on the varroa/background distinction.
RT-DETR Has a Bee Miss Problem
The big red flag: 762 bees predicted as background (bee recall ~0.90 normalised). YOLOv8L has essentially zero of this — bees are saturated across all YOLO runs. RT-DETR is trading bee detection confidence for varroa sensitivity in a way YOLO never did. The val cls_loss also spikes hard after ep820, suggesting the model started overfitting its classification head late in training.
Val Cls Loss Spike — Overfit Signal
Val cls_loss reached its minimum at epoch 820 (0.275) then rose sharply to 0.407 by ep1000. This is the RT-DETR equivalent of the YOLO val DFL divergence — the classification head is overfitting. Best mAP50 was actually at ep818 (0.969), suggesting the saved weights at that checkpoint would be stronger than the final model.
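If the run's checkpoints were kept, this is straightforward to verify by validating the saved best weights against the final ones. A sketch assuming Ultralytics' RTDETR class and placeholder paths; best.pt is the fitness-best checkpoint Ultralytics saves alongside last.pt:

```python
from ultralytics import RTDETR

# Compare the fitness-best checkpoint against the final-epoch weights.
# The run directory and dataset YAML are placeholders.
for ckpt in ("best.pt", "last.pt"):
    model = RTDETR(f"runs/detect/rtdetr_1000ep/weights/{ckpt}")
    metrics = model.val(data="varroa.yaml")
    print(ckpt, "mAP50:", metrics.box.map50, "mAP50-95:", metrics.box.map)
```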
mAP50-95 Still Rising at ep1000
Like YOLOv8L, RT-DETR's mAP50-95 was still climbing at epoch 1000 (best at ep999: 0.789). Box localisation is still improving even as classification starts to overfit. This is a consistent pattern across both architectures on this dataset.
Production Candidate: YOLOv8L
YOLOv8L 1000ep is the clear winner: best mAP50, best mAP50-95, lowest FP, highest precision and recall, no bee miss problem. The best checkpoint (ep783 weights) is the strongest model in the series. RT-DETR would need architectural tuning or more data to close the gap.
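Deploying that candidate is then a short script against the saved best checkpoint. A sketch with placeholder paths and an illustrative confidence threshold, not a vetted production config:

```python
from ultralytics import YOLO

# Load the best YOLOv8L checkpoint (path is a placeholder) and run
# inference on a folder of hive frame images.
model = YOLO("runs/detect/yolov8l_1000ep/weights/best.pt")
results = model.predict("hive_frames/", conf=0.25)  # source and conf are illustrative
for r in results:
    print(r.path, "->", len(r.boxes), "detections")
```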

Results Summary

| Model / Run | Epochs | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall | BG→Varroa FP | Status |
|---|---|---|---|---|---|---|---|
| YOLOv8s 200ep | 200 | 0.9450 | 0.6079 | 0.9527 | 0.9166 | 458 | |
| YOLOv8s 400ep | 400 | 0.9549 | 0.6735 | 0.9624 | 0.9358 | 357 | |
| YOLOv8s 1000ep | 1000 | 0.9648 | 0.7496 | 0.9733 | 0.9466 | 257 | |
| YOLOv8s 2000ep | 2000 | 0.9620 (peak 0.9658) | 0.7734 | 0.9680 | 0.9477 | 237 | |
| YOLOv8L 1000ep | 1000 | 0.9708 (peak 0.9733) | 0.8314 | 0.9791 | 0.9501 | 98 | |
| RT-DETR-L 200ep | 200 | 0.9635 | 0.6406 | 0.9481 | 0.9436 | 555 | |
| RT-DETR-L 1000ep | 1000 | 0.9636 (peak 0.9686) | 0.7893 | 0.9534 | 0.9287 | 171 | bee miss: 762 |

Note on RT-DETR Loss Curves

RT-DETR uses GIoU + L1 + classification losses vs YOLOv8's box + cls + dfl. Loss values are not cross-comparable between architectures.