Model Training Comparison

YOLOv8s × 4 runs · YOLOv8L × 1 run · RT-DETR-L × 2 runs — complete results

Best mAP@0.5: 0.9733 (YOLOv8L 1000ep, peak)
Best mAP@0.5:0.95: 0.8314 (YOLOv8L 1000ep)
Lowest BG→Varroa FP: 98 (YOLOv8L 1000ep)
Total FP reduction vs baseline: −79%
RT-DETR 1000ep BG→Varroa FP: 171
Runs completed: 7

YOLOv8L Breaks Every Record

At 1000 epochs, YOLOv8L surpasses all previous runs on every metric. Most strikingly, background→varroa false positives collapsed from 237 (YOLOv8s 2000ep) to just 98: a 59% single-run improvement, and a 79% reduction from the 458 of the 200-epoch baseline. The larger model's extra capacity is clearly doing real work on the hardest class.
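For reference, a run of this shape can be launched through the Ultralytics Python API. This is a minimal sketch, not the exact training configuration used here: the dataset YAML name and image size are placeholders.

```python
from ultralytics import YOLO

# Fine-tune from pretrained YOLOv8-Large weights for 1000 epochs.
# "varroa.yaml" and imgsz=640 are hypothetical placeholders; substitute
# the dataset config and resolution actually used for these runs.
model = YOLO("yolov8l.pt")
model.train(data="varroa.yaml", epochs=1000, imgsz=640)
```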

Metric Charts

[Per-run metric charts: YOLOv8s 200ep · YOLOv8s 400ep · YOLOv8s 1000ep · YOLOv8s 2000ep · YOLOv8L 1000ep · RT-DETR-L 200ep · RT-DETR-L 1000ep]

Loss Curves

Val DFL Loss — a recurring pattern

Both YOLOv8s 2000ep and YOLOv8L 1000ep show val DFL loss rising after an early minimum (around epoch 300 for YOLOv8L), while train DFL continues declining. This appears to be a fundamental characteristic of this dataset with the YOLO architecture: the model learns bounding box distributions on training images that don't perfectly generalise. YOLOv8L shows the same signature but with much better overall metrics, suggesting the extra capacity compensates via better feature learning elsewhere.
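This divergence is easy to check from the results.csv that Ultralytics writes into each run directory. A sketch with the run path as a placeholder; some Ultralytics versions pad the CSV column names with spaces, hence the strip:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load per-epoch metrics logged by Ultralytics (path is a placeholder).
df = pd.read_csv("runs/detect/yolov8l_1000ep/results.csv")
df.columns = [c.strip() for c in df.columns]  # some versions pad names with spaces

# Plot train vs val DFL loss and mark where val DFL bottoms out.
plt.plot(df["epoch"], df["train/dfl_loss"], label="train DFL")
plt.plot(df["epoch"], df["val/dfl_loss"], label="val DFL")
best_epoch = df.loc[df["val/dfl_loss"].idxmin(), "epoch"]
plt.axvline(best_epoch, linestyle="--", color="gray", label=f"val min @ {best_epoch}")
plt.xlabel("epoch")
plt.ylabel("DFL loss")
plt.legend()
plt.show()
```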

Confusion Matrices (Final Epoch)

| Run | Varroa Recall | BG→Varroa FP | Notes |
|---|---|---|---|
| YOLOv8s 200ep | 0.87 | 458 | baseline |
| YOLOv8s 1000ep | 0.92 | 257 | ↓44% vs baseline |
| YOLOv8s 2000ep | 0.92 | 237 | ↓48% vs baseline |
| YOLOv8L 1000ep | 0.91 | 98 | ↓79% vs baseline |
| RT-DETR-L 1000ep | 0.91 | 171 | bee→bg misses: 762 |
| RT-DETR-L 200ep | 0.94 | 555 | undertrained |
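These per-run figures come straight off the confusion matrix. A minimal sketch of the arithmetic, assuming the Ultralytics layout (rows = predicted, columns = true, with background as the last class) and an illustrative bee/varroa class order; the cell values are made up, not the real run counts:

```python
import numpy as np

# Illustrative confusion matrix in the assumed Ultralytics layout:
# rows = predicted (bee, varroa, background), columns = true.
# Values are made up for demonstration; they are not the run results.
cm = np.array([
    [7900,   4,  20],  # predicted bee
    [   3, 330, 100],  # predicted varroa
    [  25,  30,   0],  # predicted background
])

BEE, VARROA, BG = 0, 1, 2  # assumed class indices

varroa_recall = cm[VARROA, VARROA] / cm[:, VARROA].sum()  # correct varroa / all true varroa
bg_to_varroa_fp = cm[VARROA, BG]   # true background predicted as varroa
bee_to_bg_miss = cm[BG, BEE]       # true bees predicted as background

print(f"varroa recall {varroa_recall:.2f}, "
      f"bg→varroa FP {bg_to_varroa_fp}, bee→bg misses {bee_to_bg_miss}")
```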

Analysis & Observations

YOLOv8L Wins — RT-DETR Doesn't Beat It
On every headline metric, YOLOv8L 1000ep beats RT-DETR 1000ep: mAP50 0.971 vs 0.964, mAP50-95 0.831 vs 0.789, precision 0.979 vs 0.953, recall 0.950 vs 0.929. The convolutional architecture with larger capacity outperforms the transformer on this dataset.
RT-DETR Still Trails YOLOv8L on BG→Varroa FP
RT-DETR 1000ep posts 171 BG→varroa false positives vs YOLOv8L's 98, so it doesn't take the FP benchmark either. However, this is a massive improvement over the 555 at 200ep, confirming that run was simply undertrained. The transformer attention is doing real work on the varroa/background distinction.
RT-DETR Has a Bee Miss Problem
The big red flag: 762 bees predicted as background (bee recall ~0.90 normalised). YOLOv8L has essentially zero of this — bees are saturated across all YOLO runs. RT-DETR is trading bee detection confidence for varroa sensitivity in a way YOLO never did. The val cls_loss also spikes hard after ep820, suggesting the model started overfitting its classification head late in training.
Val Cls Loss Spike — Overfit Signal
Val cls_loss reached its minimum at epoch 820 (0.275) then rose sharply to 0.407 by ep1000. This is the RT-DETR equivalent of the YOLO val DFL divergence — the classification head is overfitting. Best mAP50 was actually at ep818 (0.969), suggesting the saved weights at that checkpoint would be stronger than the final model.
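If the run's checkpoints were kept, this is straightforward to verify by validating the saved best weights against the final ones. A sketch assuming Ultralytics' RTDETR class and placeholder paths; best.pt is the fitness-best checkpoint Ultralytics saves alongside last.pt:

```python
from ultralytics import RTDETR

# Compare the fitness-best checkpoint against the final-epoch weights.
# The run directory and dataset YAML are placeholders.
for ckpt in ("best.pt", "last.pt"):
    model = RTDETR(f"runs/detect/rtdetr_1000ep/weights/{ckpt}")
    metrics = model.val(data="varroa.yaml")
    print(ckpt, "mAP50:", metrics.box.map50, "mAP50-95:", metrics.box.map)
```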
mAP50-95 Still Rising at ep1000
Like YOLOv8L, RT-DETR's mAP50-95 was still climbing at epoch 1000 (best at ep999: 0.789). Box localisation is still improving even as classification starts to overfit. This is a consistent pattern across both architectures on this dataset.
Production Candidate: YOLOv8L
YOLOv8L 1000ep is the clear winner: best mAP50, best mAP50-95, lowest FP, highest precision and recall, no bee miss problem. The best checkpoint (ep783 weights) is the strongest model in the series. RT-DETR would need architectural tuning or more data to close the gap.
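Deploying that candidate is then a short script against the saved best checkpoint. A sketch with placeholder paths and an illustrative confidence threshold, not a vetted production config:

```python
from ultralytics import YOLO

# Load the best YOLOv8L checkpoint (path is a placeholder) and run
# inference on a folder of hive frame images.
model = YOLO("runs/detect/yolov8l_1000ep/weights/best.pt")
results = model.predict("hive_frames/", conf=0.25)  # source and conf are illustrative
for r in results:
    print(r.path, "->", len(r.boxes), "detections")
```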

Results Summary

| Model / Run | Epochs | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall | BG→Varroa FP | Status |
|---|---|---|---|---|---|---|---|
| YOLOv8s 200ep | 200 | 0.9450 | 0.6079 | 0.9527 | 0.9166 | 458 | |
| YOLOv8s 400ep | 400 | 0.9549 | 0.6735 | 0.9624 | 0.9358 | 357 | |
| YOLOv8s 1000ep | 1000 | 0.9648 | 0.7496 | 0.9733 | 0.9466 | 257 | |
| YOLOv8s 2000ep | 2000 | 0.9620 (peak 0.9658) | 0.7734 | 0.9680 | 0.9477 | 237 | |
| YOLOv8L 1000ep | 1000 | 0.9708 (peak 0.9733) | 0.8314 | 0.9791 | 0.9501 | 98 | |
| RT-DETR-L 200ep | 200 | 0.9635 | 0.6406 | 0.9481 | 0.9436 | 555 | |
| RT-DETR-L 1000ep | 1000 | 0.9636 (peak 0.9686) | 0.7893 | 0.9534 | 0.9287 | 171 | bee miss: 762 |

Note on RT-DETR Loss Curves

RT-DETR uses GIoU + L1 + classification losses vs YOLOv8's box + cls + dfl. Loss values are not cross-comparable between architectures.