Model Training Results

Systematic evaluation across YOLOv8s hyperparameter configurations and RT-DETR architecture — imgsz=320, 200 epochs, clean 2-dataset vs expanded 124-dataset pool

7        Experiments Run
200      Epochs Each
0.9635   Best mAP50
0.6406   Best mAP50-95
320 px   Image Size

Phase 1: Baseline Diagnostic Results

Before tuning anything, the baseline model's outputs were analyzed to understand what the model was learning and where it was struggling.

Training Progress Curves

Figure 1: Training and Validation Loss Curves (Box, Cls, DFL).

Training Convergence

The results.png plot shows healthy convergence. Both training and validation losses (Box, Cls, and DFL) decrease steadily, confirming the model is learning without immediate overfitting.

  • Box Loss: How accurately the model pinpoints the mite's location.
  • Cls Loss: Accuracy of the Bee vs. Varroa classification.
  • DFL Loss: Refines bounding box edges for small, hard-to-distinguish objects.
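The same convergence read can be checked numerically from results.csv, the per-epoch log Ultralytics writes alongside results.png. A minimal sketch; the column names are the library's defaults, but the head-vs-tail convergence rule is our own heuristic, not part of Ultralytics:

```python
import csv

# Validation-loss columns Ultralytics writes to results.csv during training.
LOSS_COLS = ["val/box_loss", "val/cls_loss", "val/dfl_loss"]

def is_converging(losses, window=10):
    """True if the mean loss over the last `window` epochs is below
    the mean over the first `window` epochs."""
    if len(losses) < 2 * window:
        window = max(1, len(losses) // 2)
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

def check_run(csv_path):
    """Report per-loss convergence for one run's results.csv.
    (Ultralytics pads its CSV headers with spaces, hence the strip.)"""
    with open(csv_path, newline="") as f:
        rows = [{k.strip(): v for k, v in row.items()}
                for row in csv.DictReader(f)]
    return {c: is_converging([float(r[c]) for r in rows]) for c in LOSS_COLS}
```

A rising tail on val/cls_loss with a falling tail on the box losses is exactly the overfitting signature discussed later for the cls=2.0 runs.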

The "Hallucination" Factor

Normalized Confusion Matrix

The Confusion Matrix highlights our primary challenge: a 56% background-to-varroa error rate. The model frequently mistakes bee anatomy or shadows for mites — this is the core problem the later experiments address.
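Row normalization is what turns raw counts into the rates quoted above. A minimal sketch; the counts below are illustrative, chosen only so the background-as-varroa cell reproduces the 0.56 figure, and they are not the report's actual matrix (Ultralytics' own plot may normalize by column instead):

```python
import numpy as np

def normalize_rows(cm):
    """Row-normalize a confusion matrix so each true-class row sums to 1."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

# Hypothetical counts: rows = true class, cols = predicted class,
# order [bee, varroa, background]. Not the report's real numbers.
counts = [[950, 10, 40],
          [5,  80, 15],
          [30, 70, 25]]
norm = normalize_rows(counts)
# norm[2, 1] is the fraction of background instances predicted as
# varroa -- the "hallucination" cell discussed above.
print(round(norm[2, 1], 2))  # → 0.56
```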

Precision-Recall Tradeoff

Precision-Recall Curve

The PR Curve reveals the gap between classes. Bee detection is near-perfect (0.96 mAP), while the Varroa curve (0.63 mAP) drops sharply — the model struggles to maintain accuracy as it tries to find more mites.
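The per-class AP values quoted above are the area under each PR curve. A minimal sketch of the standard envelope-and-integrate computation; this is not Ultralytics' exact code, which additionally interpolates the curve over a fixed recall grid:

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve, using the monotone
    precision envelope as in standard mAP implementations."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Make precision monotonically non-increasing (the envelope).
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate precision over the recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

A class whose precision collapses at high recall, as Varroa's does here, loses most of its area on the right-hand side of this integral, which is why its AP lags Bee's despite decent low-recall precision.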

Class Imbalance & The P2 Plan

Labels.jpg confirms the ~4:1 bee-to-varroa class imbalance visible in the dataset. In object detection, this can cause the model to favor the "easier" majority class (Bees) while neglecting the "harder" minority class (Varroa).

Why P2? Because varroa mites are often less than 10 pixels wide. A P2 detection head analyses at stride 4 (higher resolution) to ensure tiny features aren't lost in the feature pyramid.

Dataset Label Distribution

Figure 4: Spatial distribution and instance counts of Bee vs. Varroa labels.
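The stride arithmetic behind the P2 rationale is easy to verify: at imgsz=320, each P3 cell covers 8 px, so a sub-10 px mite barely fills one cell, while P2's 4 px cells give it a roughly 2.5-cell footprint. A back-of-envelope sketch, not model code:

```python
def grid_for(imgsz, stride):
    """Feature-map side length and pixel footprint of one grid cell."""
    return imgsz // stride, stride

imgsz = 320
for name, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    side, cell = grid_for(imgsz, stride)
    cells_per_mite = 10 / cell  # a ~10 px varroa mite
    print(f"{name}: {side}x{side} grid, {cell} px/cell, "
          f"mite spans ~{cells_per_mite:.1f} cells")
```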

Key Findings Across All Experiments

RT-DETR leads overall

RT-DETR achieves mAP50=0.9635 and mAP50-95=0.6406 — outperforming the best YOLOv8s run by +0.018 and +0.033 respectively, with recall up +0.027. The transformer architecture generalises better on classification without needing the cls=0.05 fix.

All runs still converging at epoch 200

Best epochs occur at 196–200 across every experiment. Extended training with early stopping is the clear next step — particularly for RT-DETR, whose mAP50-95 curve shows no sign of plateauing.

cls=0.05 is the strongest single YOLO change

Reducing classification loss weight from 2.0 to 0.05 consistently improved mAP50-95 and eliminated the rising val cls loss seen in cls=2.0 runs after epoch 50 — a clear sign of overfitting on classification. Mixup augmentation showed no consistent benefit across either dataset scale.
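For reproduction, the classification-loss gain is an ordinary Ultralytics train argument. A hedged CLI sketch; the dataset YAML and run name are placeholders, not the report's actual paths:

```shell
# Reduced classification-loss gain (cls=0.05) at the report's
# resolution and schedule. varroa.yaml and the run name are
# illustrative placeholders.
yolo detect train model=yolov8s.pt data=varroa.yaml \
    imgsz=320 epochs=200 cls=0.05 name=yolov8s_cls005_2ds
```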

2 datasets outperform 124 datasets

All 2-dataset runs score ~0.94+ mAP50 vs ~0.86–0.89 for 124-dataset runs. The gap points to quality differences or distribution mismatch in the extended dataset pool — consistent with the findings from the Debug Saga.

Training Curves

All 7 runs plotted over 200 epochs.

Chart panels: Experiment Legend · mAP50 (validation) · mAP50-95 (validation) · Val Classification Loss · Val Box / GIoU Loss.

Note on Val Box / GIoU Loss

RT-DETR uses GIoU loss for bounding box regression while YOLOv8s uses CIoU-based box loss. The two losses operate on different scales (~0.28–0.41 vs ~1.02–1.24) and are not directly comparable — treat them as independent convergence indicators per architecture.
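The scale difference is inherent to the loss definitions rather than to training quality. For reference, GIoU in a few lines, assuming axis-aligned (x1, y1, x2, y2) boxes; this is the textbook formula, not RT-DETR's implementation:

```python
def giou(a, b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2).
    GIoU = IoU minus the fraction of the smallest enclosing box C
    not covered by the union; it ranges over (-1, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection area.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest enclosing box C.
    c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c - union) / c

print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # → 1.0 (perfect overlap)
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # negative: disjoint boxes
```

Unlike plain IoU, GIoU stays informative (negative, not zero) for non-overlapping boxes, which is why DETR-family models favour it for regression; CIoU adds centre-distance and aspect-ratio terms on top, hence the different numeric range.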

Final Epoch Comparison

All runs sorted by mAP50-95 (best first).

RT-DETR vs Best YOLOv8s

Head-to-head: RT-DETR · 2ds versus YOLOv8s cls=0.05 · 2ds (best YOLO configuration).

Metric         RT-DETR · 2ds   YOLOv8s cls=0.05 · 2ds   Δ (RT-DETR − YOLO)
mAP50          0.9635          0.9452                   +0.0183
mAP50-95       0.6406          0.6079                   +0.0327
Precision      0.9480          0.9526                   −0.0046
Recall         0.9436          0.9166                   +0.0270
Val cls loss   0.3320          0.3507                   −0.0187

Next Steps

Recommended next experiments

  1. Extend training to 300–400 epochs with early stopping — all runs peaked at epoch 200 with no plateau
  2. Test cls=0.05 on RT-DETR — the transformer already achieves low cls loss by default; confirm whether the fix still applies
  3. Add P2 detection head (stride 4) — varroa instances are often <10px; the P2 head analyses at higher resolution to recover these
  4. Investigate 2ds vs 124ds gap — audit 124-dataset sources for labelling quality and distribution mismatch
  5. Edge deployment evaluation — benchmark RT-DETR and YOLOv8s inference speed on target hive hardware
