The complete story of building a machine learning varroa detection system, from datasets to deployed model.
How we use computer vision and ML to monitor our hives 24/7 for varroa mites — the biggest threat to honeybee colonies worldwide
So far, I'm fortunate to report zero varroa detections across all six hives. This system wasn't built in response to an infestation; it's a proactive passion project designed to catch problems before they start. Think of it as an early warning system that complements my existing monitoring methods, particularly alcohol washes.
The goal was never just to detect varroa after it appears, but to build a continuous monitoring system that gives me peace of mind and catches the first signs of trouble weeks earlier than traditional methods alone.
Varroa mites are external parasites that attach to honeybees and feed on their fat body tissue and hemolymph (bee blood). Left unchecked, they can devastate an entire colony within months. Early detection is critical, but manually inspecting thousands of bees is time-consuming and often catches infestations too late.
Traditional monitoring methods involve sticky boards, alcohol washes, or visual inspections during hive checks. These are invasive, labor-intensive, and only give you a snapshot of a single moment in time. We needed something better.
Here's what the system looks like in practice, from the mites we're detecting to the alerts in action.
Each hive has its own monitoring station, completely solar-powered and weatherproof. Here's what goes into each setup:
The cameras are mounted directly above the hive entrance, pointing down at a 45° angle. This captures bees as they enter and exit, giving us clear views of their backs, where varroa mites typically attach. The 4K resolution is essential because varroa mites are only 1–2 mm across.
I used YOLOv11n (YOLO "nano" - the lightweight version) for object detection. Here's how I trained it:
I found approximately 7,000 labeled images of varroa-infested bees in Ultralytics' open dataset. These images show bees with clearly visible varroa mites attached, along with bounding box annotations marking exactly where the mites are located.
Since I was working with two separate datasets (V1 and V2), I needed to merge them into a single master dataset. Here's the Python script I used to combine the validation sets while preventing filename conflicts:
import shutil
from pathlib import Path

# Paths to source datasets
sources = [
    Path(r'C:\Users\teapot\Documents\Projects\VarroaDetection\datasets\V1'),
    Path(r'C:\Users\teapot\Documents\Projects\VarroaDetection\datasets\V2'),
]

# Master dataset directories
master_val_img = Path(r'C:\...\master_dataset\val\images')
master_val_lbl = Path(r'C:\...\master_dataset\val\labels')

# Create master directories
master_val_img.mkdir(parents=True, exist_ok=True)
master_val_lbl.mkdir(parents=True, exist_ok=True)

# Common names for validation sets
val_names = ['val', 'valid', 'validation', 'test']

for src in sources:
    found_in_src = False
    for v_name in val_names:
        img_dir = src / v_name / 'images'
        lbl_dir = src / v_name / 'labels'
        if img_dir.exists():
            print(f"Found data in: {img_dir}")
            found_in_src = True
            for file in img_dir.iterdir():
                if file.suffix.lower() in ['.jpg', '.jpeg', '.png']:
                    # Use source folder name as prefix to prevent conflicts
                    prefix = f"{src.name}_{v_name}_"
                    shutil.copy2(file, master_val_img / f"{prefix}{file.name}")
                    # Copy corresponding label file
                    label_file = lbl_dir / f"{file.stem}.txt"
                    if label_file.exists():
                        shutil.copy2(label_file, master_val_lbl / f"{prefix}{label_file.name}")
    if not found_in_src:
        print(f"!!! Warning: No validation folders found in {src}")

print(f"\nMerge complete. Total images: {len(list(master_val_img.glob('*')))}")
This script intelligently merges datasets by adding prefixes (like "V1_val_" or "V2_val_") to filenames, preventing any overwrites. It also handles different naming conventions for validation folders and ensures that both images and their corresponding label files are copied together.
Chose YOLOv11n because:
- Epochs: 100
- Batch size: 16
- Image size: 640x640
- Augmentation: Enabled (flips, rotations, brightness adjustments)
- Validation split: 20%
- Hardware: NVIDIA GPU (local training)
Here's the complete training script I used. Running this on a laptop with an NVIDIA GPU took approximately 2-3 hours to complete all 100 epochs:
from ultralytics import YOLO

if __name__ == '__main__':
    # Load the YOLOv11 nano model (optimized for speed)
    model = YOLO('yolo11n.pt')
    # Start training
    model.train(
        data=r'C:\Users\teapot\Documents\Projects\VarroaDetection\datasets\master_dataset\data.yaml',
        epochs=100,
        imgsz=640,
        batch=16,   # Adjusted for laptop GPU memory
        device=0,   # Use NVIDIA GPU (0 = first GPU)
        workers=4,  # Data loading threads
        name='varroa_yolo11n_100epochs'
    )
Key parameters explained:
GPU Memory Issues? If you get "CUDA out of memory" errors, reduce the batch size to 8 or even 4. Training will take longer but won't crash.
No GPU? Set device='cpu' in the training script. It'll be much slower (10-20x), but it works. Consider using Google Colab's free GPU for faster training.
Monitoring Training: YOLO automatically saves training curves, metrics, and model checkpoints to runs/detect/varroa_yolo11n_100epochs/ after each epoch. Check these to see how your model is improving!
After training, the model achieved:
After 100 epochs of training, here's how the YOLOv11n model performed. These metrics help us understand not just whether the model works, but how well it works and where it might need improvement.
The results chart shows multiple key metrics tracked during training. The top row shows three types of loss (box, class, and DFL) decreasing over time—this means the model is learning. The bottom row shows validation losses, which tell us the model isn't just memorizing the training data but can generalize to new images.
The right side of the training results shows the most important numbers:
With 94% precision and 91% recall, this model strikes a good balance. High precision means I won't get flooded with false alarms about varroa that isn't there. High recall means the system won't miss many real infestations. For a beekeeping application where early detection is critical, this is exactly the balance we want.
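To make that balance concrete, precision and recall can be folded into a single F1 score (their harmonic mean). The snippet below is only an illustrative sketch; the function name `f1_score` is hypothetical and not part of the training pipeline.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above: 94% precision, 91% recall
print(f"F1 = {f1_score(0.94, 0.91):.3f}")
```

A harmonic mean punishes a large gap between the two numbers, so a high F1 only happens when both precision and recall are high.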
This curve is crucial for tuning the system. It shows that the "bee" class (blue) achieves near-perfect precision very quickly, while "varroa" (orange) requires moderate confidence (around 0.4-0.5) to reach peak precision. The steep rise in the varroa curve means the model quickly becomes confident when it detects a mite—exactly what we want.
This is the classic machine learning trade-off curve. For the varroa class, we achieved an mAP of 87.2%, meaning the model maintains high precision across a wide range of recall values. The "all classes" curve (blue) shows an overall mAP@0.5 of 93.3%, a strong result indicating the model generalizes well.
This curve shows how many varroa mites we catch at different confidence thresholds. At near-zero confidence, we catch 92% of all mites (high recall). But as we increase the threshold to reduce false positives, recall drops. The sweet spot for this system is around 0.5-0.7 confidence, where we still catch most mites while filtering out obvious false detections.
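The relationship between threshold and recall can be sketched in a few lines. This is a simplified illustration with made-up confidences, not the evaluation code Ultralytics runs; `recall_at_threshold` is a hypothetical helper, and it assumes each detection has already been matched to a real mite.

```python
def recall_at_threshold(matched_confidences, total_mites, threshold):
    """Fraction of ground-truth mites whose matched detection clears the threshold."""
    if total_mites == 0:
        return 0.0
    kept = sum(1 for c in matched_confidences if c >= threshold)
    return kept / total_mites


# Made-up example: 10 real mites, 9 of which the model detected at some confidence
confidences = [0.95, 0.91, 0.88, 0.80, 0.74, 0.66, 0.52, 0.31, 0.12]
for t in (0.0, 0.5, 0.7):
    print(t, recall_at_threshold(confidences, 10, t))
```

Raising the threshold can only remove detections, so recall never increases as the threshold goes up.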
The confusion matrices reveal the model's strengths and weaknesses:
The normalized matrix shows that 96% of varroa predictions are classified as varroa or background, with very few confused as bees. This is ideal because it means the primary source of error is missing mites (false negatives) rather than hallucinating them (false positives). For a monitoring system, it's better to miss a few mites and catch them later than to constantly trigger false alarms.
This diagnostic chart reveals important characteristics of the training dataset:
The small, centered distribution of varroa bounding boxes explains why high-resolution 4K cameras are essential— these mites occupy only a tiny fraction of each image frame.
Based on these results, I've configured the production system with a confidence threshold of 0.70. This means the model must be at least 70% confident before triggering an alert. At this threshold:
The confusion matrix confirms this is the right approach: most errors are missed detections (which get caught in the next photo 30 minutes later) rather than false alarms (which would erode trust in the system).
Here's how the entire system works, from camera to inbox:
graph TD
A[4K Camera] -->|Every 30 min| B[Raspberry Pi]
B -->|Capture Image| C[Local Storage]
C -->|Upload via WiFi| D[Server Directory]
D -->|Cron Job Every 30min| E[Analysis Script]
E -->|Load Image| F[YOLOv11n Model]
F -->|Run Inference| G[Varroa Detected?]
G -->|No| H[Log and Continue]
G -->|Yes| I[Generate Alert]
I -->|Email| J[Beekeeper Inbox]
I -->|Include| K[Annotated Image]
I -->|Include| L[Hive Identifier]
I -->|Include| M[Detection Confidence]
J -->|Manual Review| N[Inspect Hive]
A cron job on each Raspberry Pi triggers the camera every 30 minutes. The Pi captures a 4K image of the hive entrance, stamps it with a timestamp and hive ID, and saves it locally.
*/30 * * * * /home/pi/capture_image.sh
Each Raspberry Pi then uploads the image to a central server directory via rsync or SFTP. The filename includes the hive name and timestamp for easy identification.
padme_2026-02-08_0830.jpg
galadriel_2026-02-08_0830.jpg
frigga_2026-02-08_0830.jpg
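Because the hive name and timestamp live in the filename, the server can recover both without any extra metadata. A minimal sketch, assuming the `<hive>_<YYYY-MM-DD>_<HHMM>.jpg` pattern shown above; `parse_capture_filename` is a hypothetical helper name.

```python
from datetime import datetime
from pathlib import Path


def parse_capture_filename(filename: str) -> tuple[str, datetime]:
    """Split '<hive>_<YYYY-MM-DD>_<HHMM>.jpg' into (hive name, capture time)."""
    stem = Path(filename).stem
    # Split from the right so a hive name containing '_' stays intact
    hive, date_part, time_part = stem.rsplit("_", 2)
    captured_at = datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H%M")
    return hive, captured_at


hive, captured_at = parse_capture_filename("padme_2026-02-08_0830.jpg")
print(hive, captured_at.isoformat())
```

A filename that does not match the pattern raises `ValueError`, which is a convenient way to skip stray files in the upload directory.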
A separate cron job on the server runs every 30 minutes, scanning the upload directory for new images. For each new image:
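# Simplified outline; upload_directory is a pathlib.Path, and the helper
# functions (draw_boxes, send_alert_email, ...) are defined elsewhere
for new_image in upload_directory.glob('*.jpg'):
    result = model.predict(new_image)[0]
    # Keep only varroa detections above the alert threshold
    confidences = [
        float(conf)
        for cls_id, conf in zip(result.boxes.cls, result.boxes.conf)
        if result.names[int(cls_id)] == 'varroa' and conf > 0.70
    ]
    if confidences:
        annotated_image = draw_boxes(new_image, result)
        hive_name = extract_hive_from_filename(new_image)
        timestamp = extract_timestamp(new_image)
        send_alert_email(
            to="[email protected]",
            subject=f"Varroa Alert: {hive_name}",
            body=f"Varroa detected at {timestamp}",
            attachment=annotated_image
        )
        log_detection(hive_name, timestamp, max(confidences))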
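The alert decision itself can be sketched with nothing but the standard library. This is an illustrative sketch, not the production script: `build_alert` is a hypothetical helper, detections are assumed to arrive as `(class_name, confidence)` pairs, and the recipient address is a placeholder. Actually sending the message (for example with `smtplib`) and attaching the annotated image are left out.

```python
from email.message import EmailMessage

ALERT_THRESHOLD = 0.70  # same threshold as in the outline above


def build_alert(hive_name, timestamp, detections, recipient):
    """Return an EmailMessage if any varroa detection clears the threshold, else None.

    `detections` is a list of (class_name, confidence) pairs.
    """
    hits = [conf for name, conf in detections if name == "varroa" and conf > ALERT_THRESHOLD]
    if not hits:
        return None
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Varroa Alert: {hive_name}"
    msg.set_content(
        f"Varroa detected at {timestamp}\n"
        f"Detections above threshold: {len(hits)} (max confidence {max(hits):.2f})"
    )
    return msg


alert = build_alert(
    "padme", "2026-02-08 08:30",
    [("bee", 0.98), ("varroa", 0.82), ("varroa", 0.41)],
    "beekeeper@example.com",  # placeholder address
)
print(alert["Subject"] if alert else "no alert")
```

Keeping the decision separate from the sending step makes it easy to test the threshold logic without a mail server.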
When varroa is detected, I receive an email with:
Email is simple, reliable, and doesn't require maintaining a separate dashboard or app. I get push notifications on my phone, can view the image immediately, and have a searchable history of all detections. Plus, I can forward alerts to other beekeepers if I'm away.
Once I receive an alert, I:
Here's a simplified view of how data moves through the system:
sequenceDiagram
participant C as Camera
participant RPi as Raspberry Pi
participant S as Server
participant M as ML Model
participant B as Beekeeper
C->>RPi: Capture image (every 30min)
RPi->>RPi: Save locally
RPi->>S: Upload via WiFi
S->>S: New images detected
S->>M: Run inference
M->>M: Analyze for varroa
alt Varroa detected
M->>S: Return detections
S->>B: Send email alert
S->>S: Log event
else No varroa
M->>S: No detections
S->>S: Log clear result
end
One of the biggest challenges with the current system is false positives—pollen, shadows, and debris can sometimes be mistaken for varroa mites. The next major improvement I'm working on is using K-means clustering to analyze the color profile of detected objects.
import numpy as np
from sklearn.cluster import KMeans

# After YOLO detects potential varroa
if yolo_detection.confidence > 0.70:
    # Extract the detected region (rows = y, columns = x; image assumed to be RGB)
    cropped_region = image[bbox.y1:bbox.y2, bbox.x1:bbox.x2]
    # Apply K-means clustering (k=3 to find dominant colors)
    kmeans = KMeans(n_clusters=3, n_init=10)
    labels = kmeans.fit_predict(cropped_region.reshape(-1, 3))
    # The dominant color is the center of the most populated cluster
    dominant_color = kmeans.cluster_centers_[np.bincount(labels).argmax()]
    # Check if dominant color matches varroa profile
    varroa_color_range = ([120, 50, 10], [160, 90, 40])  # RGB range
    if is_color_in_range(dominant_color, varroa_color_range):
        confidence_boost = 1.2  # Increase confidence
        send_alert()
    else:
        # Likely false positive - log but don't alert
        log_potential_false_positive()
This approach should dramatically reduce false positives from yellow pollen (which clusters toward RGB ~255, 200, 0) and dark shadows (which cluster toward low RGB values). By combining computer vision object detection with color analysis, we get a much more robust detection system.
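The color check referenced in the snippet could be as simple as a per-channel bounds test. A minimal sketch, assuming colors are RGB triples and the range is an inclusive `(low, high)` pair; the sample colors below are illustrative, not measured from real images.

```python
def is_color_in_range(color, color_range):
    """True if every channel of `color` lies within the inclusive (low, high) bounds."""
    low, high = color_range
    return all(lo <= c <= hi for c, lo, hi in zip(color, low, high))


varroa_color_range = ([120, 50, 10], [160, 90, 40])  # RGB range from the snippet above

print(is_color_in_range((140, 70, 25), varroa_color_range))  # reddish-brown
print(is_color_in_range((255, 200, 0), varroa_color_range))  # yellow pollen
print(is_color_in_range((15, 12, 10), varroa_color_range))   # dark shadow
```

A fixed RGB box is sensitive to lighting, so the bounds would likely need tuning against real crops from each camera.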
graph LR
A[Hardware Layer] --> B[Raspberry Pi 4]
A --> C[4K Camera Module]
A --> D[Solar Panel System]
E[Software Layer] --> F[Python 3.11]
E --> G[Ultralytics YOLOv11]
E --> H[OpenCV]
E --> I[PIL/Pillow]
J[Infrastructure] --> K[Linux Server]
J --> L[Cron Jobs]
J --> M[SMTP Email]
J --> N[rsync/SFTP]
O[ML Pipeline] --> P[Model: YOLOv11n]
O --> Q[Dataset: ~7000 images]
O --> R[Framework: PyTorch]
I'm happy to share more details about the setup, code, or answer questions from fellow beekeepers interested in implementing similar systems. This kind of technology could be game-changing for small-scale beekeepers who don't have time for constant manual monitoring.
Get in touch: [email protected]
Building a production-ready varroa detection model from multiple datasets — the challenges, solutions, and lessons learned
Training an accurate object detection model requires high-quality, consistently labeled data. My journey began with four datasets from different sources, each with its own labeling conventions.
| Dataset | Train | Validation | Test | Label Format |
|---|---|---|---|---|
| V1 | 6,075 | 1,737 | 868 | Bee=0, Varroa=1 ✓ |
| V2 | 8,217 | 1,867 | 3,408 | Bee=0, Varroa=1 ✓ |
| V3 | 8,093 | 1,175 | 468 | Bee=0, Varroa=1 ✓ |
| V4 | 5,144 | 0 | 0 | Varroa=0, Bee=1 |
Dataset V4 used the opposite labeling convention. It labeled varroa as class 0 and bees as class 1, while all other datasets and my standard convention used class 0 for bees and class 1 for varroa. This required a label conversion step before merging.
My first attempt to fix the V4 labels had a critical bug:
mapping = {'0': '1', '3': '0'}  # Wrong approach!
for line in lines:
    parts = line.strip().split()
    old_id = parts[0]
    if old_id in mapping:
        parts[0] = mapping[old_id]  # Bug: Sequential mapping causes collision
Before: 5,851 bees (class 0) + 5,235 varroa (class 1)
After bug: 0 bees + 11,086 varroa — everything merged into class 1!
The script converted all class 0 → class 1 first, then had nothing left to convert the other direction. Bee labels were completely lost.
for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    old_id = parts[0]
    if old_id == '0':
        parts[0] = '1'  # varroa (was 0) → 1
        modified = True
    elif old_id == '1':
        parts[0] = '0'  # bee (was 1) → 0
        modified = True
All 5,144 V4 files correctly converted with bee=0, varroa=1. Both classes preserved.
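A quick way to guard against this kind of bug is to count classes before and after the conversion. The sketch below is a self-contained illustration that works on label text rather than files; `swap_class_ids` and `class_counts` are hypothetical helper names, and the sample annotations are made up.

```python
from collections import Counter

SWAP = {"0": "1", "1": "0"}  # V4 convention -> standard convention


def swap_class_ids(label_text: str) -> str:
    """Swap class 0 and class 1 in YOLO-format label text, one lookup per line."""
    out = []
    for line in label_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # A single lookup on the ORIGINAL id, so a swapped id is never remapped
        parts[0] = SWAP.get(parts[0], parts[0])
        out.append(" ".join(parts))
    return "\n".join(out)


def class_counts(label_text: str) -> Counter:
    """Count annotations per class id."""
    return Counter(line.split()[0] for line in label_text.splitlines() if line.strip())


before = "0 0.50 0.50 0.10 0.10\n1 0.30 0.40 0.20 0.25\n1 0.70 0.60 0.20 0.25"
after = swap_class_ids(before)
# Sanity check: the two counts should trade places, and the total should be unchanged
print(class_counts(before), class_counts(after))
```

If either class count drops to zero after conversion, something has collapsed the labels and the merge should stop.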
| Split | Bees (Class 0) | Varroa (Class 1) | Ratio |
|---|---|---|---|
| Training | 55,287 | 13,397 | 4.13:1 |
| Validation | 14,343 | 4,010 | 3.58:1 |
| Total | 69,630 | 17,407 | 4.00:1 |
The 4:1 bee-to-varroa ratio is a significant but realistic class imbalance — in real hive frames, bees genuinely outnumber varroa mites. The strategy was to start with default training and apply class weighting only if varroa detection performance proved insufficient.
Understanding native image resolution is critical for choosing the right imgsz parameter. Upscaling low-resolution images creates artificial detail and slows training without improving accuracy.
Using common YOLO defaults like imgsz=640 would upscale most images by 4×, creating interpolation artifacts and teaching the model from synthetic detail rather than real image content.
| imgsz | Upscaled Images | Upscale Factor | Assessment |
|---|---|---|---|
| 160 | 0% | 1.0× | No upscaling |
| 320 | 79.8% | 2.0× | Manageable |
| 512 | 79.8% | 3.2× | Significant upscaling |
| 640 | 97.9% | 4.0× | Excessive upscaling |
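The numbers in a table like this come from comparing each image's native size with the candidate imgsz. A minimal sketch of that calculation, assuming only the longest side matters; both function names are hypothetical and the sample sizes are made up rather than taken from the real datasets.

```python
def upscale_factor(native_long_side: int, imgsz: int) -> float:
    """How much the longest image side is enlarged when resized to imgsz (1.0 = none)."""
    return max(1.0, imgsz / native_long_side)


def percent_upscaled(native_long_sides, imgsz: int) -> float:
    """Share of images (in %) whose longest side is smaller than imgsz."""
    sides = list(native_long_sides)
    if not sides:
        return 0.0
    return 100.0 * sum(1 for s in sides if s < imgsz) / len(sides)


# Made-up sample of native sizes (longest side, in pixels)
sample = [160, 160, 160, 160, 640]
for imgsz in (160, 320, 640):
    print(imgsz, upscale_factor(160, imgsz), percent_upscaled(sample, imgsz))
```

In practice the native sizes would be read from the image files themselves, for example with Pillow.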
Small objects (<16px) are notoriously difficult for YOLO to detect. This analysis determines how large varroa bounding boxes appear at different imgsz settings.
| imgsz | Median Box Size | 10th Percentile | Boxes <16px | Assessment |
|---|---|---|---|---|
| 160 | 23.5px | 15.0px | 13.3% | Too many tiny boxes |
| 320 | 46.9px | 30.0px | 0.7% | Optimal balance |
| 512 | 75.1px | 48.0px | 0.3% | Good but more upscaling |
| 640 | 93.9px | 60.0px | 0.1% | Excessive upscaling |
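YOLO labels store box width and height as fractions of the image, so the on-screen size at a given imgsz is just a multiplication. The sketch below assumes "box size" means the geometric mean of width and height, which may differ from the exact definition used for the table; the helper names and sample boxes are hypothetical.

```python
def box_size_px(norm_w: float, norm_h: float, imgsz: int) -> float:
    """Approximate box size in pixels (geometric mean of width and height) at imgsz."""
    return ((norm_w * imgsz) * (norm_h * imgsz)) ** 0.5


def share_below(boxes, imgsz: int, min_px: float = 16.0) -> float:
    """Percentage of (norm_w, norm_h) boxes smaller than min_px at the given imgsz."""
    boxes = list(boxes)
    if not boxes:
        return 0.0
    small = sum(1 for w, h in boxes if box_size_px(w, h, imgsz) < min_px)
    return 100.0 * small / len(boxes)


# Made-up normalized (width, height) pairs for four varroa boxes
boxes = [(0.15, 0.15), (0.09, 0.09), (0.20, 0.20), (0.04, 0.04)]
for imgsz in (160, 320):
    print(imgsz, share_below(boxes, imgsz))
```

Doubling imgsz doubles every box size, which is why the share of sub-16px boxes falls so quickly between 160 and 320.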
| Factor | YOLOv8s | YOLOv11s |
|---|---|---|
| Maturity | Battle-tested | Relatively new |
| Documentation | Extensive | Growing |
| Small Object Detection | Proven excellent | Similar/slightly better |
| Inference Speed | Fast | 10–15% faster |
| mAP Performance | Excellent | 1–2% better |
| Best For | Production reliability | Research |
For a production varroa detection system, stability and proven performance outweigh marginal speed improvements. YOLOv8s handles 30–90px objects excellently and has a mature ecosystem.
from ultralytics import YOLO

model = YOLO('yolov8s.pt')
results = model.train(
    data='varroa.yaml',
    epochs=100,
    imgsz=320,
    batch=16,
    patience=20,
    project='varroa_detection',
    name='baseline_v8s_imgsz320'
)
Don't default to imgsz=640 without analyzing native resolutions.
How I discovered that more data isn't always better, and sometimes your best friend accidentally gives you the worst dataset
Date: February 2026
Mission: Train a YOLOv8 model to detect varroa mites on bees
Expected: More data → Better performance
Reality: More data → Performance catastrophe
Coffee consumed: Way too much
This is the complete story of how I went from 93.3% mAP to 79.1% mAP by adding "trusted" datasets, tore my hair out debugging configs for hours, and finally discovered the silent killer: incomplete annotations masquerading as good data. Then fixed it all with a single hyperparameter change.
I had four datasets from Ultralytics, all seemingly legitimate. The numbers told a different story:
| Dataset | Images | Bees | Varroa | Bee:Varroa Ratio | Status |
|---|---|---|---|---|---|
| V1 | 8,680 | 8,726 | 5,149 | 1.69:1 | Good |
| V2 | 13,492 | 13,551 | 5,149 | 2.63:1 | Good |
| V3 | 9,736 | 42,118 | 1,258 | 33.48:1 | Poisoned |
| V4 | 5,144 | 5,235 | 5,851 | 0.89:1 | Label conflict |
V3 had only 0.13 varroa per image compared to V1's 0.59 and V2's 0.38. This wasn't just imbalanced — it was suspiciously low, strongly suggesting systematic under-labeling of varroa in V3.
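These per-dataset figures are cheap to compute and worth checking before any merge. The sketch below recomputes them from the counts in the table above; the `audit` helper is a hypothetical name, and in practice the counts would come from scanning the label files.

```python
datasets = {
    # name: (images, bee instances, varroa instances), copied from the table above
    "V1": (8_680, 8_726, 5_149),
    "V2": (13_492, 13_551, 5_149),
    "V3": (9_736, 42_118, 1_258),
    "V4": (5_144, 5_235, 5_851),
}


def audit(images: int, bees: int, varroa: int) -> dict:
    """Per-dataset sanity metrics: varroa per image and bee:varroa ratio."""
    return {
        "varroa_per_image": round(varroa / images, 2),
        "bee_to_varroa": round(bees / varroa, 2),
    }


for name, counts in datasets.items():
    print(name, audit(*counts))
```

A dataset whose varroa-per-image figure sits far below the others is a candidate for a visual spot-check before it goes anywhere near training.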
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 93.3% mAP, 87.2% varroa detection
Excellent! Everything working beautifully — high confidence, low false positives, stable training curves.
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 89.1% mAP, 80.5% varroa detection
−4.2% mAP, −6.7% varroa — I added 45% more training data and performance decreased. My reaction: "Must be a config problem. Let me try different learning rates..."
Model: YOLO11s, 200 epochs, cls=2.0, mixup=0.15
Dataset: V1 + V2 + V3
Results: 91.5% mAP, 85.3% varroa detection
Better than YOLOv8s with the same data, but still worse than the baseline. 4× more training and a newer architecture couldn't fix bad data.
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 79.1% mAP, 61.0% varroa detection
−14.2% mAP, −26.2% varroa from baseline. 36% false negatives — missing more than 1 in 3 varroa mites!
"Something is very wrong. Time to stop changing configs and actually look at the data."
- V1+V2: 80.6% varroa detected
- V1+V2+V3: 75.8% varroa detected (-4.8%)
- V1+V2+V3+V4: 64.0% varroa detected (-16.6%)

- V1+V2: 758 varroa missed (19%)
- V1+V2+V3: 1,049 varroa missed (24%)
- V1+V2+V3+V4: 1,272 varroa missed (36%)
Why: V3 had thousands of unlabeled varroa. The model learned to call varroa "background" because that's what the data told it to do.
- YOLOv8s, 50 epochs, V1+V2: 93.3% mAP (simple config, 2 datasets)
- YOLO11s, 200 epochs, V1+V2+V3: 91.5% mAP (4x training, better model, 3 datasets)
The more data I added, the worse the model performed. V3 was actively teaching the model the wrong thing — that unlabeled varroa instances are just background.
"More data should always help — unless the new data is actively teaching the model the wrong thing."
Before merging datasets, run these checks: class distribution, instances per image ratio, bounding box size distributions, and a visual spot-check of 50+ random images. Five minutes of data auditing saves days of debugging.
When performance degrades unexpectedly, the cause is usually: 70% data quality issues, 20% implementation bugs, 10% hyperparameter problems. Not the other way around. I wasted hours tweaking configs when the data was the problem all along.
These datasets came from Ultralytics — a highly reputable source. But V3 was likely created for a different purpose (negative mining?) and had systematic under-labeling. Trust, but verify. Always.
- 22k clean images → 93.3% mAP
- 37k mixed images → 79.1% mAP
After removing V3 from the dataset, I made a single hyperparameter change and trained for 200 epochs. The results exceeded all expectations.
Varroa detection improved from 87.2% → 93.6% (+6.4 percentage points)
In production terms: Out of every 100 varroa mites, 93 are now caught instead of 81. That's 12 more per 100 — meaningful for hive health.
Increasing cls from 0.5 to 2.0 made classification errors 4× more expensive during training, forcing the model to learn sharper decision boundaries between bees and varroa. The model became much more confident at distinguishing varroa from background.
Training for 200 epochs allowed the model to fully converge and extract maximum performance from the clean V1+V2 dataset. Smooth, stable learning curves with no overfitting — the model plateaued around epoch 150.
Training exclusively on V1+V2 — with a healthy 2.16:1 bee:varroa ratio — provided consistent, reliable training signals with no conflicting annotations from poisoned sources.
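The effect of the cls change described above can be pictured as a weighted sum of the three loss terms. This is a simplified sketch, not the Ultralytics implementation: the default gains `box=7.5`, `cls=0.5`, and `dfl=1.5` match the documented Ultralytics defaults, while `total_loss` and the raw loss values are hypothetical.

```python
def total_loss(box_loss, cls_loss, dfl_loss, box=7.5, cls=0.5, dfl=1.5):
    """Weighted sum of the three YOLOv8 loss terms (simplified)."""
    return box * box_loss + cls * cls_loss + dfl * dfl_loss


# Made-up raw loss values for a single training step
raw = dict(box_loss=1.0, cls_loss=0.8, dfl_loss=1.0)

default = total_loss(**raw)         # cls=0.5
boosted = total_loss(**raw, cls=2.0)
print(default, boosted)
# The classification term's contribution grows by 2.0 / 0.5 = 4x
print((2.0 * raw["cls_loss"]) / (0.5 * raw["cls_loss"]))
```

Only the classification term changes, so the same classification mistake now moves the total loss four times as far as before.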
| Model | Architecture | Dataset | Config | mAP | Varroa mAP | Status |
|---|---|---|---|---|---|---|
| Final Champion | YOLOv8s | V1+V2 | cls=2.0, 200ep | 96.5% | 93.6% | DEPLOYED |
| Baseline | YOLOv8s | V1+V2 | cls=0.5, 50ep | 93.3% | 87.2% | Good |
| YOLO11s Attempt | YOLO11s | V1+V2+V3 | cls=2.0, 200ep | 91.5% | 85.3% | V3 poisoned |
| 3-Dataset Trial | YOLOv8s | V1+V2+V3 | cls=0.5, 50ep | 89.1% | 80.5% | V3 poisoned |
| 4-Dataset Disaster | YOLOv8s | V1+V2+V3+V4 | cls=0.5, 50ep | 79.1% | 61.0% | Unusable |
"I spent hours tweaking learning rates, batch sizes, architectures, and augmentations — convinced it was a config problem. It wasn't. It was a data problem. Always check your data first. Always."
"That was a pain in the arse."
Systematic evaluation across YOLOv8s hyperparameter configurations and RT-DETR architecture — imgsz=320, 200 epochs, clean 2-dataset vs expanded 124-dataset pool
Before tuning anything, the baseline model's outputs were analyzed to understand what the model was learning and where it was struggling.
Figure 1: Training and Validation Loss Curves (Box, Cls, DFL).
The Results.png plot shows healthy convergence. Both training and validation losses — Box, Class, and DFL — decrease steadily, confirming the model is learning without immediate overfitting.
The Confusion Matrix highlights our primary challenge: a 56% background-to-varroa error rate. The model frequently mistakes bee anatomy or shadows for mites — this is the core problem the later experiments address.
The PR Curve reveals the gap between classes. Bee detection is near-perfect (0.96 mAP), while the Varroa curve (0.63 mAP) drops sharply — the model struggles to maintain accuracy as it tries to find more mites.
Labels.jpg confirms the ~4:1 bee-to-varroa class imbalance visible in the dataset. In object detection, this can cause the model to favor the "easier" majority class (Bees) while neglecting the "harder" minority class (Varroa).
Figure 4: Spatial distribution and instance counts of Bee vs. Varroa labels.
RT-DETR achieves mAP50=0.9635 and mAP50-95=0.6406, outperforming the best YOLOv8s run by +0.018 and +0.033 respectively, with recall up +0.027. The transformer architecture generalises better on classification without needing the cls=0.05 fix.
Best epochs occur at 196–200 across every experiment. Extended training with early stopping is the clear next step — particularly for RT-DETR, whose mAP50-95 curve shows no sign of plateauing.
Reducing classification loss weight from 2.0 to 0.05 consistently improved mAP50-95 and eliminated the rising val cls loss seen in cls=2.0 runs after epoch 50, a clear sign of overfitting on classification.
Mixup augmentation showed no consistent benefit across either dataset scale.
All 2-dataset runs score ~0.94+ mAP50 vs ~0.86–0.89 for 124-dataset runs. The gap points to quality differences or distribution mismatch in the extended dataset pool — consistent with the findings from the Debug Saga.
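The early-stopping idea mentioned in the findings above boils down to tracking the best epoch and giving up after a fixed number of epochs without improvement. The sketch below illustrates that logic on a made-up metric curve; it is not how Ultralytics implements its `patience` option internally, and `best_epoch_and_stop` is a hypothetical helper.

```python
def best_epoch_and_stop(metric_by_epoch, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if patience never ran out."""
    best_epoch, best_value = 0, float("-inf")
    for epoch, value in enumerate(metric_by_epoch, start=1):
        if value > best_value:
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Made-up validation metric per epoch
plateau = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.72]
still_rising = [0.10, 0.20, 0.30]
print(best_epoch_and_stop(plateau, patience=3))
print(best_epoch_and_stop(still_rising, patience=3))
```

When the best epoch keeps landing at the very end of a run, as it did here, the curve looks like the second case and a longer epoch budget is the natural next experiment.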
Figure: all 7 runs plotted over 200 epochs. Panels: mAP50 (validation), mAP50-95 (validation), val classification loss, val box / GIoU loss.
RT-DETR uses GIoU loss for bounding box regression while YOLOv8s uses CIoU-based box loss. The two losses operate on different scales (~0.28–0.41 vs ~1.02–1.24) and are not directly comparable — treat them as independent convergence indicators per architecture.
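For readers unfamiliar with GIoU, here is a minimal sketch of the metric for two axis-aligned boxes. It assumes boxes are `(x1, y1, x2, y2)` with positive area, and it is a plain-Python illustration rather than the batched tensor version a training framework would use. The loss is typically taken as 1 − GIoU.

```python
def giou(a, b):
    """Generalized IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection rectangle (zero area if the boxes do not overlap)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


print(giou((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

Unlike plain IoU, GIoU goes negative for boxes that do not overlap, which gives the optimizer a useful gradient even when a prediction misses its target entirely.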
[Table: all runs sorted by mAP50-95, best first.]
[Table: head-to-head, RT-DETR · 2ds versus YOLOv8s cls=0.05 · 2ds (best YOLO configuration). Columns: RT-DETR · 2ds, YOLOv8s cls0.05 · 2ds, Δ (RT-DETR − YOLO).]
YOLOv8s × 4 runs · YOLOv8L × 1 run · RT-DETR-L × 2 runs — complete epoch-by-epoch results
At 1000 epochs, YOLOv8L surpasses all previous runs on every metric. Most strikingly, background→varroa false positives collapsed from 237 (YOLOv8s 2000ep) to just 98 — a 59% single-run improvement and 79% reduction from the starting point. The larger model's extra capacity is clearly doing real work on the hardest class.
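The two percentages quoted above follow directly from the false-positive counts. A small sketch of the arithmetic, with `percent_reduction` as a hypothetical helper name:

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


print(round(percent_reduction(237, 98)))  # vs YOLOv8s 2000ep
print(round(percent_reduction(458, 98)))  # vs the starting point
```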
Both YOLOv8s 2000ep and YOLOv8L 1000ep show val DFL loss rising after an early minimum (~ep300 for L), while train DFL continues declining. This appears to be a fundamental characteristic of this dataset with the YOLO architecture — the model learns bounding box distributions on training images that don't perfectly generalise. YOLOv8L shows the same signature but with much better overall metrics, suggesting the extra capacity compensates via better feature learning elsewhere.
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7919 | 4 | 19 |
| Pred: varroa | 1 | 3181 | 391 |
| Pred: bg | 24 | 458 | — |
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7921 | — | 23 |
| Pred: varroa | — | 3356 | 257 |
| Pred: bg | 17 | 287 | — |
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7919 | — | 11 |
| Pred: varroa | — | 3338 | 237 |
| Pred: bg | 25 | 305 | — |
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7921 | 4 | 5 |
| Pred: varroa | — | 3328 | 98 |
| Pred: bg | 23 | 311 | — |
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7182 | — | 54 |
| Pred: varroa | — | 3329 | 171 |
| Pred: bg | 762 | 314 | — |
| | True: bee | True: varroa | True: bg |
|---|---|---|---|
| Pred: bee | 7930 | 4 | 375 |
| Pred: varroa | 2 | 3427 | 555 |
| Pred: bg | 12 | 212 | — |
| Model / Run | Epochs | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall | BG→Varroa FP | Status |
|---|---|---|---|---|---|---|---|
| YOLOv8s 200ep | 200 | 0.9450 | 0.6079 | 0.9527 | 0.9166 | 458 | |
| YOLOv8s 400ep | 400 | 0.9549 | 0.6735 | 0.9624 | 0.9358 | 357 | |
| YOLOv8s 1000ep | 1000 | 0.9648 | 0.7496 | 0.9733 | 0.9466 | 257 | |
| YOLOv8s 2000ep | 2000 | 0.9620 (peak 0.9658) | 0.7734 | 0.9680 | 0.9477 | 237 | |
| YOLOv8L 1000ep | 1000 | 0.9708 (peak 0.9733) | 0.8314 | 0.9791 | 0.9501 | 98 | |
| RT-DETR-L 200ep | 200 | 0.9635 | 0.6406 | 0.9481 | 0.9436 | 555 | |
| RT-DETR-L 1000ep | 1000 | 0.9636 (peak 0.9686) | 0.7893 | 0.9534 | 0.9287 | 171 (bee miss: 762) | |
RT-DETR uses GIoU + L1 + classification losses vs YOLOv8's box + cls + dfl. Loss values are not cross-comparable between architectures.