Date: February 2026
Mission: Train a YOLOv8 model to detect varroa mites on bees
Expected: More data → Better performance
Reality: More data → Performance catastrophe
Coffee consumed: Way too much
This is the complete story of how I went from 93.3% mAP to 79.1% mAP by adding "trusted" datasets, tore my hair out debugging configs for hours, and finally discovered the silent killer: incomplete annotations masquerading as good data. Then fixed it all with a single hyperparameter change.
The Starting Point: The Datasets
I had four datasets from Ultralytics, all seemingly legitimate. The numbers told a different story:
| Dataset | Images | Bees | Varroa | Bee:Varroa Ratio | Status |
|---|---|---|---|---|---|
| V1 | 8,680 | 8,726 | 5,149 | 1.69:1 | Good |
| V2 | 13,492 | 13,551 | 5,149 | 2.63:1 | Good |
| V3 | 9,736 | 42,118 | 1,258 | 33.48:1 | Poisoned |
| V4 | 5,144 | 5,235 | 5,851 | 0.89:1 | Label conflict |
The Smoking Gun
V3 had only 0.13 varroa per image compared to V1's 0.59 and V2's 0.38. This wasn't just imbalanced — it was suspiciously low, strongly suggesting systematic under-labeling of varroa in V3.
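The density numbers fall straight out of the table above; a few lines of Python reproduce them (counts copied from the table, nothing else assumed):

```python
# Per-dataset annotation density, from the image/instance counts above.
datasets = {
    "V1": {"images": 8_680, "bees": 8_726, "varroa": 5_149},
    "V2": {"images": 13_492, "bees": 13_551, "varroa": 5_149},
    "V3": {"images": 9_736, "bees": 42_118, "varroa": 1_258},
    "V4": {"images": 5_144, "bees": 5_235, "varroa": 5_851},
}

for name, d in datasets.items():
    varroa_per_img = d["varroa"] / d["images"]   # V3 comes out at ~0.13
    ratio = d["bees"] / d["varroa"]              # V3 comes out at ~33.5:1
    print(f"{name}: {varroa_per_img:.2f} varroa/image, {ratio:.2f}:1 bee:varroa")
```

Two lines of arithmetic would have flagged V3 before a single GPU-hour was spent.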
The Training Journey
Stage 1: The Baseline (V1 + V2)
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 93.3% mAP, 87.2% varroa detection
Excellent! Everything working beautifully — high confidence, low false positives, stable training curves.
Stage 2: Adding V3 (V1 + V2 + V3)
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 89.1% mAP, 80.5% varroa detection
−4.2 mAP points, −6.7 varroa points — I added 45% more training data and performance decreased. My reaction: "Must be a config problem. Let me try different learning rates..."
Stage 3: Trying YOLO11s (Desperation)
Model: YOLO11s, 200 epochs, cls=2.0, mixup=0.15
Dataset: V1 + V2 + V3
Results: 91.5% mAP, 85.3% varroa detection
Better than YOLOv8s with the same data, but still worse than the baseline. 4× more training and a newer architecture couldn't fix bad data.
Stage 4: The Disaster (V1 + V2 + V3 + V4)
Model: YOLOv8s, 50 epochs, cls=0.5
Results: 79.1% mAP, 61.0% varroa detection
−14.2 mAP points, −26.2 varroa points from baseline. A 36% false-negative rate — missing more than 1 in 3 varroa mites!
"Something is very wrong. Time to stop changing configs and actually look at the data."
The Investigation
| Configuration | Varroa detected | Varroa missed |
|---|---|---|
| V1+V2 (baseline) | 80.6% | 758 (19%) |
| V1+V2+V3 | 75.8% (−4.8) | 1,049 (24%) |
| V1+V2+V3+V4 | 64.0% (−16.6) | 1,272 (36%) |
Why: V3 had thousands of unlabeled varroa. The model learned to call varroa "background" because that's what the data told it to do.
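One cheap way to surface this kind of under-labeling is to list label files that contain several bee boxes but zero varroa boxes, then visually inspect those images. A sketch (the helper name and the class IDs `bee=0, varroa=1` are assumptions — check your `data.yaml`):

```python
from pathlib import Path

# Assumed class IDs -- verify against your dataset's data.yaml.
BEE, VARROA = 0, 1

def suspect_files(label_dir, min_bees=3):
    """Yield YOLO-format label files with >= min_bees bee boxes and no varroa.

    Each label line is 'class cx cy w h'; one .txt file per image.
    """
    for path in sorted(Path(label_dir).glob("*.txt")):
        classes = [
            int(line.split()[0])
            for line in path.read_text().splitlines()
            if line.strip()
        ]
        if classes.count(BEE) >= min_bees and VARROA not in classes:
            yield path
```

Files this flags aren't necessarily wrong — some frames genuinely have no mites — but in a dataset averaging 0.13 varroa per image, a random sample of them makes the systematic gaps obvious fast.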
- YOLOv8s, 50 epochs, V1+V2: 93.3% mAP (simple config, 2 datasets)
- YOLO11s, 200 epochs, V1+V2+V3: 91.5% mAP (4× training, newer model, 3 datasets)
The Pattern
The more data I added, the worse the model performed. V3 was actively teaching the model the wrong thing — that unlabeled varroa instances are just background.
"More data should always help — unless the new data is actively teaching the model the wrong thing."
Lessons Learned
Lesson 1: Always Audit New Datasets
Before merging datasets, run these checks: class distribution, instances per image ratio, bounding box size distributions, and a visual spot-check of 50+ random images. Five minutes of data auditing saves days of debugging.
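The first two checks need nothing more than the label directory. A minimal audit sketch for YOLO-format labels (function name is mine; assumes one `.txt` per image with `class cx cy w h` rows):

```python
from collections import Counter
from pathlib import Path

def audit_labels(label_dir):
    """Audit a YOLO-format label directory.

    Returns (per-class instance counts, number of label files,
    instances per image) -- enough to spot a dataset like V3, whose
    varroa density is a fraction of its siblings'.
    """
    counts = Counter()
    files = list(Path(label_dir).glob("*.txt"))
    for path in files:
        for line in path.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1
    total = sum(counts.values())
    density = total / len(files) if files else 0.0
    return counts, len(files), density
```

Run it once per dataset before merging and compare the densities side by side; box-size histograms and the visual spot-check are easy to layer on top.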
Lesson 2: Trust Your Gut When Performance Drops
When performance degrades unexpectedly, the cause is usually: 70% data quality issues, 20% implementation bugs, 10% hyperparameter problems. Not the other way around. I wasted hours tweaking configs when the data was the problem all along.
Lesson 3: Even Trusted Sources Can Have Bad Data
These datasets came from Ultralytics — a highly reputable source. But V3 was likely created for a different purpose (negative mining?) and had systematic under-labeling. Trust, but verify. Always.
Lesson 4: Quality Beats Quantity
- 22k clean images → 93.3% mAP
- 37k mixed images → 79.1% mAP
The Breakthrough: cls=2.0 Optimization
After removing V3 from the dataset, I made a single hyperparameter change and trained for 200 epochs. The results exceeded all expectations.
Varroa detection improved from 87.2% → 93.6% (+6.4 percentage points)
The Single Change That Made All the Difference
| Metric | Baseline (cls=0.5, 50 epochs) | Champion (cls=2.0, 200 epochs) |
|---|---|---|
| mAP | 93.3% | 96.5% |
| Varroa mAP | 87.2% | 93.6% |
The Impact: 498 More Varroa Detected
- Caught 498 varroa the baseline model missed (758 → 260 false negatives)
- Reduced false alarms by 221 (475 → 254 false positives)
- Achieved 93% true positive rate on varroa detection
In production terms: Out of every 100 varroa mites, 93 are now caught instead of 81. That's 12 more per 100 — meaningful for hive health.
What Made the Difference
Classification Loss Weight: 0.5 → 2.0
Increasing cls from 0.5 to 2.0 made classification errors 4× more expensive during training, forcing the model to learn sharper decision boundaries between bees and varroa. The model became much more confident at distinguishing varroa from background.
Extended Training: 50 → 200 Epochs
Training for 200 epochs allowed the model to fully converge and extract maximum performance from the clean V1+V2 dataset. Smooth, stable learning curves with no overfitting — the model plateaued around epoch 150.
Clean Data Foundation
Training exclusively on V1+V2 — with a healthy 2.16:1 bee:varroa ratio — provided consistent, reliable training signals with no conflicting annotations from poisoned sources.
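The champion run boils down to two overrides on Ultralytics defaults. A sketch assuming the standard Ultralytics Python API — `varroa.yaml` is a placeholder for your merged V1+V2 dataset config:

```python
# The two deliberate deviations from defaults; everything else stays stock.
TRAIN_OVERRIDES = {
    "data": "varroa.yaml",  # placeholder: merged V1+V2 dataset config
    "epochs": 200,          # was 50 -- let the model fully converge
    "cls": 2.0,             # was 0.5 -- classification errors cost 4x more
}

def train_champion():
    """Launch the champion config (requires `pip install ultralytics`)."""
    from ultralytics import YOLO
    model = YOLO("yolov8s.pt")  # same small backbone as the baseline
    return model.train(**TRAIN_OVERRIDES)
```

Keeping the overrides in one dict makes the delta from the baseline explicit and easy to diff between experiments.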
Complete Model Evolution
| Model | Architecture | Dataset | Config | mAP | Varroa mAP | Status |
|---|---|---|---|---|---|---|
| Final Champion | YOLOv8s | V1+V2 | cls=2.0, 200ep | 96.5% | 93.6% | DEPLOYED |
| Baseline | YOLOv8s | V1+V2 | cls=0.5, 50ep | 93.3% | 87.2% | Good |
| YOLO11s Attempt | YOLO11s | V1+V2+V3 | cls=2.0, 200ep | 91.5% | 85.3% | V3 poisoned |
| 3-Dataset Trial | YOLOv8s | V1+V2+V3 | cls=0.5, 50ep | 89.1% | 80.5% | V3 poisoned |
| 4-Dataset Disaster | YOLOv8s | V1+V2+V3+V4 | cls=0.5, 50ep | 79.1% | 61.0% | Unusable |
🐝 The Moral of the Story
"I spent hours tweaking learning rates, batch sizes, architectures, and augmentations — convinced it was a config problem. It wasn't. It was a data problem. Always check your data first. Always."
"That was a pain in the arse."