🐝 The Great Dataset Debug Saga

How I discovered that more data isn't always better, and sometimes your best friend (Ultralytics) accidentally gives you the worst dataset

Date: February 2026
Mission: Train a YOLOv8 model to detect varroa mites on bees
Expected: More data → Better performance
Reality: More data → Performance catastrophe
Coffee consumed: Way too much

This is the complete story of how I went from 93.3% mAP to 79.1% mAP by adding "trusted" datasets, tore my hair out debugging configs for hours, and finally discovered the silent killer: incomplete annotations masquerading as good data. Then fixed it all with a single hyperparameter change.

The Starting Point: The Datasets

I had four datasets from Ultralytics, all seemingly legitimate. The numbers told a different story:

Dataset   Images    Bees     Varroa   Bee:Varroa   Status
V1         8,680     8,726    5,149     1.69:1     Good
V2        13,492    13,551    5,149     2.63:1     Good
V3         9,736    42,118    1,258    33.48:1     Poisoned
V4         5,144     5,235    5,851     0.89:1     Label conflict

The Smoking Gun

V3 had only 0.13 varroa per image compared to V1's 0.59 and V2's 0.38. This wasn't just imbalanced — it was suspiciously low, strongly suggesting systematic under-labeling of varroa in V3.
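Those per-image rates fall straight out of the table; a five-line sanity check makes the outlier obvious (the 0.2 flag threshold is my arbitrary cutoff, not a standard):

```python
# Per-image varroa rate as a quick under-labeling signal.
# Image and instance counts are taken from the dataset table above.
datasets = {
    "V1": {"images": 8_680, "varroa": 5_149},
    "V2": {"images": 13_492, "varroa": 5_149},
    "V3": {"images": 9_736, "varroa": 1_258},
}

for name, d in datasets.items():
    rate = d["varroa"] / d["images"]
    flag = "  <-- suspiciously low" if rate < 0.2 else ""
    print(f"{name}: {rate:.2f} varroa/image{flag}")
```

V1 and V2 land at 0.59 and 0.38; V3 sits at 0.13, less than a quarter of V1's rate.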

The Training Journey

Stage 1: The Baseline (V1 + V2)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 93.3% mAP, 87.2% varroa detection

Excellent! Everything working beautifully — high confidence, low false positives, stable training curves.

Stage 2: Adding V3 (V1 + V2 + V3)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 89.1% mAP, 80.5% varroa detection

−4.2% mAP, −6.7% varroa — I added 45% more training data and performance decreased. My reaction: "Must be a config problem. Let me try different learning rates..."

Stage 3: Trying YOLO11s (Desperation)

Model: YOLO11s, 200 epochs, cls=2.0, mixup=0.15

Dataset: V1 + V2 + V3

Results: 91.5% mAP, 85.3% varroa detection

Better than YOLOv8s with the same data, but still worse than the baseline. 4× more training and a newer architecture couldn't fix bad data.

Stage 4: The Disaster (V1 + V2 + V3 + V4)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 79.1% mAP, 61.0% varroa detection

−14.2% mAP, −26.2% varroa from baseline. 36% false negatives — missing more than 1 in 3 varroa mites!

"Something is very wrong. Time to stop changing configs and actually look at the data."

The Investigation

V1 + V2 BASELINE
  Overall mAP:       93.3%
  Varroa mAP:        87.2%
  Bee:Varroa ratio:  2.16:1
  Training data:     22k images

V1 + V2 + V3
  Overall mAP:       89.1%
  Varroa mAP:        80.5%
  Bee:Varroa ratio:  5.57:1
  Training data:     32k images (+45%)

V1 + V2 + V3 + V4
  Overall mAP:       79.1%
  Varroa mAP:        61.0%
  Bee:Varroa ratio:  4.0:1
  Training data:     37k images (+67%)

Evidence #1: Varroa Detection Collapse

V1+V2:           80.6% varroa detected
V1+V2+V3:        75.8% varroa detected  (−4.8%)
V1+V2+V3+V4:     64.0% varroa detected  (−16.6%)

Evidence #2: False Negatives Skyrocketed

V1+V2:           758 varroa missed (19%)
V1+V2+V3:      1,049 varroa missed (24%)
V1+V2+V3+V4:   1,272 varroa missed (36%)

Why: V3 had thousands of unlabeled varroa. The model learned to call varroa "background" because that's what the data told it to do.

Evidence #3: Better Architecture Couldn't Save Bad Data

YOLOv8s, 50 epochs,  V1+V2:     93.3% mAP  (simple config, 2 datasets)
YOLO11s, 200 epochs, V1+V2+V3:  91.5% mAP  (4× training, newer model, 3 datasets)

The Pattern

The more data I added, the worse the model performed. V3 was actively teaching the model the wrong thing — that unlabeled varroa instances are just background.

"More data should always help — unless the new data is actively teaching the model the wrong thing."

Lessons Learned

Lesson 1: Always Audit New Datasets

Before merging datasets, run these checks: class distribution, instances per image ratio, bounding box size distributions, and a visual spot-check of 50+ random images. Five minutes of data auditing saves days of debugging.
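The first two checks are a few lines of code for YOLO-format labels (one `.txt` per image, `class cx cy w h` per line). A minimal sketch, demoed on throwaway files; the directory layout and class ids (0=bee, 1=varroa) are assumptions about your own setup:

```python
# Minimal YOLO-label audit: class counts and instances-per-image rates.
from collections import Counter
from pathlib import Path
import tempfile

def audit_labels(label_dir: Path) -> dict:
    counts = Counter()
    n_files = 0
    for txt in label_dir.glob("*.txt"):
        n_files += 1
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1  # first field = class id
    return {
        "images": n_files,
        "per_class": dict(counts),
        "per_image": {c: n / max(n_files, 1) for c, n in counts.items()},
    }

# Demo on a throwaway directory with two fake label files.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "img1.txt").write_text("0 .5 .5 .2 .2\n1 .3 .3 .05 .05\n")
    (d / "img2.txt").write_text("0 .4 .6 .2 .2\n")
    report = audit_labels(d)
print(report)
```

Run this per dataset before merging; a class whose per-image rate is far below its peers (like V3's varroa) is the red flag.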

Lesson 2: Trust Your Gut When Performance Drops

When performance degrades unexpectedly, the cause is usually: 70% data quality issues, 20% implementation bugs, 10% hyperparameter problems. Not the other way around. I wasted hours tweaking configs when the data was the problem all along.

Lesson 3: Even Trusted Sources Can Have Bad Data

These datasets came from Ultralytics — a highly reputable source. But V3 was likely created for a different purpose (negative mining?) and had systematic under-labeling. Trust, but verify. Always.

Lesson 4: Quality Beats Quantity

22k clean images   → 93.3% mAP 
37k mixed images   → 79.1% mAP 

The Breakthrough: cls=2.0 Optimization

After removing V3 from the dataset, I made a single hyperparameter change and trained for 200 epochs. The results exceeded all expectations.
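For reference, the winning run boils down to a one-line Ultralytics CLI call (the dataset YAML name is my placeholder; everything else is the actual config):

```shell
# Reproduce the champion run. `cls` is the classification-loss gain;
# the Ultralytics default is 0.5. "varroa.yaml" is an assumed filename.
yolo detect train model=yolov8s.pt data=varroa.yaml epochs=200 cls=2.0
```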

Varroa detection improved from 87.2% → 93.6% (+6.4 percentage points)

The Single Change That Made All the Difference

Baseline Model
  Config:       cls=0.5
  Epochs:       50
  Overall mAP:  93.3%
  Varroa mAP:   87.2%

Optimized Model (CHAMPION)
  Config:       cls=2.0
  Epochs:       200
  Overall mAP:  96.5%
  Varroa mAP:   93.6%

Performance Gains

  +3.2%    Overall mAP        93.3% → 96.5%
  +6.4%    Varroa mAP         87.2% → 93.6%
  +12.2%   Detection rate     80.6% → 92.8%
  −65.7%   False negatives    758 → 260 missed
  −46.5%   False positives    475 → 254 errors
  99.4%    Bee detection      maintained

Visual Performance Comparison

  Overall mAP:   baseline (cls=0.5) 93.3%  →  optimized (cls=2.0) 96.5%
  Varroa mAP:    baseline 87.2%            →  optimized 93.6%
  Detection:     baseline 81/100 varroa    →  optimized 93/100 varroa

The Impact: 498 More Varroa Detected

In production terms: false negatives on the validation set fell from 758 to 260, which is 498 more varroa caught. Out of every 100 varroa mites, 93 are now detected instead of 81. That's 12 more per 100, a meaningful difference for hive health.

What Made the Difference

Classification Loss Weight: 0.5 → 2.0

Increasing cls from 0.5 to 2.0 made classification errors 4× more expensive during training, forcing the model to learn sharper decision boundaries between bees and varroa. The model became much more confident at distinguishing varroa from background.
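As a back-of-envelope illustration (a deliberately simplified composite loss, not Ultralytics' actual implementation): the detection loss is roughly a gain-weighted sum of box, classification, and DFL terms, so raising the cls gain from 0.5 to 2.0 makes each classification error exactly 4× more costly relative to everything else.

```python
# Simplified composite detection loss: gain-weighted sum of box/cls/dfl terms.
# Gains mirror the Ultralytics defaults (box=7.5, cls=0.5, dfl=1.5);
# the raw per-term loss values below are invented for illustration.
def total_loss(box, cls, dfl, box_gain=7.5, cls_gain=0.5, dfl_gain=1.5):
    return box_gain * box + cls_gain * cls + dfl_gain * dfl

raw = dict(box=0.8, cls=1.2, dfl=0.4)
baseline = total_loss(**raw)                 # cls weighted at 0.5
optimized = total_loss(**raw, cls_gain=2.0)  # cls weighted at 2.0

# Relative cost of a classification error: 2.0 / 0.5 = 4x
print(baseline, optimized, 2.0 / 0.5)
```

Only the classification term changes between the two sums; the box and DFL terms are untouched, which is why localization quality didn't regress.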

Extended Training: 50 → 200 Epochs

Training for 200 epochs allowed the model to fully converge and extract maximum performance from the clean V1+V2 dataset. Smooth, stable learning curves with no overfitting — the model plateaued around epoch 150.

Clean Data Foundation

Training exclusively on V1+V2 — with a healthy 2.16:1 bee:varroa ratio — provided consistent, reliable training signals with no conflicting annotations from poisoned sources.
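A merged V1+V2 set plugs into Ultralytics through a standard dataset YAML; a minimal sketch, where the paths and filename are assumptions about your own layout:

```yaml
# varroa.yaml -- merged V1+V2 split (paths are placeholders)
path: datasets/v1v2_merged
train: images/train
val: images/val
names:
  0: bee
  1: varroa
```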

Complete Model Evolution

Model               Architecture  Dataset       Config           mAP    Varroa mAP  Status
Final Champion      YOLOv8s       V1+V2         cls=2.0, 200 ep  96.5%  93.6%       DEPLOYED
Baseline            YOLOv8s       V1+V2         cls=0.5, 50 ep   93.3%  87.2%       Good
YOLO11s Attempt     YOLO11s       V1+V2+V3      cls=2.0, 200 ep  91.5%  85.3%       V3 poisoned
3-Dataset Trial     YOLOv8s       V1+V2+V3      cls=0.5, 50 ep   89.1%  80.5%       V3 poisoned
4-Dataset Disaster  YOLOv8s       V1+V2+V3+V4   cls=0.5, 50 ep   79.1%  61.0%       Unusable

🐝 The Moral of the Story

"I spent hours tweaking learning rates, batch sizes, architectures, and augmentations — convinced it was a config problem. It wasn't. It was a data problem. Always check your data first. Always."
"That was a pain in the arse."