🐝 The Great Dataset Debug Saga

How I discovered that more data isn't always better, and sometimes your best friend (Ultralytics) accidentally gives you the worst dataset

Date: February 2026
Mission: Train a YOLOv8 model to detect varroa mites on bees
Expected: More data → Better performance
Reality: More data → Performance catastrophe
Coffee consumed: Way too much

This is the complete story of how I went from 93.3% mAP to 79.1% mAP by adding "trusted" datasets, tore my hair out debugging configs for hours, and finally discovered the silent killer: incomplete annotations masquerading as good data. Then fixed it all with a single hyperparameter change.

The Starting Point: The Datasets

I had four datasets from Ultralytics, all seemingly legitimate. The numbers told a different story:

Dataset   Images    Bees     Varroa   Bee:Varroa   Status
V1         8,680     8,726    5,149     1.69:1     Good
V2        13,492    13,551    5,149     2.63:1     Good
V3         9,736    42,118    1,258    33.48:1     Poisoned
V4         5,144     5,235    5,851     0.89:1     Label conflict

The Smoking Gun

V3 had only 0.13 varroa per image compared to V1's 0.59 and V2's 0.38. This wasn't just imbalanced — it was suspiciously low, strongly suggesting systematic under-labeling of varroa in V3.
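Those per-image rates fall straight out of the table; a five-line sanity check makes the outlier obvious (the 0.2 flag threshold is my arbitrary cutoff, not a standard):

```python
# Per-image varroa rate as a quick under-labeling signal.
# Image and instance counts are taken from the dataset table above.
datasets = {
    "V1": {"images": 8_680, "varroa": 5_149},
    "V2": {"images": 13_492, "varroa": 5_149},
    "V3": {"images": 9_736, "varroa": 1_258},
}

for name, d in datasets.items():
    rate = d["varroa"] / d["images"]
    flag = "  <-- suspiciously low" if rate < 0.2 else ""
    print(f"{name}: {rate:.2f} varroa/image{flag}")
```

V1 and V2 land at 0.59 and 0.38; V3 sits at 0.13, less than a quarter of V1's rate.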

The Training Journey

Stage 1: The Baseline (V1 + V2)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 93.3% mAP, 87.2% varroa detection

Excellent! Everything working beautifully — high confidence, low false positives, stable training curves.

Stage 2: Adding V3 (V1 + V2 + V3)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 89.1% mAP, 80.5% varroa detection

−4.2% mAP, −6.7% varroa — I added 45% more training data and performance decreased. My reaction: "Must be a config problem. Let me try different learning rates..."

Stage 3: Trying YOLO11s (Desperation)

Model: YOLO11s, 200 epochs, cls=2.0, mixup=0.15

Dataset: V1 + V2 + V3

Results: 91.5% mAP, 85.3% varroa detection

Better than YOLOv8s with the same data, but still worse than the baseline. 4× more training and a newer architecture couldn't fix bad data.

Stage 4: The Disaster (V1 + V2 + V3 + V4)

Model: YOLOv8s, 50 epochs, cls=0.5

Results: 79.1% mAP, 61.0% varroa detection

−14.2% mAP, −26.2% varroa from baseline. 36% false negatives — missing more than 1 in 3 varroa mites!

"Something is very wrong. Time to stop changing configs and actually look at the data."

The Investigation

V1 + V2 BASELINE
  Overall mAP:       93.3%
  Varroa mAP:        87.2%
  Bee:Varroa ratio:  2.16:1
  Training data:     22k images

V1 + V2 + V3
  Overall mAP:       89.1%
  Varroa mAP:        80.5%
  Bee:Varroa ratio:  5.57:1
  Training data:     32k images (+45%)

V1 + V2 + V3 + V4
  Overall mAP:       79.1%
  Varroa mAP:        61.0%
  Bee:Varroa ratio:  4.0:1
  Training data:     37k images (+67%)

Evidence #1: Varroa Detection Collapse

V1+V2:           80.6% varroa detected
V1+V2+V3:        75.8% varroa detected  (−4.8%)
V1+V2+V3+V4:     64.0% varroa detected  (−16.6%)

Evidence #2: False Negatives Skyrocketed

V1+V2:           758 varroa missed (19%)
V1+V2+V3:      1,049 varroa missed (24%)
V1+V2+V3+V4:   1,272 varroa missed (36%)

Why: V3 had thousands of unlabeled varroa. The model learned to call varroa "background" because that's what the data told it to do.

Evidence #3: Better Architecture Couldn't Save Bad Data

YOLOv8s, 50 epochs,  V1+V2:     93.3% mAP  (simple config, 2 datasets)
YOLO11s, 200 epochs, V1+V2+V3:  91.5% mAP  (4× training, newer model, 3 datasets)

The Pattern

The more data I added, the worse the model performed. V3 was actively teaching the model the wrong thing — that unlabeled varroa instances are just background.

"More data should always help — unless the new data is actively teaching the model the wrong thing."

Lessons Learned

Lesson 1: Always Audit New Datasets

Before merging datasets, run these checks: class distribution, instances per image ratio, bounding box size distributions, and a visual spot-check of 50+ random images. Five minutes of data auditing saves days of debugging.
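The first two checks are a few lines of code for YOLO-format labels (one `.txt` per image, `class cx cy w h` per line). A minimal sketch, demoed on throwaway files; the directory layout and class ids (0=bee, 1=varroa) are assumptions about your own setup:

```python
# Minimal YOLO-label audit: class counts and instances-per-image rates.
from collections import Counter
from pathlib import Path
import tempfile

def audit_labels(label_dir: Path) -> dict:
    counts = Counter()
    n_files = 0
    for txt in label_dir.glob("*.txt"):
        n_files += 1
        for line in txt.read_text().splitlines():
            if line.strip():
                counts[int(line.split()[0])] += 1  # first field = class id
    return {
        "images": n_files,
        "per_class": dict(counts),
        "per_image": {c: n / max(n_files, 1) for c, n in counts.items()},
    }

# Demo on a throwaway directory with two fake label files.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "img1.txt").write_text("0 .5 .5 .2 .2\n1 .3 .3 .05 .05\n")
    (d / "img2.txt").write_text("0 .4 .6 .2 .2\n")
    report = audit_labels(d)
print(report)
```

Run this per dataset before merging; a class whose per-image rate is far below its peers (like V3's varroa) is the red flag.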

Lesson 2: Trust Your Gut When Performance Drops

When performance degrades unexpectedly, the cause is usually: 70% data quality issues, 20% implementation bugs, 10% hyperparameter problems. Not the other way around. I wasted hours tweaking configs when the data was the problem all along.

Lesson 3: Even Trusted Sources Can Have Bad Data

These datasets came from Ultralytics — a highly reputable source. But V3 was likely created for a different purpose (negative mining?) and had systematic under-labeling. Trust, but verify. Always.

Lesson 4: Quality Beats Quantity

22k clean images   → 93.3% mAP 
37k mixed images   → 79.1% mAP 

The Breakthrough: cls=2.0 Optimization

After removing V3 from the dataset, I made a single hyperparameter change and trained for 200 epochs. The results exceeded all expectations.
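For reference, the winning run boils down to a one-line Ultralytics CLI call (the dataset YAML name is my placeholder; everything else is the actual config):

```shell
# Reproduce the champion run. `cls` is the classification-loss gain;
# the Ultralytics default is 0.5. "varroa.yaml" is an assumed filename.
yolo detect train model=yolov8s.pt data=varroa.yaml epochs=200 cls=2.0
```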

Varroa detection improved from 87.2% → 93.6% (+6.4 percentage points)

The Single Change That Made All the Difference

Baseline Model
  Config:       cls=0.5
  Epochs:       50
  Overall mAP:  93.3%
  Varroa mAP:   87.2%

Optimized Model (CHAMPION)
  Config:       cls=2.0
  Epochs:       200
  Overall mAP:  96.5%
  Varroa mAP:   93.6%

Performance Gains

  +3.2%    Overall mAP        93.3% → 96.5%
  +6.4%    Varroa mAP         87.2% → 93.6%
  +12.2%   Detection rate     80.6% → 92.8%
  −65.7%   False negatives    758 → 260 missed
  −46.5%   False positives    475 → 254 errors
  99.4%    Bee detection      maintained

Visual Performance Comparison

  Overall mAP:   baseline (cls=0.5) 93.3%  →  optimized (cls=2.0) 96.5%
  Varroa mAP:    baseline 87.2%            →  optimized 93.6%
  Detection:     baseline 81/100 varroa    →  optimized 93/100 varroa

The Impact: 498 More Varroa Detected

In production terms: false negatives on the validation set fell from 758 to 260, which is 498 more varroa caught. Out of every 100 varroa mites, 93 are now detected instead of 81. That's 12 more per 100, a meaningful difference for hive health.

What Made the Difference

Classification Loss Weight: 0.5 → 2.0

Increasing cls from 0.5 to 2.0 made classification errors 4× more expensive during training, forcing the model to learn sharper decision boundaries between bees and varroa. The model became much more confident at distinguishing varroa from background.
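As a back-of-envelope illustration (a deliberately simplified composite loss, not Ultralytics' actual implementation): the detection loss is roughly a gain-weighted sum of box, classification, and DFL terms, so raising the cls gain from 0.5 to 2.0 makes each classification error exactly 4× more costly relative to everything else.

```python
# Simplified composite detection loss: gain-weighted sum of box/cls/dfl terms.
# Gains mirror the Ultralytics defaults (box=7.5, cls=0.5, dfl=1.5);
# the raw per-term loss values below are invented for illustration.
def total_loss(box, cls, dfl, box_gain=7.5, cls_gain=0.5, dfl_gain=1.5):
    return box_gain * box + cls_gain * cls + dfl_gain * dfl

raw = dict(box=0.8, cls=1.2, dfl=0.4)
baseline = total_loss(**raw)                 # cls weighted at 0.5
optimized = total_loss(**raw, cls_gain=2.0)  # cls weighted at 2.0

# Relative cost of a classification error: 2.0 / 0.5 = 4x
print(baseline, optimized, 2.0 / 0.5)
```

Only the classification term changes between the two sums; the box and DFL terms are untouched, which is why localization quality didn't regress.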

Extended Training: 50 → 200 Epochs

Training for 200 epochs allowed the model to fully converge and extract maximum performance from the clean V1+V2 dataset. Smooth, stable learning curves with no overfitting — the model plateaued around epoch 150.

Clean Data Foundation

Training exclusively on V1+V2 — with a healthy 2.16:1 bee:varroa ratio — provided consistent, reliable training signals with no conflicting annotations from poisoned sources.
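A merged V1+V2 set plugs into Ultralytics through a standard dataset YAML; a minimal sketch, where the paths and filename are assumptions about your own layout:

```yaml
# varroa.yaml -- merged V1+V2 split (paths are placeholders)
path: datasets/v1v2_merged
train: images/train
val: images/val
names:
  0: bee
  1: varroa
```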

Complete Model Evolution

Model               Architecture  Dataset       Config           mAP    Varroa mAP  Status
Final Champion      YOLOv8s       V1+V2         cls=2.0, 200 ep  96.5%  93.6%       DEPLOYED
Baseline            YOLOv8s       V1+V2         cls=0.5, 50 ep   93.3%  87.2%       Good
YOLO11s Attempt     YOLO11s       V1+V2+V3      cls=2.0, 200 ep  91.5%  85.3%       V3 poisoned
3-Dataset Trial     YOLOv8s       V1+V2+V3      cls=0.5, 50 ep   89.1%  80.5%       V3 poisoned
4-Dataset Disaster  YOLOv8s       V1+V2+V3+V4   cls=0.5, 50 ep   79.1%  61.0%       Unusable

🐝 The Moral of the Story

"I spent hours tweaking learning rates, batch sizes, architectures, and augmentations — convinced it was a config problem. It wasn't. It was a data problem. Always check your data first. Always."
"That was a pain in the arse."