🐝 Dataset Preparation: The Journey to Clean Data

Building a production-ready varroa detection model from multiple datasets — the challenges, solutions, and lessons learned

  • Total Images: 37,052
  • Training Images: 27,529
  • Validation Images: 9,523
  • Varroa Instances: 17,407

The Dataset Challenge

Training an accurate object detection model requires high-quality, consistently labeled data. My journey began with four datasets from different sources, each with its own labeling conventions.

| Dataset | Train | Validation | Test | Label Format |
|---|---|---|---|---|
| V1 | 6,075 | 1,737 | 868 | Bee=0, Varroa=1 ✓ |
| V2 | 8,217 | 1,867 | 3,408 | Bee=0, Varroa=1 ✓ |
| V3 | 8,093 | 1,175 | 468 | Bee=0, Varroa=1 ✓ |
| V4 | 5,144 | 0 | 0 | Varroa=0, Bee=1 |

The Label Conflict

Dataset V4 used the opposite labeling convention. It labeled varroa as class 0 and bees as class 1, while all other datasets and my standard convention used class 0 for bees and class 1 for varroa. This required a label conversion step before merging.

Dataset Merging Flow

graph TB
    A[Dataset Collection] --> B[V1: Bee=0, Varroa=1]
    A --> C[V2: Bee=0, Varroa=1]
    A --> D[V3: Bee=0, Varroa=1]
    A --> E[V4: Varroa=0, Bee=1]
    B --> F[Master Dataset]
    C --> F
    D --> F
    E --> G[Label Conversion Required]
    G --> F
    F --> H[27,529 train images]
    F --> I[9,523 val images]
    style E fill:#4d1f1f
    style G fill:#4d3d1f
    style F fill:#1f4d2e
    style A fill:#1a1f3a
    style B fill:#1a2d3a
    style C fill:#1a2d3a
    style D fill:#1a2d3a
    style H fill:#1a3d2a
    style I fill:#1a3d2a

The Label Conversion Bug

My first attempt to fix the V4 labels had a critical bug:

mapping = {'0': '1', '3': '0'}  # Wrong approach: the second key should be '1'
for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    old_id = parts[0]
    if old_id in mapping:
        parts[0] = mapping[old_id]
    # Bug: bee lines (class '1') never match the mapping, so they stay
    # at class 1 and collide with the varroa lines just remapped 0 → 1

What Went Wrong

Before: 5,851 bees (class 0) + 5,235 varroa (class 1)

After bug: 0 bees + 11,086 varroa — everything merged into class 1!

The script converted class 0 → class 1, but the second mapping entry never matched the bee label (1), so nothing was converted the other direction: bees stayed at class 1 and merged with the freshly remapped varroa. Bee labels were completely lost.

The Fix: Atomic Swap

# Atomic per-line swap: each label line is remapped exactly once
new_lines = []
for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    old_id = parts[0]
    if old_id == '0':
        parts[0] = '1'   # varroa (was 0) → 1
    elif old_id == '1':
        parts[0] = '0'   # bee (was 1) → 0
    new_lines.append(' '.join(parts) + '\n')

Result

All 5,144 V4 files correctly converted with bee=0, varroa=1. Both classes preserved.
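A quick way to confirm a conversion like this is to re-count the class ids across the rewritten label files and compare against the expected totals. A minimal sketch (the `count_classes` helper and its inputs are illustrative, not from the original script):

```python
from collections import Counter

def count_classes(label_texts):
    """Count YOLO class ids across label-file contents (one string per file)."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            parts = line.split()
            if parts:
                counts[parts[0]] += 1
    return counts
```

After conversion, the counts for V4 should show both classes present, with the pre-conversion bee and varroa totals swapped between ids 0 and 1.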

Class Distribution Analysis

| Split | Bees (Class 0) | Varroa (Class 1) | Ratio |
|---|---|---|---|
| Training | 55,287 | 13,397 | 4.13:1 |
| Validation | 14,343 | 4,010 | 3.58:1 |
| Total | 69,630 | 17,407 | 4.00:1 |

Analysis

The 4:1 bee-to-varroa ratio is a significant but realistic class imbalance — in real hive frames, bees genuinely outnumber varroa mites. The strategy was to start with default training and apply class weighting only if varroa detection performance proved insufficient.
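If class weighting does become necessary, the Bee=0.24 figure that appears later in the training-strategy diagram can be derived as an inverse-frequency ratio from the counts above. A sketch of that arithmetic (how the weights are actually applied depends on the training framework):

```python
# Training-set instance counts from the table above
bees, varroa = 55_287, 13_397

# Inverse-frequency weights, normalized so the rarer class (varroa) gets 1.0
class_weights = {
    'varroa': 1.0,
    'bee': varroa / bees,  # ≈ 0.24: down-weight the majority class
}
```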

Class Distribution

%%{init: {'theme':'dark'}}%%
pie title Training Set Class Distribution
    "Bees (55,287)" : 55287
    "Varroa (13,397)" : 13397

Image Size & IMGSZ Analysis

Understanding native image resolution is critical for choosing the right imgsz parameter. Upscaling low-resolution images adds only interpolated pixels, not real detail, and slows training without improving accuracy.
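One way to produce the short-side percentiles reported below is with NumPy over the (width, height) pairs of every image (gathered, for example, via PIL's `Image.open(...).size`). A minimal sketch; the helper name is illustrative:

```python
import numpy as np

def short_side_percentiles(sizes, qs=(10, 25, 50, 75, 90, 95)):
    """Percentiles of the shorter image side, given (width, height) pairs."""
    short = np.array(sizes).min(axis=1)  # shorter side of each image
    return {q: float(np.percentile(short, q)) for q in qs}
```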

Dataset Overview

  • Total Images: 37,053
  • Mean Aspect Ratio: 0.70

Short Side Distribution (px)

| Percentile | P10 | P25 | P50 | P75 | P90 | P95 |
|---|---|---|---|---|---|---|
| Short side | 160 | 160 | 160 | 160 | 512 | 512 |

Resolution Distribution

| Band | Images | Share |
|---|---|---|
| Very Low (<160px) | 1,407 | 3.8% |
| Low (160–319px) | 27,900 | 75.3% |
| Medium (320–639px) | 6,963 | 18.8% |
| High (640–1279px) | 709 | 1.9% |
| Very High (≥1280px) | 74 | 0.2% |

Image size vs quality: 50% of images are <200px, but most are close-up macro bee shots with clear varroa. Small size does not mean inferior quality!

Aspect Ratio & Orientation

  • Mean: 0.70, Median: 0.57, Range: 0.5–3.15
  • Portrait (h>w): 28,543 images (77%)
  • Square: 6,874 images (19%)
  • Landscape (w>h): 1,705 images (5%)
  • Extreme ratios (<0.5 or >2.0): 9 images (0.0%)

IMGSZ Impact Analysis

| imgsz | Downscaled | Upscaled | Exact fit |
|---|---|---|---|
| 640 | 2.0% | 97.9% | 0.1% |
| 768 | 1.6% | 98.4% | 0.0% |
| 1024 | 1.5% | 98.5% | 0.0% |

Key Finding: 75% of Images are 160px

Using common YOLO defaults like imgsz=640 would upscale most images by 4×, creating interpolation artifacts and teaching the model from synthetic detail rather than real image content.

Upscaling Impact by imgsz Setting

| imgsz | Upscaled Images | Upscale Factor | Assessment |
|---|---|---|---|
| 160 | 0% | 1.0× | No upscaling |
| 320 | 79.8% | 2.0× | Manageable |
| 512 | 79.8% | 3.2× | Significant upscaling |
| 640 | 97.9% | 4.0× | Excessive upscaling |
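The upscaled-share and typical upscale-factor figures above can be reproduced from the same short-side data. A sketch under the assumption that the resize targets the shorter side (function name and inputs are illustrative):

```python
import numpy as np

def upscale_stats(short_sides, imgsz):
    """Share of images that would be upscaled at a given imgsz, plus the
    resize factor applied to the typical (median) image."""
    s = np.asarray(short_sides, dtype=float)
    share_upscaled = float((s < imgsz).mean())
    median_factor = imgsz / float(np.median(s))
    return share_upscaled, median_factor
```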
Baseline Test Configuration: IMGSZ = 320

  • Minimal upscaling preserves original image quality
  • Varroa median box size of 47px sits well above the 16px detection threshold
  • Faster training and inference vs larger sizes
  • Note: 79.8% of images will still be upscaled 2×, so source image quality matters

Varroa Bounding Box Analysis

Small objects (<16px) are notoriously difficult for YOLO to detect. This analysis determines how large varroa bounding boxes appear at different imgsz settings.

| imgsz | Median Box Size | 10th Percentile | Boxes <16px | Assessment |
|---|---|---|---|---|
| 160 | 23.5px | 15.0px | 13.3% | Too many tiny boxes |
| 320 | 46.9px | 30.0px | 0.7% | Optimal balance |
| 512 | 75.1px | 48.0px | 0.3% | Good but more upscaling |
| 640 | 93.9px | 60.0px | 0.1% | Excessive upscaling |
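Box sizes like these can be estimated directly from normalized YOLO labels: width and height are fractions of the image, so multiplying their geometric mean by imgsz approximates the rendered box side. A rough sketch, assuming a square resize; the helper name and class id argument are illustrative:

```python
def box_sizes_px(label_lines, imgsz, cls='1'):
    """Approximate rendered box side (px) at a given imgsz from normalized
    YOLO label lines: 'class x_center y_center width height'."""
    sizes = []
    for line in label_lines:
        parts = line.split()
        if parts and parts[0] == cls:
            w, h = float(parts[3]), float(parts[4])
            sizes.append((w * h) ** 0.5 * imgsz)  # geometric mean, scaled
    return sizes
```

Running this over all varroa labels at imgsz=160, 320, 512, and 640 yields the median and percentile figures compared in the table.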

imgsz Decision Tree

graph TB
    A[imgsz Selection] --> B{Varroa Box Analysis}
    B --> C[imgsz=160: 13.3% boxes < 16px]
    B --> D[imgsz=320: 0.7% boxes < 16px]
    B --> E[imgsz=640: 0.1% boxes < 16px]
    C --> F[High small-object risk]
    D --> G[Minimal risk + Less upscaling]
    E --> H[4x upscaling for 80% of images]
    G --> I[CHOSEN: imgsz=320]
    style I fill:#1f4d2e,stroke:#28a745,stroke-width:3px
    style G fill:#1a3d2a
    style F fill:#4d1f1f
    style H fill:#4d3d1f
    style A fill:#1a1f3a
    style B fill:#1a2d3a
    style C fill:#1a2430
    style D fill:#1a2430
    style E fill:#1a2430

Model & Training Setup

Model Selection: YOLOv8s

| Factor | YOLOv8s | YOLOv11s |
|---|---|---|
| Maturity | Battle-tested | Relatively new |
| Documentation | Extensive | Growing |
| Small Object Detection | Proven excellent | Similar/slightly better |
| Inference Speed | Fast | 10–15% faster |
| mAP Performance | Excellent | 1–2% better |
| Best For | Production reliability | Research |

Decision: YOLOv8s

For a production varroa detection system, stability and proven performance outweigh marginal speed improvements. YOLOv8s handles 30–90px objects excellently and has a mature ecosystem.

Baseline-First Training Strategy

graph TD
    A[Start: Baseline Training] --> B[Run 1: Minimal Config<br/>YOLOv8s, imgsz=320, batch=16]
    B --> C{Evaluate Results}
    C --> D[Check Overall mAP50]
    C --> E[Check Varroa Recall]
    C --> F[Check Bee Precision]
    D --> G{mAP > 0.7?}
    E --> H{Varroa Recall > 0.7?}
    F --> I{Bee False Positives?}
    G -->|No| J[Adjust Learning Rate / Increase Epochs]
    H -->|No| K[Add Class Weights: Varroa=1.0, Bee=0.24]
    I -->|Yes| L[Try imgsz=512 for More Detail]
    G -->|Yes| M[Success!]
    H -->|Yes| M
    I -->|No| M
    M --> N[Deploy & Monitor]
    style A fill:#1f4d2e,stroke:#28a745
    style M fill:#1f4d2e,stroke:#28a745
    style B fill:#1a2d3a
    style K fill:#4d3d1f
    style L fill:#4d3d1f
    style N fill:#1a3d2a

Baseline Configuration

from ultralytics import YOLO

model = YOLO('yolov8s.pt')
results = model.train(
    data='varroa.yaml',
    epochs=100,
    imgsz=320,
    batch=16,
    patience=20,
    project='varroa_detection',
    name='baseline_v8s_imgsz320'
)

Why Start Simple?

  • Establishes a performance baseline without confounding variables
  • Reveals whether the data quality itself is sufficient
  • Makes it clear what each subsequent optimization actually contributes
  • Prevents premature tuning of the wrong metrics

Key Takeaways

Critical Success Factors

  • Label standardization is essential: Even simple mapping errors can destroy entire datasets
  • Analyze before training: Understanding image sizes prevents wasted compute
  • Match imgsz to data reality: Don't blindly use defaults — 640px isn't always optimal
  • Small object detection has thresholds: Keep bounding boxes above 16px when possible
  • Baseline first, optimize later: Systematic experimentation beats premature tuning

Common Pitfalls to Avoid

  • Sequential label mapping without collision checking
  • Assuming all datasets follow the same convention
  • Using imgsz=640 without analyzing native resolutions
  • Ignoring class imbalance until after training fails
  • Not validating label conversions with sample visualizations