🐝 Dataset Preparation: The Journey to Clean Data

Building a production-ready varroa detection model from multiple datasets — the challenges, solutions, and lessons learned

  • Total Images: 37,052
  • Training Images: 27,529
  • Validation Images: 9,523
  • Varroa Instances: 17,407

The Dataset Challenge

Training an accurate object detection model requires high-quality, consistently labeled data. My journey began with four datasets from different sources, each with its own labeling conventions.

| Dataset | Train | Validation | Test | Label Format |
|---|---|---|---|---|
| V1 | 6,075 | 1,737 | 868 | Bee=0, Varroa=1 ✓ |
| V2 | 8,217 | 1,867 | 3,408 | Bee=0, Varroa=1 ✓ |
| V3 | 8,093 | 1,175 | 468 | Bee=0, Varroa=1 ✓ |
| V4 | 5,144 | 0 | 0 | Varroa=0, Bee=1 |

The Label Conflict

Dataset V4 used the opposite labeling convention. It labeled varroa as class 0 and bees as class 1, while all other datasets and my standard convention used class 0 for bees and class 1 for varroa. This required a label conversion step before merging.

Dataset Merging Flow

graph TB
    A[Dataset Collection] --> B[V1: Bee=0, Varroa=1]
    A --> C[V2: Bee=0, Varroa=1]
    A --> D[V3: Bee=0, Varroa=1]
    A --> E[V4: Varroa=0, Bee=1]
    B --> F[Master Dataset]
    C --> F
    D --> F
    E --> G[Label Conversion Required]
    G --> F
    F --> H[27,529 train images]
    F --> I[9,523 val images]
    style E fill:#4d1f1f
    style G fill:#4d3d1f
    style F fill:#1f4d2e
    style A fill:#1a1f3a
    style B fill:#1a2d3a
    style C fill:#1a2d3a
    style D fill:#1a2d3a
    style H fill:#1a3d2a
    style I fill:#1a3d2a

The Label Conversion Bug

My first attempt to fix the V4 labels had a critical bug:

mapping = {'0': '1', '3': '0'}  # Wrong approach: the second key should be '1'
for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    old_id = parts[0]
    if old_id in mapping:
        parts[0] = mapping[old_id]
    # Bug: bee lines (class '1') never match the mapping, so they stay
    # at class 1 and collide with the varroa lines just remapped 0 → 1

What Went Wrong

Before: 5,851 bees (class 0) + 5,235 varroa (class 1)

After bug: 0 bees + 11,086 varroa — everything merged into class 1!

The script converted class 0 → class 1, but the second mapping entry never matched the bee label (1), so nothing was converted the other direction: bees stayed at class 1 and merged with the freshly remapped varroa. Bee labels were completely lost.

The Fix: Atomic Swap

# Atomic per-line swap: each label line is remapped exactly once
new_lines = []
for line in lines:
    parts = line.strip().split()
    if not parts:
        continue
    old_id = parts[0]
    if old_id == '0':
        parts[0] = '1'   # varroa (was 0) → 1
    elif old_id == '1':
        parts[0] = '0'   # bee (was 1) → 0
    new_lines.append(' '.join(parts) + '\n')

Result

All 5,144 V4 files correctly converted with bee=0, varroa=1. Both classes preserved.
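A quick way to confirm a conversion like this is to re-count the class ids across the rewritten label files and compare against the expected totals. A minimal sketch (the `count_classes` helper and its inputs are illustrative, not from the original script):

```python
from collections import Counter

def count_classes(label_texts):
    """Count YOLO class ids across label-file contents (one string per file)."""
    counts = Counter()
    for text in label_texts:
        for line in text.splitlines():
            parts = line.split()
            if parts:
                counts[parts[0]] += 1
    return counts
```

After conversion, the counts for V4 should show both classes present, with the pre-conversion bee and varroa totals swapped between ids 0 and 1.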

Class Distribution Analysis

| Split | Bees (Class 0) | Varroa (Class 1) | Ratio |
|---|---|---|---|
| Training | 55,287 | 13,397 | 4.13:1 |
| Validation | 14,343 | 4,010 | 3.58:1 |
| Total | 69,630 | 17,407 | 4.00:1 |

Analysis

The 4:1 bee-to-varroa ratio is a significant but realistic class imbalance — in real hive frames, bees genuinely outnumber varroa mites. The strategy was to start with default training and apply class weighting only if varroa detection performance proved insufficient.
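If class weighting does become necessary, the Bee=0.24 figure that appears later in the training-strategy diagram can be derived as an inverse-frequency ratio from the counts above. A sketch of that arithmetic (how the weights are actually applied depends on the training framework):

```python
# Training-set instance counts from the table above
bees, varroa = 55_287, 13_397

# Inverse-frequency weights, normalized so the rarer class (varroa) gets 1.0
class_weights = {
    'varroa': 1.0,
    'bee': varroa / bees,  # ≈ 0.24: down-weight the majority class
}
```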

Class Distribution

%%{init: {'theme':'dark'}}%%
pie title Training Set Class Distribution
    "Bees (55,287)" : 55287
    "Varroa (13,397)" : 13397

Image Size & IMGSZ Analysis

Understanding native image resolution is critical for choosing the right imgsz parameter. Upscaling low-resolution images adds only interpolated pixels, not real detail, and slows training without improving accuracy.
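One way to produce the short-side percentiles reported below is with NumPy over the (width, height) pairs of every image (gathered, for example, via PIL's `Image.open(...).size`). A minimal sketch; the helper name is illustrative:

```python
import numpy as np

def short_side_percentiles(sizes, qs=(10, 25, 50, 75, 90, 95)):
    """Percentiles of the shorter image side, given (width, height) pairs."""
    short = np.array(sizes).min(axis=1)  # shorter side of each image
    return {q: float(np.percentile(short, q)) for q in qs}
```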

Dataset Overview

  • Total Images: 37,053
  • Mean Aspect Ratio: 0.70

Short Side Distribution (px)

| Percentile | P10 | P25 | P50 | P75 | P90 | P95 |
|---|---|---|---|---|---|---|
| Short side | 160 | 160 | 160 | 160 | 512 | 512 |

Resolution Distribution

| Band | Images | Share |
|---|---|---|
| Very Low (<160px) | 1,407 | 3.8% |
| Low (160–319px) | 27,900 | 75.3% |
| Medium (320–639px) | 6,963 | 18.8% |
| High (640–1279px) | 709 | 1.9% |
| Very High (≥1280px) | 74 | 0.2% |

Image size vs quality: 50% of images are <200px, but most are close-up macro bee shots with clear varroa. Small size does not mean inferior quality!

Aspect Ratio & Orientation

  • Mean: 0.70, Median: 0.57, Range: 0.5–3.15
  • Portrait (h>w): 28,543 images (77%)
  • Square: 6,874 images (19%)
  • Landscape (w>h): 1,705 images (5%)
  • Extreme ratios (<0.5 or >2.0): 9 images (0.0%)

IMGSZ Impact Analysis

| imgsz | Downscaled | Upscaled | Exact fit |
|---|---|---|---|
| 640 | 2.0% | 97.9% | 0.1% |
| 768 | 1.6% | 98.4% | 0.0% |
| 1024 | 1.5% | 98.5% | 0.0% |

Key Finding: 75% of Images are 160px

Using common YOLO defaults like imgsz=640 would upscale most images by 4×, creating interpolation artifacts and teaching the model from synthetic detail rather than real image content.

Upscaling Impact by imgsz Setting

| imgsz | Upscaled Images | Upscale Factor | Assessment |
|---|---|---|---|
| 160 | 0% | 1.0× | No upscaling |
| 320 | 79.8% | 2.0× | Manageable |
| 512 | 79.8% | 3.2× | Significant upscaling |
| 640 | 97.9% | 4.0× | Excessive upscaling |
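The upscaled-share and typical upscale-factor figures above can be reproduced from the same short-side data. A sketch under the assumption that the resize targets the shorter side (function name and inputs are illustrative):

```python
import numpy as np

def upscale_stats(short_sides, imgsz):
    """Share of images that would be upscaled at a given imgsz, plus the
    resize factor applied to the typical (median) image."""
    s = np.asarray(short_sides, dtype=float)
    share_upscaled = float((s < imgsz).mean())
    median_factor = imgsz / float(np.median(s))
    return share_upscaled, median_factor
```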
Baseline Test Configuration: IMGSZ = 320

  • Minimal upscaling preserves original image quality
  • Varroa median box size of 47px sits well above the 16px detection threshold
  • Faster training and inference vs larger sizes
  • Note: 79.8% of images will still be upscaled 2×, so source image quality matters

Varroa Bounding Box Analysis

Small objects (<16px) are notoriously difficult for YOLO to detect. This analysis determines how large varroa bounding boxes appear at different imgsz settings.

| imgsz | Median Box Size | 10th Percentile | Boxes <16px | Assessment |
|---|---|---|---|---|
| 160 | 23.5px | 15.0px | 13.3% | Too many tiny boxes |
| 320 | 46.9px | 30.0px | 0.7% | Optimal balance |
| 512 | 75.1px | 48.0px | 0.3% | Good but more upscaling |
| 640 | 93.9px | 60.0px | 0.1% | Excessive upscaling |
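Box sizes like these can be estimated directly from normalized YOLO labels: width and height are fractions of the image, so multiplying their geometric mean by imgsz approximates the rendered box side. A rough sketch, assuming a square resize; the helper name and class id argument are illustrative:

```python
def box_sizes_px(label_lines, imgsz, cls='1'):
    """Approximate rendered box side (px) at a given imgsz from normalized
    YOLO label lines: 'class x_center y_center width height'."""
    sizes = []
    for line in label_lines:
        parts = line.split()
        if parts and parts[0] == cls:
            w, h = float(parts[3]), float(parts[4])
            sizes.append((w * h) ** 0.5 * imgsz)  # geometric mean, scaled
    return sizes
```

Running this over all varroa labels at imgsz=160, 320, 512, and 640 yields the median and percentile figures compared in the table.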

imgsz Decision Tree

graph TB
    A[imgsz Selection] --> B{Varroa Box Analysis}
    B --> C[imgsz=160: 13.3% boxes < 16px]
    B --> D[imgsz=320: 0.7% boxes < 16px]
    B --> E[imgsz=640: 0.1% boxes < 16px]
    C --> F[High small-object risk]
    D --> G[Minimal risk + Less upscaling]
    E --> H[4x upscaling for 80% of images]
    G --> I[CHOSEN: imgsz=320]
    style I fill:#1f4d2e,stroke:#28a745,stroke-width:3px
    style G fill:#1a3d2a
    style F fill:#4d1f1f
    style H fill:#4d3d1f
    style A fill:#1a1f3a
    style B fill:#1a2d3a
    style C fill:#1a2430
    style D fill:#1a2430
    style E fill:#1a2430

Model & Training Setup

Model Selection: YOLOv8s

| Factor | YOLOv8s | YOLOv11s |
|---|---|---|
| Maturity | Battle-tested | Relatively new |
| Documentation | Extensive | Growing |
| Small Object Detection | Proven excellent | Similar/slightly better |
| Inference Speed | Fast | 10–15% faster |
| mAP Performance | Excellent | 1–2% better |
| Best For | Production reliability | Research |

Decision: YOLOv8s

For a production varroa detection system, stability and proven performance outweigh marginal speed improvements. YOLOv8s handles 30–90px objects excellently and has a mature ecosystem.

Baseline-First Training Strategy

graph TD
    A[Start: Baseline Training] --> B[Run 1: Minimal Config<br/>YOLOv8s, imgsz=320, batch=16]
    B --> C{Evaluate Results}
    C --> D[Check Overall mAP50]
    C --> E[Check Varroa Recall]
    C --> F[Check Bee Precision]
    D --> G{mAP > 0.7?}
    E --> H{Varroa Recall > 0.7?}
    F --> I{Bee False Positives?}
    G -->|No| J[Adjust Learning Rate / Increase Epochs]
    H -->|No| K[Add Class Weights: Varroa=1.0, Bee=0.24]
    I -->|Yes| L[Try imgsz=512 for More Detail]
    G -->|Yes| M[Success!]
    H -->|Yes| M
    I -->|No| M
    M --> N[Deploy & Monitor]
    style A fill:#1f4d2e,stroke:#28a745
    style M fill:#1f4d2e,stroke:#28a745
    style B fill:#1a2d3a
    style K fill:#4d3d1f
    style L fill:#4d3d1f
    style N fill:#1a3d2a

Baseline Configuration

from ultralytics import YOLO

model = YOLO('yolov8s.pt')
results = model.train(
    data='varroa.yaml',
    epochs=100,
    imgsz=320,
    batch=16,
    patience=20,
    project='varroa_detection',
    name='baseline_v8s_imgsz320'
)

Why Start Simple?

  • Establishes a performance baseline without confounding variables
  • Reveals whether the data quality itself is sufficient
  • Makes it clear what each subsequent optimization actually contributes
  • Prevents premature tuning of the wrong metrics

Key Takeaways

Critical Success Factors

  • Label standardization is essential: Even simple mapping errors can destroy entire datasets
  • Analyze before training: Understanding image sizes prevents wasted compute
  • Match imgsz to data reality: Don't blindly use defaults — 640px isn't always optimal
  • Small object detection has thresholds: Keep bounding boxes above 16px when possible
  • Baseline first, optimize later: Systematic experimentation beats premature tuning

Common Pitfalls to Avoid

  • Sequential label mapping without collision checking
  • Assuming all datasets follow the same convention
  • Using imgsz=640 without analyzing native resolutions
  • Ignoring class imbalance until after training fails
  • Not validating label conversions with sample visualizations